
[os_must_gather] Add OOMKill log#3776

Open
danpawlik wants to merge 1 commit into openstack-k8s-operators:main from danpawlik:add-oom-kill-info

Conversation

@danpawlik (Contributor)

It might happen that a container gets OOMKilled. We would like to be aware of that issue.

@danpawlik danpawlik requested review from a team, arxcruz, averdagu and fmount March 18, 2026 14:29
@openshift-ci (Contributor)

openshift-ci bot commented Mar 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pkomarov for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@danpawlik danpawlik force-pushed the add-oom-kill-info branch 2 times, most recently from 2c71749 to fdad102, on March 18, 2026 14:31
@averdagu (Contributor)

Looks good to me. Did we find any occurrence of this?

@danpawlik (Contributor, Author)

@averdagu in some cases we see that tempest tests cannot pass, and we wonder what the root cause is. Maybe it is an OOMKill, maybe a bad health check? Something to discover. We just want to be more aware of what's going on.

@averdagu (Contributor)

/lgtm

evallesp previously approved these changes Mar 18, 2026
oc get events
--all-namespaces
--sort-by=.lastTimestamp |
grep -i -E 'OOMKill|Killing.*out of memory|Pressure' >
Contributor

(blocking) concern: I'm unsure here, but it seems that if we don't have any OOMKill, this might fail.

Contributor

Good catch, I think we should add ignore_errors on this task.
I've checked on my machine: if grep doesn't find anything, it returns 1.
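
A minimal shell sketch of the failure mode and two ways to tolerate it; the || true guard is a shell-level alternative to Ansible's ignore_errors, and the log name is illustrative:

# grep exits 1 when nothing matches, which fails the task even though
# "no OOMKill events" is the good outcome here.
oc get events --all-namespaces --sort-by=.lastTimestamp |
  grep -i -E 'OOMKill|Killing.*out of memory|Pressure' > oomkill_events.log || true

# Stricter variant: treat "no match" (exit 1) as success, but still
# surface real grep failures (exit 2).
oc get events --all-namespaces --sort-by=.lastTimestamp |
  grep -i -E 'OOMKill|Killing.*out of memory|Pressure' > oomkill_events.log || [ $? -eq 1 ]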

Contributor

Also, I don't understand why we sort by timestamp.
You're interested in OOMKill-related events, so you should perform server-side pre-filtering with something like:

oc get events --all-namespaces --field-selector type=Warning --sort-by=.reason

You can get the JSON and use jq to filter the reason you're interested in, but to me that kind of post-processing should happen later, as an additional task, on the gathered file.
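
A hedged sketch of that split, with illustrative file names: pre-filter server-side during the gather, keep the raw JSON, and filter the reason field afterwards as a separate step:

# Gather step: server-side pre-filtering, raw JSON kept for later analysis.
oc get events --all-namespaces --field-selector type=Warning -o json > warning_events.json

# Post-processing step, run later on the gathered file.
jq '.items[] | select(.reason | test("OOMKill"; "i"))' warning_events.json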

@danpawlik (Contributor, Author) Mar 18, 2026

We don't have pipefail, so it should be fine.
I'm adding ignore_errors in case cluster utilization is high and oc returns an error.

Contributor

Re-opening: afaik redirections are not taken into consideration for pipefail.

Also, I see a @fmount comment here that you might have skipped.
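
A minimal sketch of the exit-status point above (pattern and log name are illustrative): even without pipefail, the pipeline returns grep's status, because grep is the last command and the redirection doesn't change its exit code:

oc get events --all-namespaces | grep -i 'OOMKill' > oomkill.log
echo $?   # 1 when nothing matched, even though oc succeeded and the redirect worked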

@openshift-ci (Contributor)

openshift-ci bot commented Mar 18, 2026

New changes are detected. LGTM label has been removed.

It might happen that a container gets OOMKilled.
We would like to be aware of that issue.

Signed-off-by: Daniel Pawlik <dpawlik@redhat.com>
@fmount (Contributor) left a comment

I'm not against the idea of being aware of any OOMKill that might happen in the cluster, but I think just getting the events is not enough to have a clear idea of the nodes' status.
Instead, consider storing the output of oc describe nodes, which contains more data about Conditions and Capacity and can drive a resolution of the problem.
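
A minimal sketch of that alternative, assuming an illustrative output path rather than the repo's actual layout:

# Node Conditions (MemoryPressure, DiskPressure, ...) and Capacity/Allocatable
# all appear in the describe output.
oc describe nodes > collected/nodes_describe.txt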

- name: Collect pod usage
  ansible.builtin.include_tasks: get_top.yml

- name: Check if there were some OOMKill
Contributor

-1: events are already collected within the must-gather, so unless you're interested in a specific workload (if you give me an example, I might understand better what you have in mind), I think just reviewing the collected nodes that already report memory or disk pressure, or the existing events, might give you an idea of what's going on.
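
A hedged sketch of that review step, run against the already-gathered files rather than the live cluster (the directory name is illustrative):

# Search what must-gather already collected for pressure conditions and
# OOMKill reasons, instead of issuing a new cluster-wide query.
grep -ri -E 'OOMKill|MemoryPressure|DiskPressure' must-gather-output/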

- name: Check if there were some OOMKill
  ansible.builtin.shell: >
    oc get events
    --all-namespaces
Contributor

--all-namespaces in a large environment might take a while, and I'm not sure you need to check which Pod is OOMKilled all over the cluster.
Did you try in a DC or DZ environment?
Perhaps you can consider running the regular openshift must-gather along with the openstack one if you're looking for this kind of data. [1]

[1] e.g.

oc adm must-gather \
   --image-stream=openshift/must-gather \
   --image=quay.io/openstack-k8s-operators/openstack-must-gather:latest

If you get OOMKills all over the cluster, I expect must-gather does not work and you've probably lost access to the OCP API at that point.

