23 changes: 23 additions & 0 deletions roles/os_must_gather/tasks/main.yml
@@ -105,6 +105,17 @@
- name: Collect pod usage
ansible.builtin.include_tasks: get_top.yml

- name: Check if there were some OOMKill
Contributor:

-1: events are already collected within the must-gather, so this seems redundant unless you're interested in a specific workload (if you give me an example I might understand better what you have in mind).
I think just reviewing the collected nodes that already report memory or disk pressure, or the existing events, might give you an idea of what's going on.

ansible.builtin.shell: >
oc get events
--all-namespaces
Contributor:

--all-namespaces in a large environment might take a while, and I'm not sure you need to check which Pod got OOMKilled across the whole cluster.
Did you try in a DC or DZ environment?
Perhaps you can consider running the regular OpenShift must-gather along with the OpenStack one if you're looking for this kind of data. [1]

[1] e.g.

oc adm must-gather \
   --image-stream=openshift/must-gather \
   --image=quay.io/openstack-k8s-operators/openstack-must-gather:latest

If you get OOMKills all over the cluster, I expect must-gather won't work and you've probably lost access to the OCP API at that point.

--sort-by=.lastTimestamp |
grep -i -E 'OOMKill|Killing.*out of memory|Pressure' >
Contributor:

(blocking) concern: I'm unsure here, but it seems that if we don't have any OOMKill, this might fail.

Contributor:

Good catch, I think we should add ignore_errors on this task.
I've checked on my machine: if grep doesn't find anything, it returns 1.
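
An alternative to a blanket `ignore_errors` (a sketch only; the `register` variable name `_oom_events` is hypothetical) would be to accept grep's "no match" exit code explicitly while still failing on real errors, since `grep` returns 1 when nothing matched and 2 or higher on actual failures:

```yaml
- name: Check if there were some OOMKill
  ansible.builtin.shell: >
    oc get events
    --all-namespaces
    --sort-by=.lastTimestamp |
    grep -i -E 'OOMKill|Killing.*out of memory|Pressure' >
    {{ cifmw_os_must_gather_output_log_dir }}/latest/OOMKill-events.log
  register: _oom_events
  # rc 0: matches found; rc 1: no matches (acceptable); rc >= 2: real error
  failed_when: _oom_events.rc > 1
  environment:
    KUBECONFIG: "{{ cifmw_openshift_kubeconfig | default(cifmw_os_must_gather_kubeconfig) }}"
```

Note this still masks `oc` failures: without `pipefail`, the pipeline's exit status is grep's, so an `oc` error would just look like "no matches" (rc 1).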

Contributor:

Also, I don't understand why we sort by timestamp.
You're interested in OOMKill-related events, so you should perform server-side pre-filtering with something like:

oc get events --all-namespaces --field-selector type=Warning --sort-by=.reason

You can get the JSON and use jq to filter the "reason" you're interested in, but to me that kind of post-processing should happen later, as an additional task, on the gathered file.
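
The server-side pre-filter plus later post-processing split could be sketched as two tasks (illustrative only; the `warning-events.json` file name and the jq reason pattern are assumptions, and `jq` must be available where the task runs):

```yaml
- name: Gather Warning events (server-side filtered)
  ansible.builtin.shell: >
    oc get events --all-namespaces
    --field-selector type=Warning
    --sort-by=.reason
    -o json >
    {{ cifmw_os_must_gather_output_log_dir }}/latest/warning-events.json
  environment:
    KUBECONFIG: "{{ cifmw_openshift_kubeconfig | default(cifmw_os_must_gather_kubeconfig) }}"

- name: Extract OOMKill-related events from the gathered file
  ansible.builtin.shell: >
    jq '[.items[] | select(.reason // "" | test("OOMKill"; "i"))]'
    {{ cifmw_os_must_gather_output_log_dir }}/latest/warning-events.json >
    {{ cifmw_os_must_gather_output_log_dir }}/latest/OOMKill-events.json
```

This keeps the cluster query cheap (the API server does the Warning filtering) and makes the reason-level filtering reproducible from the gathered JSON.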

Contributor Author (@danpawlik, Mar 18, 2026):

We don't have pipefail, so that would be good.
I'm adding ignore_errors in case cluster utilization is high and oc gets an error.

Contributor:

Re-opening; afaik redirections are not taken into account by pipefail.

Also, I see a @fmount comment here that you might have skipped.

{{ cifmw_os_must_gather_output_log_dir }}/latest/OOMKill-events.log
environment:
KUBECONFIG: "{{ cifmw_openshift_kubeconfig | default(cifmw_os_must_gather_kubeconfig) }}"
ignore_errors: true # noqa: ignore-errors

rescue:
- name: Openstack-must-gather failure
block:
@@ -123,6 +134,18 @@
--dest-dir {{ ansible_user_dir }}/ci-framework-data/must-gather
--timeout {{ cifmw_os_must_gather_timeout }}
--volume-percentage={{ cifmw_os_must_gather_volume_percentage }}

- name: Check if there were some OOMKill
ansible.builtin.shell: >
oc get events
--all-namespaces
--sort-by=.lastTimestamp |
grep -i -E 'OOMKill|Killing.*out of memory|Pressure' >
{{ cifmw_os_must_gather_output_log_dir }}/latest/OOMKill-events.log
environment:
KUBECONFIG: "{{ cifmw_openshift_kubeconfig | default(cifmw_os_must_gather_kubeconfig) }}"
ignore_errors: true # noqa: ignore-errors

always:
- name: Create oc_inspect log directory
ansible.builtin.file: