[os_must_gather] Add OOMKill log #3776
base: main
@@ -105,6 +105,17 @@
 - name: Collect pod usage
   ansible.builtin.include_tasks: get_top.yml

+- name: Check if there were some OOMKill
+  ansible.builtin.shell: >
+    oc get events
+    --all-namespaces
+    --sort-by=.lastTimestamp |
+    grep -i -E 'OOMKill|Killing.*out of memory|Pressure' >
+    {{ cifmw_os_must_gather_output_log_dir }}/latest/OOMKill-events.log
+  environment:
+    KUBECONFIG: "{{ cifmw_openshift_kubeconfig | default(cifmw_os_must_gather_kubeconfig) }}"
+  ignore_errors: true  # noqa: ignore-errors
+
 rescue:
   - name: Openstack-must-gather failure
     block:

Review comments on this hunk:

Contributor: [1] e.g. oc adm must-gather --image-stream=openshift/must-gather --image=quay.io/openstack-k8s-operators/openstack-must-gather:latest. If you get OOMKill all over the cluster, I expect must-gather does not work and you've probably lost access to the OCP API at that point.

Contributor: (blocking) concern: I'm unsure here, but it seems that if we don't have any OOMKill, this might fail.

Contributor: Good catch, I think we should add ignore_errors on this task.

Contributor: Also, I don't understand why we sort by timestamp. You can get the json and use

Contributor (Author): we don't have pipefail, so it would be good.

Contributor: Re-opening; afaik redirections are not taken into consideration for pipefail. Also, I see a @fmount comment here that you might have skipped.
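The blocking concern above is easy to reproduce locally: grep exits with status 1 when it matches nothing, and without pipefail that status is what the whole pipeline reports, so on a healthy cluster with no OOMKill events the task would fail unless ignore_errors is set. A minimal sketch (the sample event lines are made up):

```shell
# grep exits 1 when no input line matches, which Ansible treats as a
# task failure; simulate a cluster with no OOMKill events.
printf 'Scheduled pod/foo\nStarted pod/foo\n' \
  | grep -i -E 'OOMKill|Killing.*out of memory|Pressure' > /tmp/oomkill-events.log \
  || status=$?
echo "grep exit status: ${status:-0}"   # 1 when nothing matched
```

The redirect still creates the (empty) log file, so the failure is purely grep's no-match status — exactly the case ignore_errors papers over.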
@@ -123,6 +134,18 @@
     --dest-dir {{ ansible_user_dir }}/ci-framework-data/must-gather
     --timeout {{ cifmw_os_must_gather_timeout }}
     --volume-percentage={{ cifmw_os_must_gather_volume_percentage }}

+- name: Check if there were some OOMKill
+  ansible.builtin.shell: >
+    oc get events
+    --all-namespaces
+    --sort-by=.lastTimestamp |
+    grep -i -E 'OOMKill|Killing.*out of memory|Pressure' >
+    {{ cifmw_os_must_gather_output_log_dir }}/latest/OOMKill-events.log
+  environment:
+    KUBECONFIG: "{{ cifmw_openshift_kubeconfig | default(cifmw_os_must_gather_kubeconfig) }}"
+  ignore_errors: true  # noqa: ignore-errors
+
 always:
   - name: Create oc_inspect log directory
     ansible.builtin.file:
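On the pipefail point raised in the review: without pipefail a pipeline reports only the last command's exit status, so a failing oc call would be masked whenever grep itself succeeds; the trailing output redirection does not change which status is reported. A small bash demonstration:

```shell
# Without pipefail, only the LAST command in a pipeline sets $?.
false | true
echo "without pipefail: $?"   # 0 -- the failing first stage is masked

# With pipefail, a failure in any stage propagates to $?.
set -o pipefail
false | true
echo "with pipefail: $?"      # 1
```

In the task above this means an oc failure (e.g. lost API access) would go unnoticed unless pipefail is enabled, while a grep no-match would fail the task either way.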
Contributor: -1: events are already collected within the must-gather, so I don't see the value here unless you're interested in a specific workload (if you give me an example, I might understand better what you have in mind). I think just reviewing the collected nodes that already report memory or disk pressure, or the existing events, might give you an idea of what's going on.
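Following the earlier suggestion to query structured event data instead of grepping formatted output, a server-side filter could look like the sketch below. This is only an illustration, not what the PR implements; the reason value OOMKilling is an assumption that may vary across cluster versions, and the command needs a live cluster with a valid KUBECONFIG.

```shell
# Hypothetical alternative (assumption: events carry reason=OOMKilling):
# filter server-side instead of piping human-readable output to grep,
# and print structured fields via jsonpath.
oc get events --all-namespaces \
  --field-selector reason=OOMKilling \
  -o jsonpath='{range .items[*]}{.lastTimestamp} {.involvedObject.namespace}/{.involvedObject.name} {.message}{"\n"}{end}'
```

A field selector avoids both the sort-by and the grep exit-status problems discussed above, since an empty result is still a successful command.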