OCPBUGS-91825: monitor/haproxy: tolerate install-phase apiserver bounces in all-down test#31326
OCPBUGS-91825: monitor/haproxy: tolerate install-phase apiserver bounces in all-down test#31326mkowalski wants to merge 1 commit into
Conversation
… test During cluster installation, kube-apiserver static pod installers roll through multiple revisions in quick succession. The apiservers may briefly come up between revisions only to go back down, producing multiple short all-backends-down windows that are all part of the same installation phase. Previously, evaluateFullAPIOutages only tolerated the very first all-down window per haproxy instance. Any subsequent window — even seconds later during the same installation — was flagged as a failure. Add a 20-minute sliding grace period after each tolerated install-time window. Windows starting within the grace period are also tolerated (and extend the deadline), while any window starting after the grace period is correctly reported as a real outage. This fixes false-positive regressions observed on both metal-ipi and vsphere platforms in 5.0 nightly jobs where installer revision rollouts cause apiservers to bounce 2-4 times during bootstrap. Signed-off-by: Mateusz Kowalski <mko@redhat.com> Generated-by: AI
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughAdds a 20-minute ChangesInstall-time outage grace period for HAProxy monitor
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/monitortests/network/onpremhaproxy/monitortest.go (1)
435-437: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winUpdate failure text to match new grace-period semantics.
The failure message still says any occurrence after the first is a failure. After this change, occurrences within the sliding install grace period are tolerated, so the output is now misleading during triage.
Suggested fix
- Output: "Haproxy detected all kube-apiserver backends down at the same time after the initial startup window. " + - "The first occurrence for every haproxy instance is expected: when haproxy starts during the installation, all kube-apiservers are down until they come up for the first time. " + - "Any subsequent occurrence means the API was completely unreachable through the on-prem loadbalancer. " + + Output: "Haproxy detected all kube-apiserver backends down at the same time outside the install grace window. " + + "The first occurrence for every haproxy instance is expected: when haproxy starts during installation, all kube-apiservers are down until they come up for the first time. " + + "Additional all-down windows are tolerated only within the sliding install grace period; occurrences after that mean the API was completely unreachable through the on-prem loadbalancer. " + "Look at the onprem-haproxy rows in the intervals chart to see which haproxy instance detected which kube-apiserver as down.",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/network/onpremhaproxy/monitortest.go` around lines 435 - 437, The Output field in the haproxy monitor test contains a failure message that needs to be updated to reflect the new grace-period semantics. The current message states "Any subsequent occurrence means the API was completely unreachable" which is now misleading because occurrences within the sliding install grace period are now tolerated and should not be considered failures. Modify the Output text to clarify that only occurrences after the sliding install grace period has ended are failures, while occurrences during the grace period are expected and acceptable.
🧹 Nitpick comments (1)
pkg/monitortests/network/onpremhaproxy/monitortest_test.go (1)
226-241: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd an explicit grace-deadline boundary case (
start == deadline).Current cases cover clearly-inside and clearly-outside windows, but not the exact boundary. A dedicated case would lock in the intended
Before(strict) semantics and prevent future regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/network/onpremhaproxy/monitortest_test.go` around lines 226 - 241, Add a new test case after the current "outage well after install grace period fails" case that explicitly tests the boundary condition where an outage begins exactly at the grace period deadline (start == deadline). Create a test case with a similar structure to the existing one, but set the haproxyDownInterval start time to occur exactly at the grace period deadline (20 minutes after the install outage ends), with expectFailure set to true to verify the strict "Before" semantics. Include appropriate expectedOutputs with the node name, deadline timestamp, and end time to ensure the boundary behavior is correctly validated and prevent future regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest.go`:
- Around line 415-416: The condition at line 415 using
`window.from.Before(graceDeadline)` excludes the boundary case where a window
starts exactly at the grace deadline, causing false failures at the 20-minute
mark. Change the comparison to be inclusive of the boundary by replacing the
`.Before()` check with a condition that also includes the equal case, such as
checking `!window.from.After(graceDeadline)` which is equivalent to "before or
equal to" the grace deadline. This ensures windows starting exactly at the
deadline are properly handled within the grace period logic.
---
Outside diff comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest.go`:
- Around line 435-437: The Output field in the haproxy monitor test contains a
failure message that needs to be updated to reflect the new grace-period
semantics. The current message states "Any subsequent occurrence means the API
was completely unreachable" which is now misleading because occurrences within
the sliding install grace period are now tolerated and should not be considered
failures. Modify the Output text to clarify that only occurrences after the
sliding install grace period has ended are failures, while occurrences during
the grace period are expected and acceptable.
---
Nitpick comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest_test.go`:
- Around line 226-241: Add a new test case after the current "outage well after
install grace period fails" case that explicitly tests the boundary condition
where an outage begins exactly at the grace period deadline (start == deadline).
Create a test case with a similar structure to the existing one, but set the
haproxyDownInterval start time to occur exactly at the grace period deadline (20
minutes after the install outage ends), with expectFailure set to true to verify
the strict "Before" semantics. Include appropriate expectedOutputs with the node
name, deadline timestamp, and end time to ensure the boundary behavior is
correctly validated and prevent future regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 30e05abf-df56-42db-9117-8956bcad5842
📒 Files selected for processing (2)
pkg/monitortests/network/onpremhaproxy/monitortest.gopkg/monitortests/network/onpremhaproxy/monitortest_test.go
|
Scheduling required tests: |
There was a problem hiding this comment.
Pull request overview
This PR adjusts the on-prem HAProxy monitor test that detects “all kube-apiserver backends down” so it doesn’t falsely fail during cluster installation, when apiservers can bounce repeatedly during static-pod revision rollouts.
Changes:
- Add a 20-minute sliding grace period after the first tolerated “all backends down” window to tolerate chained install-time bounces.
- Update
evaluateFullAPIOutageslogic to extend the grace deadline after each tolerated window. - Add new unit tests covering install-time bounce tolerance and post-grace real outage detection.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
pkg/monitortests/network/onpremhaproxy/monitortest.go |
Implements the sliding install grace period when evaluating full API outage windows. |
pkg/monitortests/network/onpremhaproxy/monitortest_test.go |
Adds unit tests for bounce chains within grace period and failures outside the grace period. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@mkowalski: This pull request references Jira Issue OCPBUGS-91825, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-91825, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified later |
|
@mkowalski: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified later @mkowalski |
|
@mkowalski: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cybertron, mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@mkowalski: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Problem
The
[Monitor:on-prem-haproxy] Haproxy should not encounter all kube apiservers down simultaneouslytest is producing false-positive regressions on both metal-ipi and vsphere platforms in 5.0 nightly jobs (Component Readiness regressions #42568, #42586, #42587).During cluster installation, the kube-apiserver static pod installer rolls through 6-8 revisions across 3 masters in rapid succession. The apiservers briefly come up between revisions, then go back down — producing multiple short all-backends-down windows that are all part of the same installation phase.
The test currently only tolerates the very first all-down window per haproxy instance (
outagesPerNode[node][1:]). Any subsequent window — even seconds later during the same installation — is flagged as a test failure.Evidence from CI analysis
Analyzed 6 Prow job runs across both platforms:
20692096865859051522069028163790311424206912323632234905620688534000904478722069028409551360000Key findings:
Fix
Add a 20-minute sliding grace period after each tolerated install-time window:
Why 20 minutes?
Tests
Added 5 new test cases in
TestEvaluateFullAPIOutages:🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.
Summary by CodeRabbit
Bug Fixes
Tests