CORENET-6572: only report Progressing for active network rollouts #2937
jluhrsen wants to merge 1 commit into openshift:master from
Conversation
@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the
Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (4)
Walkthrough
Removed global MCP-progressing check; per-rendered-MC processing now uses source-based machineconfig helpers and defers pruning until a non-paused pool confirms removal. Persisted per-pod snapshot fields (
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
✨ Finishing Touches
🧪 Generate unit tests (beta)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.11.3)
level=error msg="Running error: context loading failed: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :
	github.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	... [truncated: many more modules with the same vendoring mismatch] ...
	sigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt

	To ignore the vendor directory, use -mod=readonly or -mod=mod.
	To sync the vendor directory, run:
		go mod vendor"
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jluhrsen
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
1700-1736: Please add a cold-start mid-rollout regression case. Current tests validate rollout continuation with existing state and reboot churn without rollout, but they don't cover first-observation behavior when last-seen state is empty and the workload is already at observedGeneration==generation, updated==replicas, unavailable>0. That scenario should be pinned to avoid a false-negative Progressing.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1826-1856
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1700 - 1736, Add a new unit test that covers the cold-start mid-rollout regression: create a Deployment (reuse depB or name it depColdStart) with Status.ObservedGeneration == Generation, Status.UpdatedReplicas == Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty last-seen state (i.e., do not pre-populate any prior status in the status manager). Then call setStatus(t, client, depColdStart), run status.SetFromPods(), and call getStatuses(client, "testing"); finally assert via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing is ConditionTrue (and other expected conditions mirror the other tests). Use the same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses, conditionsInclude) and place the test near the existing rollout tests so it executes in the same context.
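The condition that prompt wants pinned can be sketched as a minimal Go model. The types and helper below are illustrative stand-ins, not the real appsv1 or status-manager APIs (setStatus, SetFromPods, getStatuses are the helpers named in the prompt):

```go
package main

import "fmt"

// deployStatus is a stand-in for the Deployment status fields the review
// comment exercises; a real test would use appsv1.Deployment.
type deployStatus struct {
	Generation          int64
	ObservedGeneration  int64
	Replicas            int32
	UpdatedReplicas     int32
	UnavailableReplicas int32
}

// progressingOnColdStart models the desired first-observation behavior: with
// no prior last-seen state, a workload that is fully updated but still has
// unavailable replicas should be reported as Progressing.
func progressingOnColdStart(hadState bool, s deployStatus) bool {
	if hadState {
		return false // continuation logic is handled elsewhere
	}
	return s.ObservedGeneration == s.Generation &&
		s.UpdatedReplicas == s.Replicas &&
		s.UnavailableReplicas > 0
}

func main() {
	s := deployStatus{Generation: 3, ObservedGeneration: 3, Replicas: 3, UpdatedReplicas: 3, UnavailableReplicas: 1}
	fmt.Println(progressingOnColdStart(false, s)) // prints true
}
```

The real test would drive this through SetFromPods and assert OperatorStatusTypeProgressing via conditionsInclude, as the prompt describes.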
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 93-110: The dsState.RolloutGeneration must be initialized when the
controller first sees a DaemonSet mid-convergence even if
generation==observedGeneration; change the initialization logic in the DaemonSet
handling (variables: dsState, RolloutGeneration, currentRolloutGeneration) so
that when a state entry is new or RolloutGeneration==0 and the DaemonSet has
in-flight convergence (ds.Status.NumberUnavailable > 0 OR
(ds.Status.DesiredNumberScheduled > 0 && ds.Status.UpdatedNumberScheduled <
ds.Status.DesiredNumberScheduled)), you set dsState.RolloutGeneration =
currentRolloutGeneration (respecting the existing status.installComplete guard
if needed); apply the same initialization pattern to the analogous StatefulSet
and Deployment blocks (the corresponding variables and checks around
observedGeneration, UpdatedNumberScheduled/Ready/DesiredNumberScheduled, and
NumberUnavailable/NotReady).
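The initialization condition described above can be sketched as follows; the types are illustrative stand-ins, not the real appsv1.DaemonSetStatus or controller state:

```go
package main

import "fmt"

// dsStatus is a stand-in for the DaemonSet status fields named in the prompt.
type dsStatus struct {
	Generation             int64
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberUnavailable      int32
}

// inFlightConvergence mirrors the condition the comment asks for: the
// DaemonSet is still converging even when generation == observedGeneration.
func inFlightConvergence(s dsStatus) bool {
	return s.NumberUnavailable > 0 ||
		(s.DesiredNumberScheduled > 0 && s.UpdatedNumberScheduled < s.DesiredNumberScheduled)
}

// initRolloutGeneration shows where the fix applies: a new (or zero) state
// entry picks up the current generation when convergence is in flight.
func initRolloutGeneration(stateGen int64, s dsStatus) int64 {
	if stateGen == 0 && inFlightConvergence(s) {
		return s.Generation // currentRolloutGeneration in the real code
	}
	return stateGen
}

func main() {
	s := dsStatus{Generation: 2, ObservedGeneration: 2, DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2}
	fmt.Println(initRolloutGeneration(0, s)) // prints 2
}
```

The same pattern would apply to the analogous StatefulSet and Deployment blocks, per the prompt.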
---
Nitpick comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 1700-1736: Add a new unit test that covers the cold-start
mid-rollout regression: create a Deployment (reuse depB or name it depColdStart)
with Status.ObservedGeneration == Generation, Status.UpdatedReplicas ==
Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty
last-seen state (i.e., do not pre-populate any prior status in the status
manager). Then call setStatus(t, client, depColdStart), run
status.SetFromPods(), and call getStatuses(client, "testing"); finally assert
via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing
is ConditionTrue (and other expected conditions mirror the other tests). Use the
same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses,
conditionsInclude) and place the test near the existing rollout tests so it
executes in the same context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c8f95521-c9ef-41f3-bf7d-f168f710f1c4
📒 Files selected for processing (3)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
da9acf2 to 2e2113f (force-push)
@danwinship, if you have a chance, please check this one out. It looks like the network operator will hit this issue in 90% of the 4.21->4.22 aws-ovn-upgrade jobs, so running it here a few times will be enough to validate the fix at that high level.
/payload-job-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

/retest

2e2113f to bb1965f (force-push)

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c24b550-23ae-11f1-97df-f16d2b83fd79-0
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: The test is asserting co.Status.Versions from a stale
ClusterOperator object after calling getStatuses; update the test to re-fetch or
assign the returned updated ClusterOperator before asserting Versions (e.g., use
the returned oc from getStatuses to set co = oc or call getStatuses again) so
that assertions against Status.Versions use the latest ClusterOperator state;
specifically adjust the places around the getStatuses call and the subsequent
assertions that reference co.Status.Versions (occurrences around getStatuses, co
variable, and Status.Versions) to use the refreshed object.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3654515-9f48-47fd-8ed3-357ead25e962
📒 Files selected for processing (3)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/statusmanager/pod_status.go
sorry about the

I hope it should work now
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e2eeb00-23b2-11f1-9a0c-ae4d6f10dbf5-0

no worries @petr-muller, thanks for kicking it off again

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2429a040-2582-11f1-86c7-0543ceeba5f2-0

/retest
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2c409070-261a-11f1-8f79-8f7632132838-0
danwinship left a comment:
I feel like #2936 was probably closer to correct...
		sourceNames.Insert(source.Name)
	}
	return sourceNames.IsSuperset(machineConfigs)
}
So this is just duplicating the logic of mcutil.AreMachineConfigsRemovedFromPool but removing the check that status.MachineCount == status.UpdatedMachineCount. We should not be duplicating that logic here; it should remain in mcutil.
If it is actually correct that we need to check the machine count here, but not from the place in pkg/network/ovn_kubernetes.go that calls mcutil.AreMachineConfigsRemovedFromPool, then we should fix this by splitting out that check from mcutil.AreMachineConfigsRemovedFromPool. But if we don't want that check here, then I'm not convinced we want that check in pkg/network/ovn_kubernetes.go either.
@danwinship , I've spent some back and forth on this and it's a bit different than your first review now. Will try to respond to each comment though:
we've moved the helpers AreMachineConfigsRenderedOnPoolSource() / AreMachineConfigsRemovedFromPoolSource() to mcutil.
logic should be that we only care whether the CNO MC is part of the pool's rendered source yet, not if it's finished converging on nodes. once it is, we don't need to keep CNO as progressing even in case of generic MCP node churn
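As a rough sketch of that source-only check (illustrative types only; the real helpers live in mcutil and operate on MachineConfigPool status sources):

```go
package main

import "fmt"

// poolStatus is an illustrative stand-in for the pool's rendered-config
// source list; the real code reads MachineConfigPoolStatus.Configuration.Source.
type poolStatus struct {
	SourceNames []string
}

// machineConfigsRenderedOnPoolSource reports whether every CNO machine config
// already appears among the pool's rendered sources, ignoring node
// convergence entirely, which is the point made above.
func machineConfigsRenderedOnPoolSource(pool poolStatus, machineConfigs []string) bool {
	sources := make(map[string]bool, len(pool.SourceNames))
	for _, name := range pool.SourceNames {
		sources[name] = true
	}
	for _, mc := range machineConfigs {
		if !sources[mc] { // equivalent to sourceNames.IsSuperset(machineConfigs)
			return false
		}
	}
	return true
}

func main() {
	pool := poolStatus{SourceNames: []string{"00-worker", "50-ovn-cno"}}
	fmt.Println(machineConfigsRenderedOnPoolSource(pool, []string{"50-ovn-cno"})) // prints true
}
```

Once this returns true, generic MCP node churn no longer keeps CNO Progressing, matching the rationale above.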
		sourceNames.Insert(source.Name)
	}
	return !sourceNames.HasAny(machineConfigs.UnsortedList()...)
}
hope the above reply makes sense.
	default:
		return 1
	}
}
observedGeneration must be less than or equal to generation, so this function never returns observedGeneration. It's just max(generation, 1).
got it. should be good now
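For reference, the simplification danwinship describes reduces the function to this (a sketch, not the actual code):

```go
package main

import "fmt"

// Since observedGeneration <= generation always holds, the branch returning
// observedGeneration is unreachable, and the default clamps to 1, so the
// whole switch is equivalent to max(generation, 1).
func rolloutGeneration(generation int64) int64 {
	if generation > 1 {
		return generation
	}
	return 1
}

func main() {
	fmt.Println(rolloutGeneration(0), rolloutGeneration(5)) // prints 1 5
}
```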
		(ds.Status.DesiredNumberScheduled == 0 || ds.Status.NumberAvailable >= ds.Status.DesiredNumberScheduled) {
		dsState.RolloutGeneration = 0
	}
	dsRolloutActive := dsState.RolloutGeneration != 0
This seems like it must be way more complicated than it needs to be. I don't remember the exact details of how the daemonset controller reports status, but you should not need all of this just to distinguish the cases of "daemonset controller is doing stuff because the DaemonSet was updated" vs "daemonset controller is doing stuff because the nodes were updated".
What does dsState.RolloutGeneration represent? It seems that sometimes it is 0 and sometimes it is 1 and sometimes it is the current ds.Generation. This seems like it's overloaded and trying to represent multiple things at once.
yes, TBH I was not very happy with how this was showing up; it is a little complicated to look at. The newest update puts the work into dedicated helpers with a comment block that hopefully explains it better.
there is no RolloutGeneration anymore, but we now have a RolloutActive bool to tell us whether we already saw a rollout start, set in these new dedicated helpers. Keeping track of that should help with knowing if we are in a real rollout vs reboot churn.
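A hedged sketch of what that RolloutActive bookkeeping could look like (names and conditions here are illustrative, not the exact helper code):

```go
package main

import "fmt"

// dsSnapshot is a stand-in for the DaemonSet status fields involved.
type dsSnapshot struct {
	Generation         int64
	ObservedGeneration int64
	NumberUnavailable  int32
	DesiredScheduled   int32
	NumberAvailable    int32
}

// updateRolloutActive sets the flag when a spec change starts a rollout and
// clears it only once the workload is fully available again. Plain reboot
// churn (unavailable pods with no pending spec change) never flips it on.
func updateRolloutActive(active bool, s dsSnapshot) bool {
	if s.Generation != s.ObservedGeneration {
		return true // new rollout observed
	}
	if active && (s.DesiredScheduled == 0 || s.NumberAvailable >= s.DesiredScheduled) {
		return false // rollout fully available; stop reporting Progressing
	}
	return active
}

func main() {
	churn := dsSnapshot{Generation: 4, ObservedGeneration: 4, NumberUnavailable: 1, DesiredScheduled: 3, NumberAvailable: 2}
	fmt.Println(updateRolloutActive(false, churn)) // prints false
}
```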
	reachedAvailableLevel = false

	dsState, exists := daemonsetStates[dsName]
	if !exists || !reflect.DeepEqual(dsState.LastSeenStatus, ds.Status) {
(with the dsState, exists := daemonsetStates[dsName] moved, it's no longer clear what exists means at this point, so if the code stays like this it would need to be renamed)
agreed. it's changed to hadState here now; hopefully that makes it easier to follow.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3d993720-27f8-11f1-9bbb-e6c2fe987086-0
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
520-520: ⚠️ Potential issue | 🟡 Minor
Refresh co before the Status.Versions assertions. These two calls still discard the updated ClusterOperator, but lines 540 and 576 keep reading co.Status.Versions. That means the test is still validating the object fetched earlier in the method and can miss a regression in version mirroring.
Suggested fix:
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
@@
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
Also applies to: 564-564
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` at line 520, Test reads stale ClusterOperator into variable co then asserts on co.Status.Versions; refresh co from the API before those assertions. After the call to getStatuses (or where the operator update occurs), re-fetch the updated ClusterOperator (e.g., call getStatuses again or perform a fresh GET for the ClusterOperator) and use that refreshed co when asserting on co.Status.Versions so the test validates the latest mirrored versions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Line 41: The InstallComplete field is a plain bool so legacy annotations
deserialize as "missing" => false; change the field type to *bool
(InstallComplete *bool) and update all restore/read sites to treat nil as
"unknown" (fall back to existing ClusterOperator availability or previous
versioned state) before deciding to flip/install-complete; specifically update
the annotation restore logic, any IsInstallComplete checks, and the reconcile
paths that currently assume false (references to InstallComplete usage around
the earlier reconcile check and the bootstrap re-entry logic) to handle nil
safely and only treat explicit false when the pointer is non-nil.
---
Duplicate comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: Test reads stale ClusterOperator into variable co then asserts on
co.Status.Versions; refresh co from the API before those assertions. After the
call to getStatuses (or where the operator update occurs), re-fetch the updated
ClusterOperator (e.g., call getStatuses again or perform a fresh GET for the
ClusterOperator) and use that refreshed co when asserting on co.Status.Versions
so the test validates the latest mirrored versions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 73fd88cf-e693-4789-96a6-299b7c1fb92d
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controller/statusmanager/machineconfig_status.go
727c270 to 075dccd (force-push)
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/41aa3610-27f9-11f1-8f25-d6288f44483f-0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controller/statusmanager/machineconfig_status.go (1)
160-175: ⚠️ Potential issue | 🟠 Major
Wait for every matching pool before pruning removal state. With the new source-only check, line 161 can return true as soon as one pool's rendered source drops the machine config. Lines 163-174 then delete it from both tracking maps, so later pools for the same role are skipped. A cluster with the base worker pool plus a custom worker-derived pool can clear Progressing and remove the persisted annotation before all non-paused pools have removed the config; defer that cleanup until every matching pool reports removal.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/machineconfig_status.go` around lines 160 - 175, The code currently prunes role-level state as soon as AreMachineConfigsRemovedFromPoolSource returns true for a single pool, which can cause premature deletion when other pools for the same role still reference the machineConfig; change the logic so you only delete from status.machineConfigsBeingRemoved and status.renderedMachineConfigs (and call setLastRenderedMachineConfigState) after verifying the machineConfig is removed from every pool for that role. Concretely, replace the single-pool check that uses removed (computed from mcutil.AreMachineConfigsRemovedFromPoolSource(pool.Status, mcSet)) with a check that iterates all pools for the same role (the same collection you use to reconcile pools where the loop provides variable pool) and calls mcutil.AreMachineConfigsRemovedFromPoolSource on each pool.Status/mcSet, setting a removedFromAll flag; only if removedFromAll is true perform the Delete(...) calls and the delete(map) cleanup and invoke status.setLastRenderedMachineConfigState(status.renderedMachineConfigs).
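The removedFromAll logic the prompt asks for can be sketched like this (stand-in types; the real check would call mcutil.AreMachineConfigsRemovedFromPoolSource per pool):

```go
package main

import "fmt"

// pool is an illustrative stand-in for a MachineConfigPool: Paused plus the
// set of machine-config names in its rendered source.
type pool struct {
	Paused      bool
	SourceNames map[string]bool
}

func removedFromPoolSource(p pool, mc string) bool {
	return !p.SourceNames[mc]
}

// removedFromAllPools only reports removal once every non-paused pool for
// the role has dropped the config from its rendered source, so state is not
// pruned while a custom derived pool still references it.
func removedFromAllPools(pools []pool, mc string) bool {
	confirmed := false
	for _, p := range pools {
		if p.Paused {
			continue // a paused pool cannot confirm removal
		}
		if !removedFromPoolSource(p, mc) {
			return false
		}
		confirmed = true
	}
	return confirmed
}

func main() {
	pools := []pool{
		{SourceNames: map[string]bool{}},                    // base pool: already dropped
		{SourceNames: map[string]bool{"50-ovn-cno": true}}, // custom pool: still references it
	}
	fmt.Println(removedFromAllPools(pools, "50-ovn-cno")) // prints false
}
```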
♻️ Duplicate comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
520-542: ⚠️ Potential issue | 🟡 Minor
Reassign co before the Status.Versions checks. co is still the object fetched on line 489. After the calls on lines 520 and 564, lines 540 and 576 assert stale data, so these checks can miss a regression in ClusterOperator version mirroring.
Suggested fix:
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
@@
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
Also applies to: 564-578
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 520 - 542, The Status.Versions assertions are using a stale co (it wasn’t reassigned after calling getStatuses), so change the getStatuses call(s) that currently assign only oc (e.g., "_, oc, err = getStatuses(client, \"testing\")") to capture and reassign co (e.g., "co, oc, err = getStatuses(...)") before the Status.Versions checks; update both occurrences (the checks around the first block and the later block referenced) so the assertions run against the fresh ClusterOperator object returned by getStatuses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pkg/controller/statusmanager/machineconfig_status.go`:
- Around line 160-175: The code currently prunes role-level state as soon as
AreMachineConfigsRemovedFromPoolSource returns true for a single pool, which can
cause premature deletion when other pools for the same role still reference the
machineConfig; change the logic so you only delete from
status.machineConfigsBeingRemoved and status.renderedMachineConfigs (and call
setLastRenderedMachineConfigState) after verifying the machineConfig is removed
from every pool for that role. Concretely, replace the single-pool check that
uses removed (computed from
mcutil.AreMachineConfigsRemovedFromPoolSource(pool.Status, mcSet)) with a check
that iterates all pools for the same role (the same collection you use to
reconcile pools where the loop provides variable pool) and calls
mcutil.AreMachineConfigsRemovedFromPoolSource on each pool.Status/mcSet, setting
a removedFromAll flag; only if removedFromAll is true perform the Delete(...)
calls and the delete(map) cleanup and invoke
status.setLastRenderedMachineConfigState(status.renderedMachineConfigs).
---
Duplicate comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 520-542: The Status.Versions assertions are using a stale co (it
wasn’t reassigned after calling getStatuses), so change the getStatuses call(s)
that currently assign only oc (e.g., "_, oc, err = getStatuses(client,
\"testing\")") to capture and reassign co (e.g., "co, oc, err =
getStatuses(...)") before the Status.Versions checks; update both occurrences
(the checks around the first block and the later block referenced) so the
assertions run against the fresh ClusterOperator object returned by getStatuses.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cffe2900-a7c1-40f1-aadc-a0ae96c81244
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
✅ Files skipped from review due to trivial changes (1)
- pkg/util/machineconfig/util.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controller/statusmanager/pod_status.go
@jluhrsen: The following tests failed, say
Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/158a1d90-286f-11f1-94f7-b2058184163e-0
075dccd to 4cd4c8d (force-push)
@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@coderabbitai review

✅ Actions performed: Review triggered.

@coderabbitai full review

✅ Actions performed: Full review triggered.
🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
1966-2218: Add the StatefulSet restart regression too.
pod_status.go now has the same persisted rollout-active path for StatefulSets, but the new restart-resilience coverage here only exercises Deployments. Since the StatefulSet branch uses different readiness fields (ReadyReplicas/AvailableReplicas), it can regress independently.
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1966 - 2218, Add a new test mirroring TestStatusManagerRestoresActiveRolloutAfterRestart but for StatefulSets: create an initial stable StatefulSet to mark installComplete, call status.SetFromPods, create a second StatefulSet that begins an active rollout by having Spec/Generation imply an update and setting Status.ReadyReplicas/Status.AvailableReplicas to show it is not yet ready, then persist the updated status via setStatus (or set) and recreate the StatusManager (New + SetFromPods) to simulate restart; finally call getStatuses and assert ClusterOperator Progressing condition is True. Ensure the test manipulates the StatefulSet-specific readiness fields (ReadyReplicas and AvailableReplicas) rather than Deployment fields so the statefulset branch in SetFromPods is exercised.
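The StatefulSet-specific condition such a test would exercise can be sketched as follows (illustrative types, not the real SetFromPods branch):

```go
package main

import "fmt"

// stsStatus is a stand-in for the StatefulSet readiness fields the suggested
// test would drive; a real test would use appsv1.StatefulSet plus the status
// manager helpers named in the prompt (setStatus, SetFromPods, getStatuses).
type stsStatus struct {
	Replicas          int32
	ReadyReplicas     int32
	AvailableReplicas int32
	Generation        int64
	ObservedGen       int64
}

// rolloutActive models the condition the StatefulSet branch should persist
// across a status-manager restart: a newer generation, or replicas that are
// not yet ready/available.
func rolloutActive(s stsStatus) bool {
	return s.Generation != s.ObservedGen ||
		s.ReadyReplicas < s.Replicas ||
		s.AvailableReplicas < s.Replicas
}

func main() {
	mid := stsStatus{Replicas: 3, ReadyReplicas: 2, AvailableReplicas: 2, Generation: 2, ObservedGen: 2}
	fmt.Println(rolloutActive(mid)) // prints true
}
```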
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 1966-2218: Add a new test mirroring
TestStatusManagerRestoresActiveRolloutAfterRestart but for StatefulSets: create
an initial stable StatefulSet to mark installComplete, call status.SetFromPods,
create a second StatefulSet that begins an active rollout by having
Spec/Generation imply an update and setting
Status.ReadyReplicas/Status.AvailableReplicas to show it is not yet ready, then
persist the updated status via setStatus (or set) and recreate the StatusManager
(New + SetFromPods) to simulate restart; finally call getStatuses and assert
ClusterOperator Progressing condition is True. Ensure the test manipulates the
StatefulSet-specific readiness fields (ReadyReplicas and AvailableReplicas)
rather than Deployment fields so the statefulset branch in SetFromPods is
exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e80b3f2f-2d23-4cb4-b40a-b7d47913bcd2
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist rollout-active state so Progressing stays true until the rollout is fully available, while still ignoring plain reboot churn after install. Restore install-complete state safely across upgrades by treating older last-seen annotations that omit InstallComplete as legacy data instead of assuming install is incomplete again.

For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Reuse shared mcutil helpers for source-only checks, and only prune removed machine-config state after every matching non-paused pool has dropped that config from its rendered source.

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Co-authored-by: Codex
4cd4c8d to 02f6536 (force-push)
@coderabbitai full review

✅ Actions performed: Full review triggered.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/19590540-2882-11f1-8509-0de07b7aa1aa-0

/test e2e-aws-ovn-hypershift-conformance

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/33214f60-28c7-11f1-8461-c1170cad5148-0

/payload 4.22 ci blocking

@jluhrsen: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3949b210-28c7-11f1-8e18-d11a8e9823d9-0
trigger 14 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3949b210-28c7-11f1-8e18-d11a8e9823d9-1
@danwinship if you could please take a look at this now. I bounced these changes through coderabbit, claude and codex a few times. Seems to be catching a few extra scenarios now as well. I've checked approx 7 or more of the
I've got payload jobs running now as some extra testing.
avoid false Progressing during reboot churn