CORENET-6572: only report Progressing for active network rollouts #2937
jluhrsen wants to merge 1 commit into openshift:master from
Conversation
@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the
Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (4)
Walkthrough
Removed global MCP-progressing check; per-rendered-MC processing now uses source-based machineconfig helpers and defers pruning until a non-paused pool confirms removal. Persisted per-pod snapshot fields (
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
✨ Finishing Touches
🧪 Generate unit tests (beta)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.11.3)
level=error msg="Running error: context loading failed: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :
	github.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	github.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt
	... [truncated: many more modules with the same vendoring mismatch] ...
	sigs.k8s.io/structured-merge-diff/v6@v6.3.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt

	To ignore the vendor directory, use -mod=readonly or -mod=mod.
	To sync the vendor directory, run:
		go mod vendor"
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jluhrsen
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
1700-1736: Please add a cold-start mid-rollout regression case. Current tests validate rollout continuation with existing state and reboot churn without rollout, but they don't cover first-observation behavior when last-seen state is empty and the workload is already at observedGeneration==generation, updated==replicas, unavailable>0. That scenario should be pinned to avoid a false-negative Progressing.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1826-1856
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1700 - 1736, Add a new unit test that covers the cold-start mid-rollout regression: create a Deployment (reuse depB or name it depColdStart) with Status.ObservedGeneration == Generation, Status.UpdatedReplicas == Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty last-seen state (i.e., do not pre-populate any prior status in the status manager). Then call setStatus(t, client, depColdStart), run status.SetFromPods(), and call getStatuses(client, "testing"); finally assert via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing is ConditionTrue (and other expected conditions mirror the other tests). Use the same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses, conditionsInclude) and place the test near the existing rollout tests so it executes in the same context.
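The condition that prompt wants pinned can be sketched as a minimal Go model. The types and helper below are illustrative stand-ins, not the real appsv1 or status-manager APIs (setStatus, SetFromPods, getStatuses are the helpers named in the prompt):

```go
package main

import "fmt"

// deployStatus is a stand-in for the Deployment status fields the review
// comment exercises; a real test would use appsv1.Deployment.
type deployStatus struct {
	Generation          int64
	ObservedGeneration  int64
	Replicas            int32
	UpdatedReplicas     int32
	UnavailableReplicas int32
}

// progressingOnColdStart models the desired first-observation behavior: with
// no prior last-seen state, a workload that is fully updated but still has
// unavailable replicas should be reported as Progressing.
func progressingOnColdStart(hadState bool, s deployStatus) bool {
	if hadState {
		return false // continuation logic is handled elsewhere
	}
	return s.ObservedGeneration == s.Generation &&
		s.UpdatedReplicas == s.Replicas &&
		s.UnavailableReplicas > 0
}

func main() {
	s := deployStatus{Generation: 3, ObservedGeneration: 3, Replicas: 3, UpdatedReplicas: 3, UnavailableReplicas: 1}
	fmt.Println(progressingOnColdStart(false, s)) // prints true
}
```

The real test would drive this through SetFromPods and assert OperatorStatusTypeProgressing via conditionsInclude, as the prompt describes.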
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 93-110: The dsState.RolloutGeneration must be initialized when the
controller first sees a DaemonSet mid-convergence even if
generation==observedGeneration; change the initialization logic in the DaemonSet
handling (variables: dsState, RolloutGeneration, currentRolloutGeneration) so
that when a state entry is new or RolloutGeneration==0 and the DaemonSet has
in-flight convergence (ds.Status.NumberUnavailable > 0 OR
(ds.Status.DesiredNumberScheduled > 0 && ds.Status.UpdatedNumberScheduled <
ds.Status.DesiredNumberScheduled)), you set dsState.RolloutGeneration =
currentRolloutGeneration (respecting the existing status.installComplete guard
if needed); apply the same initialization pattern to the analogous StatefulSet
and Deployment blocks (the corresponding variables and checks around
observedGeneration, UpdatedNumberScheduled/Ready/DesiredNumberScheduled, and
NumberUnavailable/NotReady).
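The initialization condition described above can be sketched as follows; the types are illustrative stand-ins, not the real appsv1.DaemonSetStatus or controller state:

```go
package main

import "fmt"

// dsStatus is a stand-in for the DaemonSet status fields named in the prompt.
type dsStatus struct {
	Generation             int64
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberUnavailable      int32
}

// inFlightConvergence mirrors the condition the comment asks for: the
// DaemonSet is still converging even when generation == observedGeneration.
func inFlightConvergence(s dsStatus) bool {
	return s.NumberUnavailable > 0 ||
		(s.DesiredNumberScheduled > 0 && s.UpdatedNumberScheduled < s.DesiredNumberScheduled)
}

// initRolloutGeneration shows where the fix applies: a new (or zero) state
// entry picks up the current generation when convergence is in flight.
func initRolloutGeneration(stateGen int64, s dsStatus) int64 {
	if stateGen == 0 && inFlightConvergence(s) {
		return s.Generation // currentRolloutGeneration in the real code
	}
	return stateGen
}

func main() {
	s := dsStatus{Generation: 2, ObservedGeneration: 2, DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2}
	fmt.Println(initRolloutGeneration(0, s)) // prints 2
}
```

The same pattern would apply to the analogous StatefulSet and Deployment blocks, per the prompt.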
---
Nitpick comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 1700-1736: Add a new unit test that covers the cold-start
mid-rollout regression: create a Deployment (reuse depB or name it depColdStart)
with Status.ObservedGeneration == Generation, Status.UpdatedReplicas ==
Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty
last-seen state (i.e., do not pre-populate any prior status in the status
manager). Then call setStatus(t, client, depColdStart), run
status.SetFromPods(), and call getStatuses(client, "testing"); finally assert
via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing
is ConditionTrue (and other expected conditions mirror the other tests). Use the
same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses,
conditionsInclude) and place the test near the existing rollout tests so it
executes in the same context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c8f95521-c9ef-41f3-bf7d-f168f710f1c4
📒 Files selected for processing (3)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
da9acf2 to 2e2113f (force-push)
@danwinship, if you have a chance, please check this one out. It looks like the network operator will hit this issue in 90% of the 4.21->4.22 aws-ovn-upgrade jobs, so running it here a few times will be enough to validate the fix at that high level.
/payload-job-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

/retest

2e2113f to bb1965f (force-push)

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c24b550-23ae-11f1-97df-f16d2b83fd79-0
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: The test is asserting co.Status.Versions from a stale
ClusterOperator object after calling getStatuses; update the test to re-fetch or
assign the returned updated ClusterOperator before asserting Versions (e.g., use
the returned oc from getStatuses to set co = oc or call getStatuses again) so
that assertions against Status.Versions use the latest ClusterOperator state;
specifically adjust the places around the getStatuses call and the subsequent
assertions that reference co.Status.Versions (occurrences around getStatuses, co
variable, and Status.Versions) to use the refreshed object.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3654515-9f48-47fd-8ed3-357ead25e962
📒 Files selected for processing (3)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/statusmanager/pod_status.go
sorry about the

I hope it should work now
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e2eeb00-23b2-11f1-9a0c-ae4d6f10dbf5-0

no worries @petr-muller, thanks for kicking it off again

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2429a040-2582-11f1-86c7-0543ceeba5f2-0

/retest
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2c409070-261a-11f1-8f79-8f7632132838-0
danwinship left a comment:
I feel like #2936 was probably closer to correct...
		sourceNames.Insert(source.Name)
	}
	return sourceNames.IsSuperset(machineConfigs)
}
So this is just duplicating the logic of mcutil.AreMachineConfigsRemovedFromPool but removing the check that status.MachineCount == status.UpdatedMachineCount. We should not be duplicating that logic here; it should remain in mcutil.
If it is actually correct that we need to check the machine count here, but not from the place in pkg/network/ovn_kubernetes.go that calls mcutil.AreMachineConfigsRemovedFromPool, then we should fix this by splitting out that check from mcutil.AreMachineConfigsRemovedFromPool. But if we don't want that check here, then I'm not convinced we want that check in pkg/network/ovn_kubernetes.go either.
@danwinship , I've spent some back and forth on this and it's a bit different than your first review now. Will try to respond to each comment though:
we've moved the helpers AreMachineConfigsRenderedOnPoolSource() / AreMachineConfigsRemovedFromPoolSource() to mcutil.
logic should be that we only care whether the CNO MC is part of the pool's rendered source yet, not if it's finished converging on nodes. once it is, we don't need to keep CNO as progressing even in case of generic MCP node churn
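As a rough sketch of that source-only check (illustrative types only; the real helpers live in mcutil and operate on MachineConfigPool status sources):

```go
package main

import "fmt"

// poolStatus is an illustrative stand-in for the pool's rendered-config
// source list; the real code reads MachineConfigPoolStatus.Configuration.Source.
type poolStatus struct {
	SourceNames []string
}

// machineConfigsRenderedOnPoolSource reports whether every CNO machine config
// already appears among the pool's rendered sources, ignoring node
// convergence entirely, which is the point made above.
func machineConfigsRenderedOnPoolSource(pool poolStatus, machineConfigs []string) bool {
	sources := make(map[string]bool, len(pool.SourceNames))
	for _, name := range pool.SourceNames {
		sources[name] = true
	}
	for _, mc := range machineConfigs {
		if !sources[mc] { // equivalent to sourceNames.IsSuperset(machineConfigs)
			return false
		}
	}
	return true
}

func main() {
	pool := poolStatus{SourceNames: []string{"00-worker", "50-ovn-cno"}}
	fmt.Println(machineConfigsRenderedOnPoolSource(pool, []string{"50-ovn-cno"})) // prints true
}
```

Once this returns true, generic MCP node churn no longer keeps CNO Progressing, matching the rationale above.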
		sourceNames.Insert(source.Name)
	}
	return !sourceNames.HasAny(machineConfigs.UnsortedList()...)
}
hope the above reply makes sense.
	default:
		return 1
	}
}
observedGeneration must be less than or equal to generation, so this function never returns observedGeneration. It's just max(generation, 1).
got it. should be good now
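For reference, the simplification danwinship describes reduces the function to this (a sketch, not the actual code):

```go
package main

import "fmt"

// Since observedGeneration <= generation always holds, the branch returning
// observedGeneration is unreachable, and the default clamps to 1, so the
// whole switch is equivalent to max(generation, 1).
func rolloutGeneration(generation int64) int64 {
	if generation > 1 {
		return generation
	}
	return 1
}

func main() {
	fmt.Println(rolloutGeneration(0), rolloutGeneration(5)) // prints 1 5
}
```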
		(ds.Status.DesiredNumberScheduled == 0 || ds.Status.NumberAvailable >= ds.Status.DesiredNumberScheduled) {
		dsState.RolloutGeneration = 0
	}
	dsRolloutActive := dsState.RolloutGeneration != 0
This seems like it must be way more complicated than it needs to be. I don't remember the exact details of how the daemonset controller reports status, but you should not need all of this just to distinguish the cases of "daemonset controller is doing stuff because the DaemonSet was updated" vs "daemonset controller is doing stuff because the nodes were updated".
What does dsState.RolloutGeneration represent? It seems that sometimes it is 0 and sometimes it is 1 and sometimes it is the current ds.Generation. This seems like it's overloaded and trying to represent multiple things at once.
yes, TBH I was not very happy with how this was showing up; it is a little complicated to look at. The newest update puts the work into dedicated helpers with a comment block that hopefully explains it better.
there is no RolloutGeneration anymore, but we now have a RolloutActive bool to tell us whether we already saw a rollout start, set in these new dedicated helpers. Keeping track of that should help with knowing if we are in a real rollout vs reboot churn.
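A hedged sketch of what that RolloutActive bookkeeping could look like (names and conditions here are illustrative, not the exact helper code):

```go
package main

import "fmt"

// dsSnapshot is a stand-in for the DaemonSet status fields involved.
type dsSnapshot struct {
	Generation         int64
	ObservedGeneration int64
	NumberUnavailable  int32
	DesiredScheduled   int32
	NumberAvailable    int32
}

// updateRolloutActive sets the flag when a spec change starts a rollout and
// clears it only once the workload is fully available again. Plain reboot
// churn (unavailable pods with no pending spec change) never flips it on.
func updateRolloutActive(active bool, s dsSnapshot) bool {
	if s.Generation != s.ObservedGeneration {
		return true // new rollout observed
	}
	if active && (s.DesiredScheduled == 0 || s.NumberAvailable >= s.DesiredScheduled) {
		return false // rollout fully available; stop reporting Progressing
	}
	return active
}

func main() {
	churn := dsSnapshot{Generation: 4, ObservedGeneration: 4, NumberUnavailable: 1, DesiredScheduled: 3, NumberAvailable: 2}
	fmt.Println(updateRolloutActive(false, churn)) // prints false
}
```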
	reachedAvailableLevel = false

	dsState, exists := daemonsetStates[dsName]
	if !exists || !reflect.DeepEqual(dsState.LastSeenStatus, ds.Status) {
(with the dsState, exists := daemonsetStates[dsName] moved, it's no longer clear what exists means at this point, so if the code stays like this it would need to be renamed)
agreed. it's changed to hadState here now; hopefully that makes it easier to follow.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3d993720-27f8-11f1-9bbb-e6c2fe987086-0
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
520-520: ⚠️ Potential issue | 🟡 Minor
Refresh co before the Status.Versions assertions. These two calls still discard the updated ClusterOperator, but lines 540 and 576 keep reading co.Status.Versions. That means the test is still validating the object fetched earlier in the method and can miss a regression in version mirroring.
Suggested fix:
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
@@
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
Also applies to: 564-564
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` at line 520, Test reads stale ClusterOperator into variable co then asserts on co.Status.Versions; refresh co from the API before those assertions. After the call to getStatuses (or where the operator update occurs), re-fetch the updated ClusterOperator (e.g., call getStatuses again or perform a fresh GET for the ClusterOperator) and use that refreshed co when asserting on co.Status.Versions so the test validates the latest mirrored versions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Line 41: The InstallComplete field is a plain bool so legacy annotations
deserialize as "missing" => false; change the field type to *bool
(InstallComplete *bool) and update all restore/read sites to treat nil as
"unknown" (fall back to existing ClusterOperator availability or previous
versioned state) before deciding to flip/install-complete; specifically update
the annotation restore logic, any IsInstallComplete checks, and the reconcile
paths that currently assume false (references to InstallComplete usage around
the earlier reconcile check and the bootstrap re-entry logic) to handle nil
safely and only treat explicit false when the pointer is non-nil.
---
Duplicate comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: Test reads stale ClusterOperator into variable co then asserts on
co.Status.Versions; refresh co from the API before those assertions. After the
call to getStatuses (or where the operator update occurs), re-fetch the updated
ClusterOperator (e.g., call getStatuses again or perform a fresh GET for the
ClusterOperator) and use that refreshed co when asserting on co.Status.Versions
so the test validates the latest mirrored versions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 73fd88cf-e693-4789-96a6-299b7c1fb92d
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controller/statusmanager/machineconfig_status.go
727c270 to 075dccd (force-push)
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/41aa3610-27f9-11f1-8f25-d6288f44483f-0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controller/statusmanager/machineconfig_status.go (1)
160-175: ⚠️ Potential issue | 🟠 Major
Wait for every matching pool before pruning removal state. With the new source-only check, line 161 can return true as soon as one pool's rendered source drops the machine config. Lines 163-174 then delete it from both tracking maps, so later pools for the same role are skipped. A cluster with the base worker pool plus a custom worker-derived pool can clear Progressing and remove the persisted annotation before all non-paused pools have removed the config; defer that cleanup until every matching pool reports removal.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/machineconfig_status.go` around lines 160 - 175, The code currently prunes role-level state as soon as AreMachineConfigsRemovedFromPoolSource returns true for a single pool, which can cause premature deletion when other pools for the same role still reference the machineConfig; change the logic so you only delete from status.machineConfigsBeingRemoved and status.renderedMachineConfigs (and call setLastRenderedMachineConfigState) after verifying the machineConfig is removed from every pool for that role. Concretely, replace the single-pool check that uses removed (computed from mcutil.AreMachineConfigsRemovedFromPoolSource(pool.Status, mcSet)) with a check that iterates all pools for the same role (the same collection you use to reconcile pools where the loop provides variable pool) and calls mcutil.AreMachineConfigsRemovedFromPoolSource on each pool.Status/mcSet, setting a removedFromAll flag; only if removedFromAll is true perform the Delete(...) calls and the delete(map) cleanup and invoke status.setLastRenderedMachineConfigState(status.renderedMachineConfigs).
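The removedFromAll logic the prompt asks for can be sketched like this (stand-in types; the real check would call mcutil.AreMachineConfigsRemovedFromPoolSource per pool):

```go
package main

import "fmt"

// pool is an illustrative stand-in for a MachineConfigPool: Paused plus the
// set of machine-config names in its rendered source.
type pool struct {
	Paused      bool
	SourceNames map[string]bool
}

func removedFromPoolSource(p pool, mc string) bool {
	return !p.SourceNames[mc]
}

// removedFromAllPools only reports removal once every non-paused pool for
// the role has dropped the config from its rendered source, so state is not
// pruned while a custom derived pool still references it.
func removedFromAllPools(pools []pool, mc string) bool {
	confirmed := false
	for _, p := range pools {
		if p.Paused {
			continue // a paused pool cannot confirm removal
		}
		if !removedFromPoolSource(p, mc) {
			return false
		}
		confirmed = true
	}
	return confirmed
}

func main() {
	pools := []pool{
		{SourceNames: map[string]bool{}},                    // base pool: already dropped
		{SourceNames: map[string]bool{"50-ovn-cno": true}}, // custom pool: still references it
	}
	fmt.Println(removedFromAllPools(pools, "50-ovn-cno")) // prints false
}
```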
♻️ Duplicate comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
520-542: ⚠️ Potential issue | 🟡 Minor
Reassign co before the Status.Versions checks. co is still the object fetched on line 489. After the calls on lines 520 and 564, lines 540 and 576 assert stale data, so these checks can miss a regression in ClusterOperator version mirroring.
Suggested fix:
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
@@
- _, oc, err = getStatuses(client, "testing")
+ co, oc, err = getStatuses(client, "testing")
Also applies to: 564-578
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 520 - 542, The Status.Versions assertions are using a stale co (it wasn’t reassigned after calling getStatuses), so change the getStatuses call(s) that currently assign only oc (e.g., "_, oc, err = getStatuses(client, \"testing\")") to capture and reassign co (e.g., "co, oc, err = getStatuses(...)") before the Status.Versions checks; update both occurrences (the checks around the first block and the later block referenced) so the assertions run against the fresh ClusterOperator object returned by getStatuses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pkg/controller/statusmanager/machineconfig_status.go`:
- Around line 160-175: The code currently prunes role-level state as soon as
AreMachineConfigsRemovedFromPoolSource returns true for a single pool, which can
cause premature deletion when other pools for the same role still reference the
machineConfig; change the logic so you only delete from
status.machineConfigsBeingRemoved and status.renderedMachineConfigs (and call
setLastRenderedMachineConfigState) after verifying the machineConfig is removed
from every pool for that role. Concretely, replace the single-pool check that
uses removed (computed from
mcutil.AreMachineConfigsRemovedFromPoolSource(pool.Status, mcSet)) with a check
that iterates all pools for the same role (the same collection you use to
reconcile pools where the loop provides variable pool) and calls
mcutil.AreMachineConfigsRemovedFromPoolSource on each pool.Status/mcSet, setting
a removedFromAll flag; only if removedFromAll is true perform the Delete(...)
calls and the delete(map) cleanup and invoke
status.setLastRenderedMachineConfigState(status.renderedMachineConfigs).
---
Duplicate comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 520-542: The Status.Versions assertions are using a stale co (it
wasn’t reassigned after calling getStatuses), so change the getStatuses call(s)
that currently assign only oc (e.g., "_, oc, err = getStatuses(client,
\"testing\")") to capture and reassign co (e.g., "co, oc, err =
getStatuses(...)") before the Status.Versions checks; update both occurrences
(the checks around the first block and the later block referenced) so the
assertions run against the fresh ClusterOperator object returned by getStatuses.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cffe2900-a7c1-40f1-aadc-a0ae96c81244
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
✅ Files skipped from review due to trivial changes (1)
- pkg/util/machineconfig/util.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controller/statusmanager/pod_status.go
@jluhrsen: The following tests failed, say
Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/158a1d90-286f-11f1-94f7-b2058184163e-0
075dccd to 4cd4c8d (force-push)
@jluhrsen: This pull request references CORENET-6572 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@coderabbitai review

✅ Actions performed: Review triggered.

@coderabbitai full review

✅ Actions performed: Full review triggered.
🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
1966-2218: Add the StatefulSet restart regression too.
pod_status.go now has the same persisted rollout-active path for StatefulSets, but the new restart-resilience coverage here only exercises Deployments. Since the StatefulSet branch uses different readiness fields (ReadyReplicas/AvailableReplicas), it can regress independently.
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1966 - 2218, Add a new test mirroring TestStatusManagerRestoresActiveRolloutAfterRestart but for StatefulSets: create an initial stable StatefulSet to mark installComplete, call status.SetFromPods, create a second StatefulSet that begins an active rollout by having Spec/Generation imply an update and setting Status.ReadyReplicas/Status.AvailableReplicas to show it is not yet ready, then persist the updated status via setStatus (or set) and recreate the StatusManager (New + SetFromPods) to simulate restart; finally call getStatuses and assert ClusterOperator Progressing condition is True. Ensure the test manipulates the StatefulSet-specific readiness fields (ReadyReplicas and AvailableReplicas) rather than Deployment fields so the statefulset branch in SetFromPods is exercised.
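The StatefulSet-specific condition such a test would exercise can be sketched as follows (illustrative types, not the real SetFromPods branch):

```go
package main

import "fmt"

// stsStatus is a stand-in for the StatefulSet readiness fields the suggested
// test would drive; a real test would use appsv1.StatefulSet plus the status
// manager helpers named in the prompt (setStatus, SetFromPods, getStatuses).
type stsStatus struct {
	Replicas          int32
	ReadyReplicas     int32
	AvailableReplicas int32
	Generation        int64
	ObservedGen       int64
}

// rolloutActive models the condition the StatefulSet branch should persist
// across a status-manager restart: a newer generation, or replicas that are
// not yet ready/available.
func rolloutActive(s stsStatus) bool {
	return s.Generation != s.ObservedGen ||
		s.ReadyReplicas < s.Replicas ||
		s.AvailableReplicas < s.Replicas
}

func main() {
	mid := stsStatus{Replicas: 3, ReadyReplicas: 2, AvailableReplicas: 2, Generation: 2, ObservedGen: 2}
	fmt.Println(rolloutActive(mid)) // prints true
}
```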
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 1966-2218: Add a new test mirroring
TestStatusManagerRestoresActiveRolloutAfterRestart but for StatefulSets: create
an initial stable StatefulSet to mark installComplete, call status.SetFromPods,
create a second StatefulSet that begins an active rollout by having
Spec/Generation imply an update and setting
Status.ReadyReplicas/Status.AvailableReplicas to show it is not yet ready, then
persist the updated status via setStatus (or set) and recreate the StatusManager
(New + SetFromPods) to simulate restart; finally call getStatuses and assert
ClusterOperator Progressing condition is True. Ensure the test manipulates the
StatefulSet-specific readiness fields (ReadyReplicas and AvailableReplicas)
rather than Deployment fields so the statefulset branch in SetFromPods is
exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e80b3f2f-2d23-4cb4-b40a-b7d47913bcd2
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
Keep pod-based Progressing tied to an actual CNO rollout instead of temporary unavailability during node reboot churn. Persist rollout-active state so Progressing stays true until the rollout is fully available, while still ignoring plain reboot churn after install. Restore install-complete state safely across upgrades by treating older last-seen annotations that omit InstallComplete as legacy data instead of assuming install is incomplete again.

For machine config status, stop treating generic MCP node convergence as a CNO rollout signal. Reuse shared mcutil helpers for source-only checks, and only prune removed machine-config state after every matching non-paused pool has dropped that config from its rendered source.

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Co-authored-by: Codex
4cd4c8d to 02f6536 (force-push)
@coderabbitai full review

✅ Actions performed: Full review triggered.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/19590540-2882-11f1-8509-0de07b7aa1aa-0

/test e2e-aws-ovn-hypershift-conformance

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/33214f60-28c7-11f1-8461-c1170cad5148-0

/payload 4.22 ci blocking

@jluhrsen: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3949b210-28c7-11f1-8e18-d11a8e9823d9-0
trigger 14 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3949b210-28c7-11f1-8e18-d11a8e9823d9-1
@danwinship if you could please take a look at this now. I bounced these changes through coderabbit, claude and codex a few times. Seems to be catching a few extra scenarios now as well. I've checked approx 7 or more of the
I've got payload jobs running now as some extra testing.
avoid false Progressing during reboot churn