Skip to content

Conversation

@pablintino
Copy link
Contributor

@pablintino pablintino commented Dec 11, 2025

- What I did

Template-generated MachineConfigs (e.g., 01-master-kubelet, 01-master-container-runtime) were setting OSImageURL on the final rendered MachineConfig, even though only user-provided MachineConfigs should be able to override this field.

The root cause was that template generation explicitly sets OSImageURL on all generated MCs, and the merge logic (MergeMachineConfigs) was treating all MCs equally when determining the final OSImageURL value. This meant template-generated MCs would always propagate the base OS image URL to the rendered MC, making it impossible for the system to distinguish between a default value and an intentional override.

This commit fixes the issue by modifying MergeMachineConfigs to skip any MachineConfig with the
machineconfiguration.openshift.io/generated-by-controller-version annotation when evaluating OSImageURL and BaseOSExtensionsContainerImage overrides. This ensures that only user-provided MachineConfigs can override these fields, while still allowing template-generated MCs to have the field populated (which is necessary due to resourcemerge not blanking out previously-set values during upgrades).

The same logic is applied to BaseOSExtensionsContainerImage for consistency.

- How to verify it

Ideally, we would like to test that running clusters:

  1. A freshly deployed cluster with this change should not have MCs with the osImageURL field set. A test-provided MC to override the URL should do what's expected that is to trigger a rendered-MC rollout with the image for the nodes.
  2. A running cluster deployed without this change properly updates to an image with this change. The MCs of both worker and master pool should not have the osImageURL field set (only the render MCs should have it). A test-provided MC to override the URL should do what's expected that is to trigger a rendered-MC rollout with the image for the nodes.
  3. A running cluster deployed without this image, with a test-provided MC to override the osImageURL already rolledout should preserve the image after upgrading the cluster to a cluster with this image.

- Description for the changelog

Prevent template MachineConfigs from overriding OSImageURL

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 11, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 11, 2025

@pablintino: This pull request references MCO-1961 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

- What I did

Template-generated MachineConfigs (e.g., 01-master-kubelet, 01-master-container-runtime) were setting OSImageURL on the final rendered MachineConfig, even though only user-provided MachineConfigs should be able to override this field.

The root cause was that template generation explicitly sets OSImageURL on all generated MCs, and the merge logic (MergeMachineConfigs) was treating all MCs equally when determining the final OSImageURL value. This meant template-generated MCs would always propagate the base OS image URL to the rendered MC, making it impossible for the system to distinguish between a default value and an intentional override.

This commit fixes the issue by modifying MergeMachineConfigs to skip any MachineConfig with the
machineconfiguration.openshift.io/generated-by-controller-version annotation when evaluating OSImageURL and BaseOSExtensionsContainerImage overrides. This ensures that only user-provided MachineConfigs can override these fields, while still allowing template-generated MCs to have the field populated (which is necessary due to resourcemerge not blanking out previously-set values during upgrades).

The same logic is applied to BaseOSExtensionsContainerImage for consistency.

- How to verify it

TBD

- Description for the changelog

Prevent template MachineConfigs from overriding OSImageURL

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2025
@pablintino
Copy link
Contributor Author

/retest-required

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally makes sense to me, one thought inline

// Note: Previously, the OSImageURL was set here (as well as in the rendered MC) to facilitate
// overrides. Now, all the OSImageURL's from generated MCs are ignored and only user provided
// MCs with the OSImageURL set are considered during the merge process.
// mcfg.Spec.OSImageURL = ctrlcommon.GetDefaultBaseImageContainer(config.ControllerConfigSpec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through the possibilities, I think this PR should be safe and gets us to a more clear state. The two issues I can think of with this change is:

  1. as the original comment states, for clusters that upgrade to a version with this change, their OSImageURL in any MCs with GeneratedByControllerVersionAnnotationKey will no longer get updated, essentially freezing them in time. Now, they don't actually get used anymore with your above change, but it may be confusing for a user to see a really old osimage referenced in MCs, and not be used. Perhaps we should add logic to explicitly disable that somewhere
  2. with this change in place, new clusters will no longer have any osImageURL by default in any component MachineConfig. So if I e.g. had an OSImage override and now want to remove it, I can't immediately see from MCs what image I'll be going back to until the MCO renders it. This is probably fine since I doubt anyone actually needed that info, and we can point users to look at available osimagestreams instead with the new changes. Just wanted to leave that as a note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuqi-zhang I agree on both points. For the first one I don't have a better option than documenting how the image URLs work (maybe we can try to leverage the documentation of the upcoming dual-stream).
For the former, I've explicitly cleared the URL so upgrading clusters will reset their template MC's URLs.

@pablintino pablintino force-pushed the rework-osimageurl-merge branch 2 times, most recently from 0cf18d5 to 5844300 Compare December 15, 2025 12:26
Template-generated MachineConfigs (e.g., 01-master-kubelet,
01-master-container-runtime) were setting OSImageURL on the final
rendered MachineConfig, even though only user-provided MachineConfigs
should be able to override this field.

The root cause was that template generation explicitly sets OSImageURL
on all generated MCs, and the merge logic (MergeMachineConfigs) was
treating all MCs equally when determining the final OSImageURL value.
This meant template-generated MCs would always propagate the base OS
image URL to the rendered MC, making it impossible for the system to
distinguish between a default value and an intentional override.

This commit fixes the issue by modifying MergeMachineConfigs to skip
any MachineConfig with the
machineconfiguration.openshift.io/generated-by-controller-version
annotation when evaluating OSImageURL and BaseOSExtensionsContainerImage
overrides. This ensures that only user-provided MachineConfigs can
override these fields, while still allowing template-generated MCs to
have the field populated (which is necessary due to resourcemerge not
blanking out previously-set values during upgrades).

The same logic is applied to BaseOSExtensionsContainerImage for
consistency.
@pablintino pablintino force-pushed the rework-osimageurl-merge branch from 5844300 to 9d52d68 Compare December 15, 2025 12:31
return ctrl.syncFailingStatus(cfg, err)
}

// TODO: To be removed when 4.22 is EOL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we just be able to remove this in 4.23? Or is that what you meant here (4.22 EoL seems to imply the full 3-year(?) cycle)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuqi-zhang Well, I think this code should be in place till all the legacy supported clusters are migrated or EoL. I mean, if someone with a 4.12 cluster upgrades, I'll want to clear his MCs.
Cause, now thinking, is it mandatory to upgrade versions one by one? If so, yes, if they need to pass through 4.22, we can remove the code in 4.23.

// clearImageURLs clears OSImageURL and BaseOSExtensionsContainerImage from template-generated
// MachineConfigs. This is a one-time migration operation to remove these fields from generated
// MachineConfigs, as they should only be set on rendered or user-provided MachineConfigs.
// The function uses sync.Once to ensure it runs only once per controller lifecycle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller lifecycle = each time the pod restarts in this case, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct!

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 15, 2025

@pablintino: This pull request references MCO-1961 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

- What I did

Template-generated MachineConfigs (e.g., 01-master-kubelet, 01-master-container-runtime) were setting OSImageURL on the final rendered MachineConfig, even though only user-provided MachineConfigs should be able to override this field.

The root cause was that template generation explicitly sets OSImageURL on all generated MCs, and the merge logic (MergeMachineConfigs) was treating all MCs equally when determining the final OSImageURL value. This meant template-generated MCs would always propagate the base OS image URL to the rendered MC, making it impossible for the system to distinguish between a default value and an intentional override.

This commit fixes the issue by modifying MergeMachineConfigs to skip any MachineConfig with the
machineconfiguration.openshift.io/generated-by-controller-version annotation when evaluating OSImageURL and BaseOSExtensionsContainerImage overrides. This ensures that only user-provided MachineConfigs can override these fields, while still allowing template-generated MCs to have the field populated (which is necessary due to resourcemerge not blanking out previously-set values during upgrades).

The same logic is applied to BaseOSExtensionsContainerImage for consistency.

- How to verify it

Ideally, we would like to test that running clusters:

  1. A freshly deployed cluster with this change should not have MCs with the osImageURL field set. A test-provided MC to override the URL should do what's expected that is to trigger a rendered-MC rollout with the image for the nodes.
  2. A running cluster deployed without this change properly updates to an image with this change. The MCs of both worker and master pool should not have the osImageURL field set (only the render MCs should have it). A test-provided MC to override the URL should do what's expected that is to trigger a rendered-MC rollout with the image for the nodes.
  3. A running cluster deployed without this image, with a test-provided MC to override the osImageURL already rolledout should preserve the image after upgrading the cluster to a cluster with this image.

- Description for the changelog

Prevent template MachineConfigs from overriding OSImageURL

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sergiordlr
Copy link
Contributor

sergiordlr commented Dec 15, 2025

  • Full regression AWS:

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-aws-ipi-longduration-mco-p3-f7/2000598859025223680
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-aws-ipi-longduration-mco-p2-f7/2000599502330793984
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-aws-ipi-longduration-mco-p1-f7/2000599690252390400

  • GPC critical

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-gcp-ipi-f7-longduration-mco-critical/2000600213802192896

  • Vsphere prow jobs are being reconfigured right now. I hope tomorrow they will be merged.

We launch generic openshift e2e tests for agent installer
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/test-platform-results/logs/periodic-ci-openshift-eng-agent-qe-infra-release-4.21-amd64-nightly-vsphere-agent-dualstack-sno-f7/2000600700941242368

  • Upgrade from 4.18 -> 4.21

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.21-multi-nightly-4.21-upgrade-from-stable-4.18-aws-ipi-proxy-sts-mini-perm-arm-f28/2000602439417335808

  • OCL upgrade from 4.20 and 4.21. Manually executed

Both upgrades worked as expected. No issues found. The osImageURL value was removed from the the MCs generated by MCO, except from the rendered MCs.

@pablintino
Copy link
Contributor Author

/retest-required

@yuqi-zhang
Copy link
Contributor

/lgtm
/label acknowledge-critical-fixes-only
/verified by @sergiordlr

Thanks for the discussion and testing. Let's get this in for dual stream

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Dec 15, 2025
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 15, 2025
@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: This PR has been marked as verified by @sergiordlr.

Details

In response to this:

/lgtm
/label acknowledge-critical-fixes-only
/verified by @sergiordlr

Thanks for the discussion and testing. Let's get this in for dual stream

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pablintino, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [pablintino,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD ae23a39 and 2 for PR HEAD 9d52d68 in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

@pablintino: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/bootstrap-unit 9d52d68 link false /test bootstrap-unit

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 2c3a1eb into openshift:main Dec 16, 2025
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants