Conversation

@pablintino
Contributor

@pablintino pablintino commented Dec 9, 2025

- What I did

This change integrates the existing OSImageStream logic into the MCO reconcile flow.
The change covers:

  • Integration of the OSImageStreams into the bootstrapping logic.
  • Integration of the OSImageStreams into the runtime config.
  • Extra required changes to the already existing OSImageStream logic:
    • Added extra error types to the "retriable" errors, since network timeouts and DNS errors can occur during cluster boot while the network is still coming up.
    • Removed the OSImageStream controller and moved all its logic to a dedicated sync function in the operator.

- How to verify it

TBD.

- Description for the changelog

Integrated OSImageStream logic into MCO reconcile flow, covering bootstrapping and runtime config, with error handling improvements for network stability and a refactored sync function.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 9, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 9, 2025

@pablintino: This pull request references MCO-1961 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 9, 2025
@pablintino pablintino force-pushed the osimagestream-integration branch 4 times, most recently from 180934f to eedbfd7 Compare December 9, 2025 15:03
Comment on lines +76 to +83
// Start the config informer early because the feature gate handler depends on it
ctrlctx.ConfigInformerFactory.Start(ctrlctx.Stop)
if fgErr := ctrlctx.FeatureGatesHandler.Connect(ctx); fgErr != nil {
klog.Error(fmt.Errorf("failed to connect to feature gates %w", fgErr))
runCancel()
<-ctx.Done()
return
}
Contributor Author

Moved earlier because I need to check whether FeatureGateOSStreams is active at each controller instantiation.

if fgErr := ctrlctx.FeatureGatesHandler.Connect(ctx); fgErr != nil {
klog.Fatal(fmt.Errorf("failed to connect to feature gates %w", fgErr))
}
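
For illustration, the per-controller check this enables looks roughly like the following; the wiring around it is a sketch, not verbatim code from this PR:

// Requires FeatureGatesHandler.Connect() to have succeeded earlier in startup.
if osimagestream.IsFeatureEnabled(ctrlctx.FeatureGatesHandler) {
	// Instantiate the OSImageStream-aware variant of the controller.
} else {
	// Fall back to the pre-existing ControllerConfig-URL behavior.
}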

Contributor Author

mcfgv1.GroupVersion, apioperatorsv1alpha1.GroupVersion,
apicfgv1.GroupVersion, apicfgv1alpha1.GroupVersion,
corev1.SchemeGroupVersion, mcfgv1alpha1.GroupVersion,
imagev1.SchemeGroupVersion)
Contributor Author

Comment on lines +184 to +195
case *imagev1.ImageStream:
for _, tag := range obj.Spec.Tags {
if tag.Name == "machine-config-operator" {
if imageStream != nil {
klog.Infof("multiple ImageStream found. Previous ImageStream %s replaced by %s", imageStream.Name, obj.Name)
}
imageStream = obj

}
}
// It's an ImageStream that doesn't look like the Release one (doesn't have our tag)
Contributor Author

Try to read the payload ImageStream that the installer copies into our manifests dir. The content is used by the OSImageStream logic to fetch the CoreOS images, read their labels, and group the images by stream label.
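
Roughly, the grouping step amounts to the following; the type and the label key here are hypothetical stand-ins, not the PR's actual ones:

// inspectedImage is a hypothetical stand-in for whatever the images
// inspector returns; the stream label key is likewise assumed.
type inspectedImage struct {
	Pullspec string
	Labels   map[string]string
}

func groupByStream(images []inspectedImage) map[string][]inspectedImage {
	groups := map[string][]inspectedImage{}
	for _, img := range images {
		stream := img.Labels["ostree.stream"] // assumed label key, e.g. "rhel-9"
		groups[stream] = append(groups[stream], img)
	}
	return groups
}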

iconfigs, err := template.RunBootstrap(b.templatesDir, cconfig, psraw, apiServer)
var osImageStream *mcfgv1alpha1.OSImageStream
// Enable OSImageStreams if the FeatureGate is active and the deployment is not OKD
if osimagestream.IsFeatureEnabled(fgHandler) {
Contributor Author

If the feature is enabled, we collect the streams and create the bootstrap-time OSImageStream CR, which is passed to the bootstrapping logic of each controller so that the generated MCs point to the default stream and the CC URLs point to it as well.
If the feature is not enabled, nil is passed everywhere and the pre-existing logic based on the CC image URLs is used instead.

Comment on lines +1039 to +1060
// GetBaseImageContainer returns the default bootable host base image.
func GetBaseImageContainer(cconfigspec *mcfgv1.ControllerConfigSpec, imageStream *v1alpha1.OSImageStreamSet) string {
if imageStream == nil {
return cconfigspec.BaseOSContainerImage
}
return string(imageStream.OSImage)
}

// GetBaseExtensionsImageContainer returns the default bootable host extensions image.
func GetBaseExtensionsImageContainer(cconfigspec *mcfgv1.ControllerConfigSpec, imageStream *v1alpha1.OSImageStreamSet) string {
if imageStream == nil {
return cconfigspec.BaseOSExtensionsContainerImage
}
return string(imageStream.OSExtensionsImage)
Contributor Author

If an OSImageStream is given, it takes precedence over the regular cluster-wide image URLs.
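
For illustration, call sites reduce to this (streamSet is an assumed variable of type *v1alpha1.OSImageStreamSet):

osImage := GetBaseImageContainer(&cconfig.Spec, nil)       // no stream: falls back to cconfig.Spec.BaseOSContainerImage
osImage = GetBaseImageContainer(&cconfig.Spec, streamSet)  // stream wins: returns streamSet.OSImage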

Comment on lines -339 to -395
imgRegistryUsrData := []mcfgv1.ImageRegistryBundle{}
if cfg.Spec.AdditionalTrustedCA.Name != "" {
cm, err := optr.ocCmLister.ConfigMaps(ctrlcommon.OpenshiftConfigNamespace).Get(cfg.Spec.AdditionalTrustedCA.Name)
if err != nil {
klog.Warningf("could not find configmap specified in image.config.openshift.io/cluster with the name %s", cfg.Spec.AdditionalTrustedCA.Name)
} else {
newKeys := sets.StringKeySet(cm.Data).List()
newBinaryKeys := sets.StringKeySet(cm.BinaryData).List()
for _, key := range newKeys {
raw, err := base64.StdEncoding.DecodeString(cm.Data[key])
if err != nil {
imgRegistryUsrData = append(imgRegistryUsrData, mcfgv1.ImageRegistryBundle{
File: key,
Data: []byte(cm.Data[key]),
})
} else {
imgRegistryUsrData = append(imgRegistryUsrData, mcfgv1.ImageRegistryBundle{
File: key,
Data: raw,
})
}
}
for _, key := range newBinaryKeys {
imgRegistryUsrData = append(imgRegistryUsrData, mcfgv1.ImageRegistryBundle{
File: key,
Data: cm.BinaryData[key],
})
}
}
}

imgRegistryData := []mcfgv1.ImageRegistryBundle{}
cm, err := optr.clusterCmLister.ConfigMaps("openshift-config-managed").Get("image-registry-ca")
if err == nil {
newKeys := sets.StringKeySet(cm.Data).List()
newBinaryKeys := sets.StringKeySet(cm.BinaryData).List()
for _, key := range newKeys {
raw, err := base64.StdEncoding.DecodeString(cm.Data[key])
if err != nil {
imgRegistryData = append(imgRegistryData, mcfgv1.ImageRegistryBundle{
File: key,
Data: []byte(cm.Data[key]),
})
} else {
imgRegistryData = append(imgRegistryData, mcfgv1.ImageRegistryBundle{
File: key,
Data: raw,
})
}
}
for _, key := range newBinaryKeys {
imgRegistryData = append(imgRegistryData, mcfgv1.ImageRegistryBundle{
File: key,
Data: cm.BinaryData[key],
})
}
}
Contributor Author

Logic extracted to a common function that is now also used by the OSImageStream sync function.
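
A sketch of what that shared helper could look like; the name is assumed, but the body mirrors the decode-or-fallback behavior of the removed block above:

func imageRegistryBundlesFromConfigMap(cm *corev1.ConfigMap) []mcfgv1.ImageRegistryBundle {
	bundles := []mcfgv1.ImageRegistryBundle{}
	for _, key := range sets.StringKeySet(cm.Data).List() {
		// Data values may be base64-encoded; keep the decoded bytes when they
		// are, otherwise fall back to the raw string bytes.
		if raw, err := base64.StdEncoding.DecodeString(cm.Data[key]); err == nil {
			bundles = append(bundles, mcfgv1.ImageRegistryBundle{File: key, Data: raw})
		} else {
			bundles = append(bundles, mcfgv1.ImageRegistryBundle{File: key, Data: []byte(cm.Data[key])})
		}
	}
	for _, key := range sets.StringKeySet(cm.BinaryData).List() {
		bundles = append(bundles, mcfgv1.ImageRegistryBundle{File: key, Data: cm.BinaryData[key]})
	}
	return bundles
}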

Contributor Author

So, why is this code removed? It has been moved from the controller to the operator. Why? Because we need the OSImageStream CR created ASAP. While having a dedicated controller for the CR may make sense, it's really hard to do. Why?

  1. Each piece of code that reads the instance would have to account for the CR not being created yet.
  2. The ControllerConfig would require an extra sync to rewrite the OSImages with the content of the default stream, and that could roll out an undesired update.
  3. Given that the CR never changes, there's no need to create a controller for it.

The content of this file is approximately the code used in https://github.com/openshift/machine-config-operator/pull/5476/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R1296

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 9, 2025

@pablintino: This pull request references MCO-1961 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

- What I did

This change integrates the existing OSImageStream logic into the MCO reconcile flow.
The change covers:

  • Integration of the OSImageStreams into the bootstrapping logic.
  • Integration of the OSImageStreams into the runtime config.
  • Extra required changes to the already existing OSImageStream logic:
  • Added extra error types to the "retriable" errors, since network timeouts and DNS errors can occur during cluster boot while the network is still coming up.
  • Removed the OSImageStream controller and moved all its logic to a dedicated sync function in the operator.

- How to verify it

TBD.

- Description for the changelog

Integrated OSImageStream logic into MCO reconcile flow, covering bootstrapping and runtime config, with error handling improvements for network stability and a refactored sync function.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@pablintino
Contributor Author

/hold
To be reviewed and refined.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 9, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2025
@pablintino pablintino force-pushed the osimagestream-integration branch from d614675 to fed2a19 Compare December 10, 2025 12:42
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2025
@pablintino pablintino force-pushed the osimagestream-integration branch from fed2a19 to c26a561 Compare December 10, 2025 12:42
return nil
}

func (optr *Operator) syncOSImageStream(_ *renderConfig, _ *configv1.ClusterOperator) error {
Member

praise: Moving this into the operator seems like a great idea!

defer buildCancel()

imageStreamFactory := osimagestream.NewDefaultStreamSourceFactory(optr.mcoCmLister, &osimagestream.DefaultImagesInspectorFactory{})
osImageStream, err := osimagestream.BuildOsImageStreamRuntime(buildCtx, clusterPullSecret, minimalCC, image, imageStreamFactory)
Member

question: Every time this runs, will this function reach out to the image registry, pull down the image, extract the imagestream and load it?

I ask because doing this with the frequency that the operator sync loop runs feels unnecessary when the image pullspec and / or digest of the release payload image have not changed. If the underlying library that we use to get the ImageStream from the release payload image caches what it pulled, then this is moot since we'll have a cache hit in that case. If not, we may want to do a bit of caching here in the operator.

It could be as simple as storing the image pullspec on the operator struct and determining whether the pullspec has changed. If the pullspec hasn't changed, then the ImageStream contained within the pullspec also hasn't changed, which means we could skip the more expensive parts of the sync operation.
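
A minimal sketch of that caching idea, assuming a new field on the Operator struct (both names are hypothetical):

// lastReleasePullspec would be a new string field on Operator. With a
// digest-pinned pullspec, an unchanged value means the embedded
// ImageStream cannot have changed, so the pull/extract can be skipped.
func (optr *Operator) releaseImageChanged(pullspec string) bool {
	if optr.lastReleasePullspec == pullspec {
		return false
	}
	optr.lastReleasePullspec = pullspec
	return true
}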

Contributor Author

That's the main point of https://github.com/openshift/machine-config-operator/pull/5476/changes#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R1315-R1318
The idea is that this inspection logic only runs at bootstrap time or when an update happens (the OSImageStream carries the controller version as part of an annotation; if it matches the current version, we skip the whole function).
That said, given that you ask, the code is not very explicit about this run-once nature, so maybe I should split this function in two: one that checks whether a refresh is required, with a clear descriptive comment stating that this sync is meant to run once, and another with the inspection/build logic. WDYT?

Member

I think splitting it up is a great idea and would help clarify this run-once nature you mentioned 😄

Contributor Author

@cheesesashimi I've addressed this, let me know if it's fine for you now.

return err
}

oscontainer, osextensionscontainer, err := optr.getOsImageURLs(optr.namespace)
Contributor

unrelated: It's a bit strange that we seem to pass the namespace through here, given that it is a very specific CM that only exists in the MCO namespace.

Contributor Author

Agree, I thought the same but I didn't address it. I'll leave this comment unresolved as a reminder to address it.

@djoshy
Contributor

djoshy commented Dec 10, 2025

Overall the changes make sense to me! Thanks for putting this together. I did have a question about how the defaulting behavior works. I see the defaultStream field defined in the image stream object on my test cluster:

  status:
    availableStreams:
    - name: rhel-9
      osExtensionsImage: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b2fae727bddb8c50a465ae302f58859b8c4af3348a94114650f3f8ea980836d4
      osImage: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eb1d1f17470cfadaec7dc4f717951d349b7e8e75f84ff22956941d69ce7246c2
    - name: rhel-10
      osExtensionsImage: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:12438c7e8586c4527be16633f48902ead60aa2a2d9305a2efbd09595165f3647
      osImage: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc0976717e1f525d4a2ee6320512bfa7d0d8420d3317ee7d65a7956c7792587f
    defaultStream: rhel-9

When the MCP spec doesn't define a certain stream, I assume the default above is to be used. When the pool does complete updating to the default, do we want this to be represented in the pool's status? In the current implementation, status will follow spec (and spec is empty in the default case).

// `OSImageStream` reference in the MCP status to be consistent with what is defined in the
// MCP spec
if osimagestream.IsFeatureEnabled(ctrl.fgHandler) {
status.OSImageStream = pool.Spec.OSImageStream
Contributor

I ran into an issue w/ status while testing this PR. When the pool is in transition between two streams (I tried both forwards and backwards b/w rhel9 <-> rhel10), the status field seems to lose its value. Once the pool has finished updating, it seems to reflect the spec value. Are we somehow setting this value to empty when the pool is applying a new stream across its nodes? I can see that this check is only triggered when all nodes are updated, but I would've expected the pool status to remain rhel9 until it has completely transitioned all nodes to rhel10.

Member

Hm, that does seem like an issue. 😕 I think ideally we want to persist the old image stream until the update is complete, consistent with what we see for the rendered MC version. It seems we may be hitting a corner case where the perfect storm of conditions is being met, possibly:

  • The image stream is unset from the MCP spec
  • A new rendered MC has not yet been created, so the MCP is not in an "Updating" state yet
  • We're hitting this conditional and the image stream is being removed because the pool thinks it's updated to something new

I'll have to think more about how to handle this corner case because at initial glance nothing is coming to mind.

Member

Thanks for helping figure this out, David! It seems that recreating the status here https://github.com/pablintino/machine-config-operator/blob/c684f2345859bbfe7ea615be2ef08cb06d703f16/pkg/controller/node/status.go#L214-L223 is erasing the image stream in the status. Adding the following in the non-updated condition (in the https://github.com/pablintino/machine-config-operator/blob/c684f2345859bbfe7ea615be2ef08cb06d703f16/pkg/controller/node/status.go#L272 else) should do the trick. Other solutions are definitely possible depending on your preference, but the idea is that in the non-updated sync case we need to explicitly keep pool.Status.OSImageStream unchanged.

		if osimagestream.IsFeatureEnabled(ctrl.fgHandler) {
			status.OSImageStream = pool.Status.OSImageStream
		}

// because it will look like an override when it shouldn't be. So don't take this out until you've solved that.
// And inject the osimageurl here
mcfg.Spec.OSImageURL = ctrlcommon.GetDefaultBaseImageContainer(config.ControllerConfigSpec)
// TODO: @pablintino do we need to do the same with the extensions?
Contributor

So for context: in the old old way we did this, we simply ignored the osImageURL field of any MachineConfig, whether rendered by default or provided by the user, and only allowed the configmap one to get rendered into the final machineconfig (when we did MergeMachineConfigs).

When we allowed overriding this field in 4.12(?) to allow for custom images, we inadvertently caused users to use old images if:

  1. they accidentally provided a user-machineconfig with an osImageURL in it
  2. the MCO at some point rendered a MachineConfig into any generated-by-controller MCs

We couldn't really do anything about 1 since it's the same UX as off-cluster layering, but (I think, my memory is hazy) we had this in place to prevent 2 from causing any problems. Now that I think about it, maybe we should have cleared the field entirely and not had a "base template machineconfig with OSImageURL in it", and just let the rendering logic apply the default one to the final rendered-MC. This would also have circumvented the issue with "rhel10 pools", since now there's an inheritance model. For instance, if I want to use an off-cluster layered OSImageURL for rhel9 but just the default OSImageURL for RHEL 10, what happens if I provide that MC?

The extensions image was never an issue because 1. we didn't have it in really old versions and 2. we never had any overrides for it or any base template MC rendering.

Contributor Author

Thanks for the context @yuqi-zhang. I've continued the conversation in Slack and I think we can continue in #5488 too.

for _, tag := range obj.Spec.Tags {
if tag.Name == "machine-config-operator" {
if imageStream != nil {
klog.Infof("multiple ImageStream found. Previous ImageStream %s replaced by %s", imageStream.Name, obj.Name)
Contributor

Out of curiosity, when would this be possible? I was under the impression each payload would only have one tagged with MCO?

Contributor Author

It's just a defensive measure; in theory we are only copying the payload ImageStream. I added the check to be 100% sure we load the one we expect, in case a change in the installer lets another IS sneak in.

// The template configs always match what's in controllerconfig because they get rendered from there,
// so the only way we get an override here is if the user adds something different
osImageURL := GetDefaultBaseImageContainer(&cconfig.Spec)
osImageURL := GetBaseImageContainer(&cconfig.Spec, imageStream)
Contributor

In the same vein as the comment below, this code is fine for setting the default from the stream, but the override logic just assumes any override is for both pools, since both rhel9 and rhel10 are still "workers". Wouldn't that cause an issue if someone provides a worker osImageURL, since both rhel9 and rhel10 would be using it?

Hmm, although looking at the must-gather, we do have that by default in 00-master and 00-worker, but it seems that testing this does work with RHEL 10, so maybe I'm misunderstanding that somewhere?

templateDir,
controllerConfig,
role,
osimagestream.TryGetOSImageStreamFromPoolListByPoolName(osImageStream, mcpPools, pool.Name),
Contributor

question: why does the non-base MC generation code (for example, the containerruntimeconfig here) need the osimagestream info?

Contributor Author

As per Slack conversation, yes, it's not needed and I've removed it :)
Thanks for the pointer.

@pablintino pablintino force-pushed the osimagestream-integration branch from c26a561 to c684f23 Compare December 11, 2025 08:28
@@ -1,15 +1,55 @@
package osimagestream

import (
Member

@cheesesashimi commented Dec 11, 2025

thought (non-blocking): I had a random thought yesterday about a radically different approach:

The crux of the problem is that there is an ImageStream file contained within the release payload image that we need to extract, inspect, and do something with. What if we didn't have to get the pull secrets, look up the release payload image, fetch it, extract the files, etc.? What if the Kubelet could do the heavy-lifting for us?

Assuming we could inject the release payload image pullspec into the MCO deployment object, we could do one of two things:

  1. Use an init container to copy the ImageStream to an emptyDir volume before the MCO starts.
  2. Use an Image Volume to mount the release payload image as a volume into the MCO container. This has the caveat that image volumes are not yet a stable / default feature.

In either case, the Kubelet handles the heavy-lifting of getting the release payload image and making its files available to the MCO. The MCO only needs to read the ImageStream file from disk and deserialize it. However, the difficulty with this approach lies in having the CVO inject the release payload pullspec into the MCO deployment object. The CVO can render templates, but it only does so during bootstrap. And even then, it appears to only do so for its own manifests (e.g., the /install and/or /bootstrap dirs). At runtime, the CVO appears to sync all of the manifests that are in the /release-manifests dir as-is.

(To be clear: This was just a random idea that I wanted to put someplace for posterity. I am not asking you to reconsider your approach or to change anything 😄)
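
For posterity too, a rough fragment of what option 1 could look like on the MCO deployment. Everything here is hypothetical, including the file path and how the release payload pullspec would get injected, which is exactly the unsolved CVO problem described above:

initContainers:
- name: copy-release-imagestream
  image: ${RELEASE_PAYLOAD_PULLSPEC}   # would need the CVO to template this in
  # path of the ImageStream file inside the payload is assumed here
  command: ["cp", "/release-manifests/image-references", "/var/lib/mco-release/"]
  volumeMounts:
  - name: release-imagestream
    mountPath: /var/lib/mco-release
volumes:
- name: release-imagestream
  emptyDir: {}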

Contributor Author

Good alternative mate, complicated, but good. Maybe we can include it in the alternatives section of the EP? WDYT? @dkhater-redhat

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2025
@pablintino pablintino force-pushed the osimagestream-integration branch 2 times, most recently from 1ed69a8 to ce9a0ca Compare December 15, 2025 15:24
@pablintino pablintino force-pushed the osimagestream-integration branch from ce9a0ca to d307302 Compare December 16, 2025 12:05
@pablintino pablintino force-pushed the osimagestream-integration branch from 4143926 to 015f071 Compare December 17, 2025 07:32
@pablintino pablintino changed the title MCO-1961: Introduce the OSImageStream controller MCO-1961: Integrate OSImageStream into MCP and sync logic Dec 17, 2025
@pablintino pablintino force-pushed the osimagestream-integration branch 3 times, most recently from 1cfd581 to 9b45abc Compare December 17, 2025 07:47
@sergiordlr
Contributor

sergiordlr commented Dec 17, 2025

We have tested dual-stream in IPI on AWS HA and SNO.

machine-config-daemon binaries

We saw that there is one binary for rhel8 and another one for rhel9, but we didn't see any binary for rhel10. Is that intended?

/bin/machine-config-daemon
/bin/machine-config-daemon.rhel8

Basic functionality

No issues found in SNO and HA regarding the basic functionality of enabling rhel9 and rhel10 and switching between them. The right new image is applied to the nodes.

Pause is working as expected too.

Scaling up nodes

Same when scaling up new nodes: they join the cluster using the right rhel image.

Custom pools

No issues found either. Dual-stream works as intended when we use custom pools.

Day1 installation

Our day1 installation resulted in the master pool being degraded. These are the manifests that we used:

$ cat 99-master-pool-dual-stream.yml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/master: ""
  name: master
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  osImageStream:
    name: rhel-10

$ cat 99-worker-pool-dual-stream.yml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  osImageStream:
    name: rhel-10

This is the degraded message:

  - lastTransitionTime: "2025-12-17T12:52:24Z"
    message: 'Node ip-10-0-41-183.us-east-2.compute.internal is reporting: "bootstrap
      generated MC rendered-master-d6dd0ee1be308b3b387864d13cb1b976 and in-cluster
      generated MC rendered-master-386f53cb906f9ac1d7f8225b1eb96e5d for this node
      do not match. Please check machine-config-daemon logs or the on disk diff file
      located at /etc/machine-config-daemon/bootstrapconfigdiff", Node ip-10-0-64-152.us-east-2.compute.internal
      is reporting: "bootstrap generated MC rendered-master-d6dd0ee1be308b3b387864d13cb1b976
      and in-cluster generated MC rendered-master-386f53cb906f9ac1d7f8225b1eb96e5d
      for this node do not match. Please check machine-config-daemon logs or the on
      disk diff file located at /etc/machine-config-daemon/bootstrapconfigdiff", Node
      ip-10-0-9-119.us-east-2.compute.internal is reporting: "bootstrap generated
      MC rendered-master-d6dd0ee1be308b3b387864d13cb1b976 and in-cluster generated
      MC rendered-master-386f53cb906f9ac1d7f8225b1eb96e5d for this node do not match.
      Please check machine-config-daemon logs or the on disk diff file located at
      /etc/machine-config-daemon/bootstrapconfigdiff"'
    reason: ""
    status: "True"
    type: Degraded

The reason for the degradation was a mismatch between the bootstrap MC and the first in-cluster MC. We included the bootstrapconfigdiff in the must-gather file that can be found at: https://issues.redhat.com/browse/MCO-1915?focusedId=28689793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28689793

Extensions

Our cluster was degraded when we enabled the rhel10 stream and tried to install extensions. This is the degraded message:

  - lastTransitionTime: "2025-12-17T13:59:22Z"
    message: 'Node ip-10-0-24-255.us-east-2.compute.internal is reporting: "Node ip-10-0-24-255.us-east-2.compute.internal
      upgrade failure. error running rpm-ostree update --install NetworkManager-libreswan
      --install crun-wasm --install kata-containers --install kernel-devel --install
      kernel-headers --install krb5-workstation --install libkadm5 --install libreswan
      --install sysstat --install usbguard: error: Packages not found: NetworkManager-libreswan,
      crun-wasm, libreswan\n: exit status 1", Node ip-10-0-24-255.us-east-2.compute.internal
      is reporting: "error running rpm-ostree update --install NetworkManager-libreswan
      --install crun-wasm --install kata-containers --install kernel-devel --install
      kernel-headers --install krb5-workstation --install libkadm5 --install libreswan
      --install sysstat --install usbguard: error: Packages not found: NetworkManager-libreswan,
      crun-wasm, libreswan\n: exit status 1"'
    reason: ""
    status: "True"
    type: Degraded

Note as well that when the MCP should be degraded, it is not immediately marked as degraded. It seems to be an issue similar to https://issues.redhat.com/browse/OCPBUGS-62984

tall kernel-devel --install kernel-headers --install krb5-workstation --install libkadm5 --install libreswan --install sysstat --install usbguard: error: Packages not found: NetworkManager-libreswan, crun-wasm, libreswan\n: exit status 1"
I1217 13:59:17.442263    4121 event.go:377] Event(v1.ObjectReference{Kind:"Node", Namespace:"openshift-machine-config-operator", Name:"ip-10-0-24-255.us-east-2.compute.internal", UID:"1682d36f-f5a3-415b-9207-6a5072a7f245", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RemoveSigtermProtection' Removing SIGTERM protection
I1217 13:59:17.452091    4121 daemon.go:644] Error syncing node ip-10-0-24-255.us-east-2.compute.internal (retries 10): error running rpm-ostree update --install NetworkManager-libreswan --install crun-wasm --install kata-containers --install kernel-devel --install kernel-headers --install krb5-workstation --install libkadm5 --install libreswan --install sysstat --install usbguard: error: Packages not found: NetworkManager-libreswan, crun-wasm, libreswan
: exit status 1
I1217 13:59:17.452179    4121 daemon.go:780] Transitioned from state: Done -> Working
I1217 13:59:17.465475    4121 command_runner.go:24] Running captured: rpm-ostree kargs
We can see in the logs that, after saying we are going to degrade the node, what we actually do is Transitioned from state: Done -> Working

A must-gather file with the cluster's information can be found in: https://issues.redhat.com/browse/MCO-1915?focusedId=28689793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28689793

We hit what seems to be another issue, but we are not sure since it is kind of masked by the previous extensions degradation.

This is the scenario:

1- Enable rhel9
2- Install extensions
3- Enable rhel10

What's the expected behaviour here? The extensions should be re-installed, right, since the packages change?

The behaviour that we see now is that no extension is installed and the rebase command fails:

W1217 15:02:06.929262 6623 update.go:2639] Failed to update OS to registry.build11.ci.openshift.org/ci-ln-wyf5wtb/stable@sha256:18d09e698ad10cccfa9f18ebd99c4490d4276afd3c2adad3112a4652cceca0b4 (will retry): error running rpm-ostree rebase --experimental ostree-unverified-registry:registry.build11.ci.openshift.org/ci-ln-wyf5wtb/stable@sha256:18d09e698ad10cccfa9f18ebd99c4490d4276afd3c2adad3112a4652cceca0b4: error: Packages not found: NetworkManager-libreswan, crun-wasm, libreswan

Kernel types

If we configure the realtime kernel in rhel10, the pool is degraded. In rhel9 it is deployed without problems.

Steps to reproduce this issue:

  1. Create infra pool with 1 node
  2. Install realtime kernel in this pool
  3. Enable rhel10 in the pool
  - lastTransitionTime: "2025-12-18T14:47:37Z"
    message: 'Node ip-10-0-25-231.us-east-2.compute.internal is reporting: "Node ip-10-0-25-231.us-east-2.compute.internal
      upgrade failure. error removing staged deployment: [error running rpm-ostree
      cleanup -p: Failed to start rpm-ostreed.service: Operation for unit rpm-ostreed.service
      refused, D-Bus is shutting down.\nSee system logs and ''systemctl status rpm-ostreed.service''
      for details.\nerror: cleanup: Loading sysroot: exit status: 1\n: exit status
      1, error running rpm-ostree override reset kernel kernel-core kernel-modules
      kernel-modules-core kernel-modules-extra --uninstall kernel-rt-core --uninstall
      kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra:
      : signal: terminated]", Node ip-10-0-25-231.us-east-2.compute.internal is reporting:
      "unexpected on-disk state validating against rendered-infra-2280494f59692b9e51fafbadf216cc59:
      error running rpm-ostree kargs: exit status 1\nFailed to start rpm-ostreed.service:
      Operation for unit rpm-ostreed.service refused, D-Bus is shutting down.\nSee
      system logs and ''systemctl status rpm-ostreed.service'' for details.\nerror:
      Loading sysroot: exit status: 1\n"'
    reason: ""
    status: "True"
    type: Degraded

If what we do is

  1. Create empty infra pool
  2. Create a MC to deploy realtime kernel in the infra pool
  3. Enable rhel10 in the infra pool
  4. Move a node to the infra pool

The degradation reported is different

  - lastTransitionTime: "2025-12-18T15:32:57Z"
    message: 'Node ip-10-0-0-85.us-east-2.compute.internal is reporting: "Node ip-10-0-0-85.us-east-2.compute.internal
      upgrade failure. error running rpm-ostree override remove kernel kernel-core
      kernel-modules kernel-modules-core kernel-modules-extra --install kernel-rt-core
      --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm:
      error: Packages not found: kernel-rt-kvm\n: exit status 1", Node ip-10-0-0-85.us-east-2.compute.internal
      is reporting: "error running rpm-ostree override remove kernel kernel-core kernel-modules
      kernel-modules-core kernel-modules-extra --install kernel-rt-core --install
      kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm:
      error: Packages not found: kernel-rt-kvm\n: exit status 1"'
    reason: ""
    status: "True"
    type: Degraded

@pablintino pablintino force-pushed the osimagestream-integration branch from 9b45abc to 158e485 Compare December 18, 2025 11:02
@sergiordlr
Contributor

sergiordlr commented Dec 18, 2025

We hit a problem upgrading a cluster using rhel-10.

[fedora@preserve-sregidor-work ~]$ l mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-f4e1b047480dac70e625d98f1f95cbe8    True      False      False      1              1                   1                     0                      3h11m
master   rendered-master-6b42861b1be32bcd1396643be3f05bf7   True      False      False      3              3                   3                     0                      5h23m
worker   rendered-worker-2281be5d524c648dc85291327cee6484   True      False      False      2              2                   2                     0                      5h23m
$ oc get co machine-config
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.22.0-0-2025-12-17-115459-test-ci-ln-wyf5wtb-latest   True        True          True       5h22m   Unable to apply 4.22.0-0-2025-12-18-084243-test-ci-ln-q4t3bxb-latest: error during syncRequiredMachineConfigPools: [context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: osImageURL mismatch for master in rendered-master-6b42861b1be32bcd1396643be3f05bf7 expected: registry.build11.ci.openshift.org/ci-ln-wyf5wtb/stable@sha256:a4b51f91be40177d9b5666ef6cf3232b44bf1151ce37179ce44dc476d89a9cb6 got: registry.build11.ci.openshift.org/ci-ln-wyf5wtb/stable@sha256:18d09e698ad10cccfa9f18ebd99c4490d4276afd3c2adad3112a4652cceca0b4, retrying]

The fix seems to be already pushed. We will test it again.

Updated status:

With the new commit, the upgrade was successful using worker and master rhel10 and infra rhel9. The issue was fixed.

This commit completes the OSImageStream integration by updating
MachineConfigPool status to track the active stream and implementing
the sync logic in the MCO operator.

Key changes:

- Set MachineConfigPool status.osImageStream to match spec when the
  pool is fully updated and the OSStreams feature gate is enabled.
  This provides visibility into which OS image stream is actively
  running on each pool.

- Implement OSImageStream sync logic in the operator to check if a
  build is required before performing expensive image inspection, and
  handle CR creation/updates.

- Extract RenderedMachineConfigPrefix constant to controller/common
  for reuse across template and render controllers.

Co-authored-by: Zack Zlotnik <[email protected]>
Co-authored-by: Isabella Janssen <[email protected]>
@pablintino pablintino force-pushed the osimagestream-integration branch from 158e485 to 668ed84 Compare December 18, 2025 16:30
@pablintino
Contributor Author

/test e2e-gcp-op-2of2

Member

@isabella-janssen left a comment

/lgtm

Previous comments needing to be addressed have been; others are tracked as follow-up work in Jira (MCO-1914). OKD should be good as well (ref).

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 18, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen, pablintino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:
  • OWNERS [isabella-janssen,pablintino]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pablintino
Contributor Author

/verified by @sergiordlr

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 18, 2025
@openshift-ci-robot
Contributor

@pablintino: This PR has been marked as verified by @sergiordlr.

Details

In response to this:

/verified by @sergiordlr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy
Contributor

djoshy commented Dec 18, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 18, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 09437b9 and 2 for PR HEAD 668ed84 in total

@pablintino
Contributor Author

/test e2e-hypershift

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 2c0c8d2 and 1 for PR HEAD 668ed84 in total

@pablintino
Contributor Author

/test e2e-hypershift

@yuqi-zhang
Contributor

/override ci/prow/e2e-hypershift

Shouldn't affect hypershift in its current form

@openshift-ci
Contributor

openshift-ci bot commented Dec 19, 2025

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift

Details

In response to this:

/override ci/prow/e2e-hypershift

Shouldn't affect hypershift in its current form

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci bot commented Dec 19, 2025

@pablintino: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/bootstrap-unit
Commit: 668ed84
Details: link
Required: false
Rerun command: /test bootstrap-unit

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit bf7ba38 into openshift:main Dec 19, 2025
13 of 14 checks passed
@djoshy
Contributor

djoshy commented Dec 19, 2025

/woof

@openshift-ci
Contributor

openshift-ci bot commented Dec 19, 2025

@djoshy: dog image

Details

In response to this:

/woof

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
