
Conversation

@barbacbd
Contributor

If masters are schedulable, then all nodes, not just non-control-plane nodes, should
be taken into account when deciding what we tell OLM operators about the topology.

Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.

Expected: infrastructureTopology = HighlyAvailableTopologyMode
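
A minimal Go sketch of the intended behavior (illustrative only: the names numOfWorkers, controlPlaneReplicas, and mastersSchedulable follow the installer's conventions, but this is not the exact installer code):

// Count schedulable control-plane nodes as infrastructure capacity
// before choosing the topology reported to OLM operators.
schedulableNodes := numOfWorkers
if mastersSchedulable {
	schedulableNodes += controlPlaneReplicas
}
if schedulableNodes > 1 {
	infrastructureTopology = configv1.HighlyAvailableTopologyMode
} else {
	infrastructureTopology = configv1.SingleReplicaTopologyMode
}

With 3 schedulable control-plane nodes and 1 worker, schedulableNodes is 4, so the sketch yields HighlyAvailableTopologyMode.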

@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels on Nov 13, 2025.
@openshift-ci-robot
Contributor

@barbacbd: This pull request references Jira Issue OCPBUGS-64793, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

If masters are schedulable, then all nodes, not just non-control-plane nodes, should
be taken into account when deciding what we tell OLM operators about the topology.

Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.

Expected: infrastructureTopology = HighlyAvailableTopologyMode

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@barbacbd
Contributor Author

/cc @zaneb

@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mtulio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@barbacbd
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) on Nov 13, 2025.
@openshift-ci-robot
Contributor

@barbacbd: This pull request references Jira Issue OCPBUGS-64793, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot removed the jira/invalid-bug label on Nov 13, 2025.
@openshift-ci bot requested a review from gpei on November 13, 2025 at 15:47.
@tthvo
Member

tthvo commented Nov 13, 2025

/cc @sadasu

Comment on lines 42 to 43
case 1:
infrastructureTopology = configv1.SingleReplicaTopologyMode
Member

💭 Is there a case where control-plane nodes are non-schedulable in the 3 control-plane + 1 worker setup?

Member

In that case, infrastructureTopology is single-replica, right?

Contributor

@sadasu commented Nov 13, 2025

Currently when numOfWorker == 1, masters are not schedulable. See https://github.com/openshift/installer/blob/main/pkg/asset/manifests/scheduler.go#L75

So, even by setting infrastructureTopology to a different value, we will not have enough nodes to have a successful cluster.

Member

oh right, then the bug is not valid? I actually have 0 clue now 😅

Member

Whoops, I missed your comments in the bug card. The bug makes sense to me now 🙏

Member

Is there a case where control-plane nodes are non-schedulable in the 3 control-plane + 1 worker setup?

Not a supported case, anyway. This cluster will fail to complete installation because Ingress won't start.
It can be rescued on day 2, by either adding another worker or making control plane nodes schedulable, but in either case you end up with a cluster that should have HA InfrastructureTopology.

If masters are schedulable, then all nodes, not just non-control-plane nodes, should
be taken into account when deciding what we tell OLM operators about the topology.

Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.

Expected: infrastructureTopology = HighlyAvailableTopologyMode
installConfigBuildOptions: []icOption{icBuild.forNone()},
controlPlaneTopology: configv1.HighlyAvailableTopologyMode,
infrastructureTopology: configv1.SingleReplicaTopologyMode,
infrastructureTopology: configv1.HighlyAvailableTopologyMode,
Contributor

@sadasu commented Nov 14, 2025

When there is a single control-plane node and 0 compute nodes, we should end up with controlPlaneTopology==configv1.SingleReplicaTopologyMode and infrastructureTopology==configv1.SingleReplicaTopologyMode. I don't see a test for that.
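
A hypothetical table entry for that case, following the field names visible in the diff above (the icBuild helper usage and the surrounding test scaffolding are assumptions):

{
	// Hypothetical: single control-plane node, zero compute nodes.
	installConfigBuildOptions: []icOption{icBuild.forNone()},
	controlPlaneTopology:      configv1.SingleReplicaTopologyMode,
	infrastructureTopology:    configv1.SingleReplicaTopologyMode,
},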

Contributor Author

Contributor

Thanks!

@barbacbd
Contributor Author

/retest-required

@tthvo
Copy link
Member

tthvo commented Nov 14, 2025

/test e2e-aws-ovn-upi e2e-aws-ovn-single-node

Comment on lines +31 to 38
if numOfWorkers < 2 {
// Control planes are schedulable when there are < 2 workers.
// Adjust the number of schedulable nodes here to reflect that.
numOfWorkers += controlPlaneReplicas
}

switch numOfWorkers {
case 0:
Member

@tthvo commented Nov 14, 2025

By adding this check, numOfWorkers is never 0, because the number of control-plane nodes cannot be 0, right? This makes case 0 unreachable.

case 0:
// Two node deployments with 0 workers mean that the control plane nodes are treated as workers
// in that situation we have decided that it is appropriate to set the infrastructureTopology to HA.
// All other configuration for different worker count are respected with the original intention.
if controlPlaneTopology == configv1.DualReplicaTopologyMode || controlPlaneTopology == configv1.HighlyAvailableArbiterMode {
infrastructureTopology = configv1.HighlyAvailableTopologyMode
} else {
infrastructureTopology = controlPlaneTopology
}

IIUC, it is OK. When deploying a 2-node or arbiter cluster, the infrastructure topology should be configv1.HighlyAvailableTopologyMode, which is already handled by the default: case.

/cc @eggfoobar @jaypoulz
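
To make that concrete, a walk-through of the new check with assumed values for a two-node deployment:

// Two-node cluster (assumed values): 2 control-plane replicas, 0 workers.
numOfWorkers, controlPlaneReplicas := 0, 2
if numOfWorkers < 2 {
	numOfWorkers += controlPlaneReplicas // numOfWorkers is now 2
}
// The switch matches neither case 0 nor case 1, so the default branch
// yields configv1.HighlyAvailableTopologyMode, the same result the old
// DualReplica/Arbiter special case produced.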

Member

I think this change is effectively applying the same rules that we use for the 2-node/arbiter case to everything. So it's valid, but the comments and code below that suggest there's still a special case that applies only to 2-node/arbiter clusters are now misleading. Better to delete them, I think.
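
A possible simplified form along those lines (a sketch of the suggestion, not the committed code):

if numOfWorkers < 2 {
	// Schedulable control-plane nodes count toward infrastructure capacity.
	numOfWorkers += controlPlaneReplicas
}
switch numOfWorkers {
case 1:
	infrastructureTopology = configv1.SingleReplicaTopologyMode
default:
	// numOfWorkers is at least 1 here (controlPlaneReplicas >= 1), so this
	// branch covers every HA case, including 2-node and arbiter clusters.
	infrastructureTopology = configv1.HighlyAvailableTopologyMode
}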

@openshift-ci bot requested review from eggfoobar and jaypoulz on November 14, 2025 at 23:00.
Member

@tthvo left a comment

Based on the comments, the change worked for me. The question I have above is really just a matter of cleanup (i.e., removing the unreachable case if needed).

$ yq .controlPlane install-config.yaml 
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3

$ yq .compute[0] install-config.yaml 
architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 1

$ ./openshift-install create manifests --dir=.
...output-omitted...

$ yq .spec.mastersSchedulable manifests/cluster-scheduler-02-config.yml 
true

$ yq .status.infrastructureTopology manifests/cluster-infrastructure-02-config.yml 
HighlyAvailable

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Nov 14, 2025.
@openshift-ci
Contributor

openshift-ci bot commented Nov 15, 2025

@barbacbd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/okd-scos-e2e-aws-ovn
Commit: 7181c84
Details: link
Required: false
Rerun command: /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gpei
Contributor

gpei commented Nov 18, 2025

In pre-merge testing, I noticed another change in behavior. When we specify a single control-plane node and 1 compute node, the control plane will also be set to schedulable. This then results in a cluster with
controlPlaneTopology: SingleReplica and infrastructureTopology: HighlyAvailable. Is this expected?
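
For reference, the arithmetic behind that result, assuming the new check applies here (values illustrative):

// 1 control-plane replica, 1 worker: gpei's scenario, assumed values.
numOfWorkers, controlPlaneReplicas := 1, 1
if numOfWorkers < 2 {
	numOfWorkers += controlPlaneReplicas // numOfWorkers is now 2
}
// The default branch selects HighlyAvailableTopologyMode even though
// controlPlaneTopology is SingleReplica, which is the behavior observed here.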
