
Conversation

@barbacbd
Contributor

If masters are schedulable, then all nodes, not just non-control-plane nodes, should
be taken into account when deciding what we tell OLM operators about the topology.

Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.

Expected: infrastructureTopology = HighlyAvailableTopologyMode
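
A minimal Go sketch of the intended behavior (illustrative only: the names numOfWorkers, controlPlaneReplicas, and mastersSchedulable follow the installer's conventions, but this is not the exact installer code):

// Count schedulable control-plane nodes as infrastructure capacity
// before choosing the topology reported to OLM operators.
schedulableNodes := numOfWorkers
if mastersSchedulable {
	schedulableNodes += controlPlaneReplicas
}
if schedulableNodes > 1 {
	infrastructureTopology = configv1.HighlyAvailableTopologyMode
} else {
	infrastructureTopology = configv1.SingleReplicaTopologyMode
}

With 3 schedulable control-plane nodes and 1 worker, schedulableNodes is 4, so the sketch yields HighlyAvailableTopologyMode.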

@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels on Nov 13, 2025.
@openshift-ci-robot
Contributor

@barbacbd: This pull request references Jira Issue OCPBUGS-64793, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

If masters are schedulable, then all nodes, not just non-control-plane nodes, should
be taken into account when deciding what we tell OLM operators about the topology.

Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.

Expected: infrastructureTopology = HighlyAvailableTopologyMode

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@barbacbd
Contributor Author

/cc @zaneb

@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mtulio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@barbacbd
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) on Nov 13, 2025.
@openshift-ci-robot
Contributor

@barbacbd: This pull request references Jira Issue OCPBUGS-64793, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot removed the jira/invalid-bug label on Nov 13, 2025.
@openshift-ci bot requested a review from gpei on November 13, 2025 at 15:47.
@tthvo
Member

tthvo commented Nov 13, 2025

/cc @sadasu

Comment on lines 42 to 43
case 1:
infrastructureTopology = configv1.SingleReplicaTopologyMode
Member

💭 Is there a case where control-plane nodes are non-schedulable in the 3 control-plane + 1 worker setup?

Member

In that case, infrastructureTopology is single-replica, right?

Contributor

@sadasu commented Nov 13, 2025

Currently when numOfWorker == 1, masters are not schedulable. See https://github.com/openshift/installer/blob/main/pkg/asset/manifests/scheduler.go#L75

So, even by setting infrastructureTopology to a different value, we will not have enough nodes to have a successful cluster.

Member

oh right, then the bug is not valid? I actually have 0 clue now 😅

Member

Whoops, I missed your comments in the bug card. The bug makes sense to me now 🙏

Member

Is there a case where control-plane nodes are non-schedulable in the 3 control-plane + 1 worker setup?

Not a supported case, anyway. This cluster will fail to complete installation because Ingress won't start.
It can be rescued on day 2, by either adding another worker or making control plane nodes schedulable, but in either case you end up with a cluster that should have HA InfrastructureTopology.

If masters are schedulable, then all nodes, not just non-control-plane nodes, should
be taken into account when deciding what we tell OLM operators about the topology.

Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.

Expected: infrastructureTopology = HighlyAvailableTopologyMode
installConfigBuildOptions: []icOption{icBuild.forNone()},
controlPlaneTopology: configv1.HighlyAvailableTopologyMode,
infrastructureTopology: configv1.SingleReplicaTopologyMode,
infrastructureTopology: configv1.HighlyAvailableTopologyMode,
Contributor

@sadasu commented Nov 14, 2025

When there is a single control-plane node and 0 compute nodes, we should end up with controlPlaneTopology==configv1.SingleReplicaTopologyMode and infrastructureTopology==configv1.SingleReplicaTopologyMode. I don't see a test for that.
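
A hypothetical table entry for that case, following the field names visible in the diff above (the icBuild helper usage and the surrounding test scaffolding are assumptions):

{
	// Hypothetical: single control-plane node, zero compute nodes.
	installConfigBuildOptions: []icOption{icBuild.forNone()},
	controlPlaneTopology:      configv1.SingleReplicaTopologyMode,
	infrastructureTopology:    configv1.SingleReplicaTopologyMode,
},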

Contributor Author

Contributor

Thanks!

@barbacbd
Contributor Author

/retest-required

@tthvo
Copy link
Member

tthvo commented Nov 14, 2025

/test e2e-aws-ovn-upi e2e-aws-ovn-single-node

Comment on lines +31 to 38
if numOfWorkers < 2 {
// Control planes are schedulable when there are < 2 workers.
// Adjust the number of schedulable nodes here to reflect that.
numOfWorkers += controlPlaneReplicas
}

switch numOfWorkers {
case 0:
Member

@tthvo commented Nov 14, 2025

By adding this check, numOfWorkers is never 0, because the number of control-plane nodes cannot be 0, right? This makes case 0 unreachable.

case 0:
// Two node deployments with 0 workers mean that the control plane nodes are treated as workers
// in that situation we have decided that it is appropriate to set the infrastructureTopology to HA.
// All other configuration for different worker count are respected with the original intention.
if controlPlaneTopology == configv1.DualReplicaTopologyMode || controlPlaneTopology == configv1.HighlyAvailableArbiterMode {
infrastructureTopology = configv1.HighlyAvailableTopologyMode
} else {
infrastructureTopology = controlPlaneTopology
}

IIUC, it is OK. When deploying a 2-node or arbiter cluster, the infrastructure topology should be configv1.HighlyAvailableTopologyMode, which is already handled by the default: case.

/cc @eggfoobar @jaypoulz
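
To make that concrete, a walk-through of the new check with assumed values for a two-node deployment:

// Two-node cluster (assumed values): 2 control-plane replicas, 0 workers.
numOfWorkers, controlPlaneReplicas := 0, 2
if numOfWorkers < 2 {
	numOfWorkers += controlPlaneReplicas // numOfWorkers is now 2
}
// The switch matches neither case 0 nor case 1, so the default branch
// yields configv1.HighlyAvailableTopologyMode, the same result the old
// DualReplica/Arbiter special case produced.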

Member

I think this change is effectively applying the same rules that we use for the 2-node/arbiter case to everything. So it's valid, but the comments and code below that suggest there's still a special case that applies only to 2-node/arbiter clusters are now misleading. Better to delete them, I think.
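
A possible simplified form along those lines (a sketch of the suggestion, not the committed code):

if numOfWorkers < 2 {
	// Schedulable control-plane nodes count toward infrastructure capacity.
	numOfWorkers += controlPlaneReplicas
}
switch numOfWorkers {
case 1:
	infrastructureTopology = configv1.SingleReplicaTopologyMode
default:
	// numOfWorkers is at least 1 here (controlPlaneReplicas >= 1), so this
	// branch covers every HA case, including 2-node and arbiter clusters.
	infrastructureTopology = configv1.HighlyAvailableTopologyMode
}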

@openshift-ci bot requested review from eggfoobar and jaypoulz on November 14, 2025 at 23:00.
Member

@tthvo left a comment

Based on the comments, the change worked for me. The question I have above is really just a matter of cleanup (i.e., removing the unreachable case if needed).

$ yq .controlPlane install-config.yaml 
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3

$ yq .compute[0] install-config.yaml 
architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 1

$ ./openshift-install create manifests --dir=.
...output-omitted...

$ yq .spec.mastersSchedulable manifests/cluster-scheduler-02-config.yml 
true

$ yq .status.infrastructureTopology manifests/cluster-infrastructure-02-config.yml 
HighlyAvailable

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Nov 14, 2025.
@openshift-ci
Contributor

openshift-ci bot commented Nov 15, 2025

@barbacbd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/okd-scos-e2e-aws-ovn
Commit: 7181c84
Details: link
Required: false
Rerun command: /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gpei
Contributor

gpei commented Nov 18, 2025

In pre-merge testing, I noticed another change in behavior. When we specify a single control-plane node and 1 compute node, the control plane will also be set to schedulable. This then results in a cluster with
controlPlaneTopology: SingleReplica and infrastructureTopology: HighlyAvailable. Is this expected?
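
For reference, the arithmetic behind that result, assuming the new check applies here (values illustrative):

// 1 control-plane replica, 1 worker: gpei's scenario, assumed values.
numOfWorkers, controlPlaneReplicas := 1, 1
if numOfWorkers < 2 {
	numOfWorkers += controlPlaneReplicas // numOfWorkers is now 2
}
// The default branch selects HighlyAvailableTopologyMode even though
// controlPlaneTopology is SingleReplica, which is the behavior observed here.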
