OCPBUGS-64793: Infrastructure Topology calculation is wrong #10077
Conversation
barbacbd commented on Nov 13, 2025
@barbacbd: This pull request references Jira Issue OCPBUGS-64793, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @zaneb
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/jira refresh
@barbacbd: This pull request references Jira Issue OCPBUGS-64793, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @sadasu |
Force-pushed: bc127b2 to d40a3b8
case 1:
    infrastructureTopology = configv1.SingleReplicaTopologyMode
💭 Is there a case where control plane nodes are non-schedulable in the 3 ctrlplane + 1 worker setup?
In that case, infrastructureTopology is single-replica, right?
Currently, when numOfWorker == 1, masters are not schedulable. See https://github.com/openshift/installer/blob/main/pkg/asset/manifests/scheduler.go#L75
So even if we set infrastructureTopology to a different value, we will not have enough nodes for a successful cluster.
oh right, then the bug is not valid? I actually have 0 clue now 😅
Whoops, I missed your comments in the bug card. The bug makes sense to me now 🙏
Is there a case where control plane nodes are non-schedulable in the 3 ctrlplane + 1 worker setup?
Not a supported case, anyway. This cluster will fail to complete installation because Ingress won't start.
It can be rescued on day 2, by either adding another worker or making control plane nodes schedulable, but in either case you end up with a cluster that should have HA InfrastructureTopology.
If masters are schedulable, then all nodes, not just non-control-plane nodes, should be taken into account when deciding what we tell OLM operators about the topology. Example:
1. Deploy a cluster with 3 control-plane nodes (schedulable) and 1 worker node.
2. Inspect the infrastructureTopology attribute of the cluster.
Expected: infrastructureTopology = HighlyAvailableTopologyMode
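To make the rule concrete, here is a minimal self-contained sketch, assuming hypothetical names (inferInfrastructureTopology, mastersSchedulable, and the parameter names are illustrative, not the installer's actual identifiers; the real logic lives in pkg/asset/manifests/topologies.go):

package manifests

import (
    configv1 "github.com/openshift/api/config/v1"
)

// inferInfrastructureTopology sketches the rule described above: when
// control-plane nodes are schedulable they also run regular workloads,
// so they count toward the nodes backing infrastructure components.
func inferInfrastructureTopology(controlPlaneReplicas, numOfWorkers int64, mastersSchedulable bool) configv1.TopologyMode {
    schedulableNodes := numOfWorkers
    if mastersSchedulable {
        schedulableNodes += controlPlaneReplicas
    }
    if schedulableNodes <= 1 {
        return configv1.SingleReplicaTopologyMode
    }
    return configv1.HighlyAvailableTopologyMode
}

For the example above (3 schedulable control-plane nodes + 1 worker), this yields HighlyAvailableTopologyMode rather than SingleReplicaTopologyMode.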
Force-pushed: d40a3b8 to 7181c84
  installConfigBuildOptions: []icOption{icBuild.forNone()},
  controlPlaneTopology: configv1.HighlyAvailableTopologyMode,
- infrastructureTopology: configv1.SingleReplicaTopologyMode,
+ infrastructureTopology: configv1.HighlyAvailableTopologyMode,
When there is a single control-plane node and 0 compute nodes, we should end up with controlPlaneTopology==configv1.SingleReplicaTopologyMode and infrastructureTopology==configv1.SingleReplicaTopologyMode. I don't see a test for that.
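For reference, an entry for that combination in the test table could look roughly like this (field names copied from the entries quoted above; the icBuild.forSNO() helper name is a hypothetical placeholder, not necessarily the actual test helper):

{
    // Hypothetical table entry: 1 control-plane node, 0 compute nodes.
    installConfigBuildOptions: []icOption{icBuild.forSNO()},
    controlPlaneTopology:      configv1.SingleReplicaTopologyMode,
    infrastructureTopology:    configv1.SingleReplicaTopologyMode,
},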
https://github.com/openshift/installer/blob/main/pkg/asset/manifests/topologies_test.go#L31 This test should take care of that.
Thanks!
/retest-required
/test e2e-aws-ovn-upi e2e-aws-ovn-single-node
if numOfWorkers < 2 {
    // Control planes are schedulable when there are < 2 workers.
    // Adjust the number of schedulable nodes here to reflect.
    numOfWorkers += controlPlaneReplicas
}

switch numOfWorkers {
case 0:
By adding this check, numOfWorkers is never 0 because the number of control-plane nodes cannot be 0, right? This makes case 0 unreachable.
installer/pkg/asset/manifests/topologies.go
Lines 33 to 41 in 4180662
case 0:
    // Two node deployments with 0 workers mean that the control plane nodes are treated as workers
    // in that situation we have decided that it is appropriate to set the infrastructureTopology to HA.
    // All other configuration for different worker count are respected with the original intention.
    if controlPlaneTopology == configv1.DualReplicaTopologyMode || controlPlaneTopology == configv1.HighlyAvailableArbiterMode {
        infrastructureTopology = configv1.HighlyAvailableTopologyMode
    } else {
        infrastructureTopology = controlPlaneTopology
    }
IIUC, it is OK. When deploying a 2-node or arbiter cluster, the infrastructure topology should be configv1.HighlyAvailableTopologyMode, which is already handled by the default: case.
/cc @eggfoobar @jaypoulz
I think this change is effectively applying the same rules that we use for the 2-node/arbiter case to everything. So it's valid, but the comments and code below, which suggest there's still a special case that applies only to 2-node/arbiter clusters, are now misleading. Better to delete them, I think.
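For concreteness, a rough sketch of what the switch could look like with that special case deleted (arm contents inferred from the snippets quoted in this conversation, not taken from an actual commit):

// With control-plane replicas already folded into numOfWorkers above,
// numOfWorkers can no longer be 0, so the 2-node/arbiter special case
// and the unreachable case 0 can both be dropped.
switch numOfWorkers {
case 1:
    infrastructureTopology = configv1.SingleReplicaTopologyMode
default:
    infrastructureTopology = configv1.HighlyAvailableTopologyMode
}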
tthvo left a comment
Based on the comment, the change worked for me. My question above is really just a matter of cleanup (i.e. remove the unreachable case if needed).
$ yq .controlPlane install-config.yaml
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3
$ yq .compute[0] install-config.yaml
architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 1
$ ./openshift-install create manifests --dir=.
...output-omitted...
$ yq .spec.mastersSchedulable manifests/cluster-scheduler-02-config.yml
true
$ yq .status.infrastructureTopology manifests/cluster-infrastructure-02-config.yml
HighlyAvailable
/lgtm
@barbacbd: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
In pre-merge testing, I noticed another change in behavior. When we specify a single control-plane node and 1 compute node, the control plane will also be set to schedulable. Then this results in a cluster with
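For context, tracing the quoted topologies.go logic for that configuration (an illustrative walk-through based only on the snippets above, not installer output):

// 1 control-plane node + 1 compute node:
//   numOfWorkers = 1                      // from the compute pool replicas
//   numOfWorkers < 2, so control-plane replicas are added: 1 + 1 = 2
//   the switch hits the default arm rather than case 1
//   => infrastructureTopology = configv1.HighlyAvailableTopologyMode,
//      while controlPlaneTopology stays configv1.SingleReplicaTopologyMode.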