Improve status_reason when cluster creation is stalled #470
base: main
Conversation
Force-pushed from 8fdb65e to df7e0a1
Our recent changes include significant updates, especially with the introduction of Rust.
@jrosser could you rebase your branch? We improved the tests in CI, so it would be good to see a successful CI result before peer review.
If, during cluster creation or update, one or more nodes cannot be created by Nova (for example because the project runs out of quota, or Nova runs out of suitable hypervisors), the cluster stalls in CREATE_IN_PROGRESS status. The status_reason in this case is not helpful, as it only reports the previous step that succeeded. This patch adds a check of each individual machine status, looking for machines which are not ready, and reports any reason found. This can produce useful status_reason messages such as 'error creating Openstack instance ... Quota exceeded for instances'.
Force-pushed from df7e0a1 to 37037e3
okozachenko1203 left a comment
@jrosser thanks for your quick response.
I have one comment now.
# Check reason if an individual machine is not ready
machines = objects.Machine.objects(self.k8s_api).filter(
    namespace="magnum-system",
    selector={
        "cluster.x-k8s.io/cluster-name": cluster.stack_id,
    },
)

for machine in machines:
    for cond in machine.obj["status"]["conditions"]:
        if (
            cond.get("type") == "InfrastructureReady"
            and cond.get("status") == "False"
        ):
            messagetext = cond.get("message")
            if messagetext:
                cluster.status_reason = messagetext
                cluster.save()
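For context, the conditions list that the loop above iterates over would look roughly like the following. This is an illustrative sketch only, assuming the Cluster API Machine condition layout used by the code above; the message text is the example quoted in the PR description, with its elision left as-is.

# Illustrative only: approximate shape of machine.obj["status"]["conditions"]
# for a Machine whose OpenStack server could not be created by Nova.
conditions = [
    {"type": "Ready", "status": "False"},
    {
        "type": "InfrastructureReady",
        "status": "False",
        "message": "error creating Openstack instance ... Quota exceeded for instances",
    },
]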
This change will overwrite the status_reason generated by _get_cluster_status_reason in the previous step.
Currently, _get_cluster_status_reason derives the status reason message from the Cluster and OpenStackCluster CRs.
I recommend integrating your code snippet directly into the _get_cluster_status_reason function. To achieve this, you will need to:
- Loop through all machines and concatenate their respective messages along with the machine names.
- Append this aggregated all-machine message to the final status reason, alongside the existing messages (a rough sketch of this follows the quoted function below).
magnum-cluster-api/magnum_cluster_api/driver.py
Lines 124 to 148 in b2a612f

def _get_cluster_status_reason(self, capi_cluster):
    capi_cluster_status_reason = ""
    capi_ops_cluster_status_reason = ""

    # Get the latest event message of the CAPI Cluster
    capi_cluster_events = capi_cluster.events
    if capi_cluster_events:
        capi_cluster_status_reason += utils.format_event_message(
            list(capi_cluster_events)[-1]
        )

    # Get the latest event message of the CAPI OpenstackCluster
    capi_ops_cluster_events = []
    capi_ops_cluster = capi_cluster.openstack_cluster
    if capi_ops_cluster:
        capi_ops_cluster_events = capi_ops_cluster.events
    if capi_ops_cluster_events:
        capi_ops_cluster_status_reason += utils.format_event_message(
            list(capi_ops_cluster_events)[-1]
        )

    return "CAPI Cluster status: %s. CAPI OpenstackCluster status reason: %s" % (
        capi_cluster_status_reason,
        capi_ops_cluster_status_reason,
    )
And I am curious whether the Cluster and OpenstackCluster CR conditions can all be true even though machines are not ready during CREATE/UPDATE_IN_PROGRESS status.
Unfortunately the original author of this patch has now retired and no longer works on my team. Realistically, someone else will need to pick this up, as I don't have any engineering effort available to work on this for the foreseeable future.