Improve status_reason when cluster creation is stalled #470
base: main
Conversation
Force-pushed from 8fdb65e to df7e0a1
Our recent changes include significant updates, especially with the introduction of Rust.
@jrosser could you rebase your branch? We improved the tests in CI, so it would be good to see a successful CI result before peer review.
If, during cluster creation or update, one or more nodes cannot be created by Nova (for example because the project runs out of quota, or Nova runs out of suitable hypervisors), the cluster stalls in CREATE_IN_PROGRESS status. The status_reason in this case is not helpful, as it only reports the previous step that succeeded. This patch adds a check of each individual machine status, looking for machines which are not ready, and reports any reason found. This can produce useful status_reason messages such as 'error creating Openstack instance ... Quota exceeded for instances'.
Force-pushed from df7e0a1 to 37037e3
okozachenko1203 left a comment
@jrosser thanks for your quick response.
I have one comment now.
# Check reason if an individual machine is not ready
machines = objects.Machine.objects(self.k8s_api).filter(
    namespace="magnum-system",
    selector={
        "cluster.x-k8s.io/cluster-name": cluster.stack_id,
    },
)

for machine in machines:
    for cond in machine.obj["status"]["conditions"]:
        if (
            cond.get("type") == "InfrastructureReady"
            and cond.get("status") == "False"
        ):
            messagetext = cond.get("message")
            if messagetext:
                cluster.status_reason = messagetext
                cluster.save()
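For context, the conditions list that the loop above iterates over would look roughly like the following. This is an illustrative sketch only, assuming the Cluster API Machine condition layout used by the code above; the message text is the example quoted in the PR description, with its elision left as-is.

# Illustrative only: approximate shape of machine.obj["status"]["conditions"]
# for a Machine whose OpenStack server could not be created by Nova.
conditions = [
    {"type": "Ready", "status": "False"},
    {
        "type": "InfrastructureReady",
        "status": "False",
        "message": "error creating Openstack instance ... Quota exceeded for instances",
    },
]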
This change will overwrite the status_reason generated by _get_cluster_status_reason in the previous step.
Currently, _get_cluster_status_reason derives the status reason message from the Cluster and OpenStackCluster CRs.
I recommend integrating your code snippet directly into the _get_cluster_status_reason function. To achieve this, you will need to:
- Loop through all machines and concatenate their respective messages along with the machine names.
- Append this aggregated all-machine message to the final status reason, alongside the existing messages (a rough sketch of this follows the quoted function below).
magnum-cluster-api/magnum_cluster_api/driver.py
Lines 124 to 148 in b2a612f

def _get_cluster_status_reason(self, capi_cluster):
    capi_cluster_status_reason = ""
    capi_ops_cluster_status_reason = ""

    # Get the latest event message of the CAPI Cluster
    capi_cluster_events = capi_cluster.events
    if capi_cluster_events:
        capi_cluster_status_reason += utils.format_event_message(
            list(capi_cluster_events)[-1]
        )

    # Get the latest event message of the CAPI OpenstackCluster
    capi_ops_cluster_events = []
    capi_ops_cluster = capi_cluster.openstack_cluster
    if capi_ops_cluster:
        capi_ops_cluster_events = capi_ops_cluster.events
    if capi_ops_cluster_events:
        capi_ops_cluster_status_reason += utils.format_event_message(
            list(capi_ops_cluster_events)[-1]
        )

    return "CAPI Cluster status: %s. CAPI OpenstackCluster status reason: %s" % (
        capi_cluster_status_reason,
        capi_ops_cluster_status_reason,
    )
And I am curious whether the Cluster and OpenstackCluster CR conditions can all be true even though machines are not ready during CREATE/UPDATE_IN_PROGRESS status.
Unfortunately the original author of this patch has now retired and no longer works on my team. Realistically, someone else will need to pick this up, as I don't have any engineering effort available to work on this for the foreseeable future.