
Conversation

@jrosser
Contributor

@jrosser jrosser commented Jan 8, 2025

If, during cluster creation or update, one or more nodes cannot be created by Nova (because, for example, the project runs out of quota or Nova runs out of suitable hypervisors), the cluster stalls in CREATE_IN_PROGRESS status. The status_reason in this case is not helpful, as it only reports the previous step that succeeded.

This patch adds a check of each individual machine's status, looking for machines that are not ready, and reports any reason found. This can produce useful status_reason messages such as:
'error creating Openstack instance ... Quota exceeded for instances'

@jrosser jrosser force-pushed the better-status-reports-pr branch 2 times, most recently from 8fdb65e to df7e0a1 on January 8, 2025 20:43
@yaguangtang yaguangtang requested a review from mnaser January 10, 2025 11:49
@okozachenko1203
Member

We have recently made significant changes, most notably the introduction of Rust.
However, the status reason handling remains within the Python codebase, so this PR is still valid and relevant.

@okozachenko1203
Member

@jrosser could you rebase your branch? We have improved the tests in CI, so it would be good to see CI pass before peer review.

@jrosser jrosser force-pushed the better-status-reports-pr branch from df7e0a1 to 37037e3 on March 11, 2025 13:11
Member

@okozachenko1203 okozachenko1203 left a comment

@jrosser thanks for your quick response.
I have one comment now.

Comment on lines +223 to +241
# Check reason if an individual machine is not ready
machines = objects.Machine.objects(self.k8s_api).filter(
    namespace="magnum-system",
    selector={
        "cluster.x-k8s.io/cluster-name": cluster.stack_id,
    },
)

for machine in machines:
    for cond in machine.obj["status"]["conditions"]:
        if (
            cond.get("type") == "InfrastructureReady"
            and cond.get("status") == "False"
        ):
            messagetext = cond.get("message")
            if messagetext:
                cluster.status_reason = messagetext
                cluster.save()

Member

@okozachenko1203 okozachenko1203 Mar 11, 2025

This change will overwrite the status_reason generated by _get_cluster_status_reason in the previous step.

Currently, _get_cluster_status_reason derives the status reason message from the Cluster and OpenStackCluster CRs.

I recommend integrating your code snippet directly into the _get_cluster_status_reason function; a sketch of this approach is shown after the current implementation below. To achieve this, you will need to:

  • Loop through all machines and concatenate their respective messages along with the machine names.
  • Append this aggregated all-machine message to the final status reason, alongside the existing messages.

For reference, the current implementation is:

def _get_cluster_status_reason(self, capi_cluster):
    capi_cluster_status_reason = ""
    capi_ops_cluster_status_reason = ""

    # Get the latest event message of the CAPI Cluster
    capi_cluster_events = capi_cluster.events
    if capi_cluster_events:
        capi_cluster_status_reason += utils.format_event_message(
            list(capi_cluster_events)[-1]
        )

    # Get the latest event message of the CAPI OpenstackCluster
    capi_ops_cluster_events = []
    capi_ops_cluster = capi_cluster.openstack_cluster
    if capi_ops_cluster:
        capi_ops_cluster_events = capi_ops_cluster.events
    if capi_ops_cluster_events:
        capi_ops_cluster_status_reason += utils.format_event_message(
            list(capi_ops_cluster_events)[-1]
        )

    return "CAPI Cluster status: %s. CAPI OpenstackCluster status reason: %s" % (
        capi_cluster_status_reason,
        capi_ops_cluster_status_reason,
    )
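
To make the suggestion concrete, here is a minimal sketch of what the integrated function might look like (hypothetical, not part of this PR; it reuses the snippet from the diff above and assumes capi_cluster.name matches the cluster.x-k8s.io/cluster-name label applied to the Machines):

def _get_cluster_status_reason(self, capi_cluster):
    capi_cluster_status_reason = ""
    capi_ops_cluster_status_reason = ""

    # Existing behaviour: latest event message of the CAPI Cluster
    capi_cluster_events = capi_cluster.events
    if capi_cluster_events:
        capi_cluster_status_reason += utils.format_event_message(
            list(capi_cluster_events)[-1]
        )

    # Existing behaviour: latest event message of the CAPI OpenstackCluster
    capi_ops_cluster_events = []
    capi_ops_cluster = capi_cluster.openstack_cluster
    if capi_ops_cluster:
        capi_ops_cluster_events = capi_ops_cluster.events
    if capi_ops_cluster_events:
        capi_ops_cluster_status_reason += utils.format_event_message(
            list(capi_ops_cluster_events)[-1]
        )

    # New: collect the InfrastructureReady=False message of every Machine,
    # prefixed with the machine name, so one machine does not overwrite another.
    machine_status_reasons = []
    machines = objects.Machine.objects(self.k8s_api).filter(
        namespace="magnum-system",
        selector={
            # Assumption: the CAPI Cluster name matches this label on the Machines.
            "cluster.x-k8s.io/cluster-name": capi_cluster.name,
        },
    )
    for machine in machines:
        for cond in machine.obj.get("status", {}).get("conditions", []):
            if (
                cond.get("type") == "InfrastructureReady"
                and cond.get("status") == "False"
                and cond.get("message")
            ):
                machine_status_reasons.append(
                    "%s: %s"
                    % (machine.obj["metadata"]["name"], cond["message"])
                )

    return (
        "CAPI Cluster status: %s. "
        "CAPI OpenstackCluster status reason: %s. "
        "Machines status reason: %s"
        % (
            capi_cluster_status_reason,
            capi_ops_cluster_status_reason,
            "; ".join(machine_status_reasons),
        )
    )

Appending the aggregated per-machine messages this way keeps the event-derived reasons from the Cluster and OpenstackCluster CRs visible instead of overwriting them.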

Member

Also, I am curious whether the Cluster and OpenstackCluster CR conditions can all be true even though machines are not ready during the CREATE/UPDATE_IN_PROGRESS status.

Contributor Author

Unfortunately, the original author of this patch has now retired and no longer works on my team. Realistically, someone else will need to pick this up, as I have no engineering effort available to work on this for the foreseeable future.

