
Conversation

@spencer-p
Contributor

Why are these changes needed?

Sometimes cluster provisioning fails at runtime and requires intervention, but
it is not apparent from reading the RayCluster status that things will not
progress.

For example, users may forget to add image pull secrets or make a typo in their
container images and would only see ContainersNotReady, which may leave them
waiting for readiness that won't occur.

This change adds container status messages from Waiting and Terminated states to
the head pod condition to aid debuggability in these cases.

Before:

status:
  conditions:
  - lastTransitionTime: '2025-10-24T21:09:52Z'
    message: 'containers with unready status: [ray-head]'
    reason: ContainersNotReady
    status: 'False'
    type: HeadPodReady

After:

status:
  conditions:
  - lastTransitionTime: '2025-10-24T21:09:52Z'
    message: 'containers with unready status: [ray-head]; ray-head:
        ImagePullBackOff: Back-off pulling image "rayproject/roy:2.46.0":
        ErrImagePull: rpc error: code = NotFound desc = ...'
    reason: ContainersNotReady
    status: 'False'
    type: HeadPodReady
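
As a rough illustration of the "After" example above, the enriched condition message is the pod condition message with the first failing container's name, reason, and message appended. This is a hypothetical helper (`headPodMessage` is not the actual KubeRay function name), shown only to make the message shape concrete:

```go
package main

import "fmt"

// headPodMessage is a hypothetical helper illustrating the "After" message
// format: pod-level message, then "<container>: <reason>: <message>".
func headPodMessage(podMessage, containerName, waitingReason, waitingMessage string) string {
	return fmt.Sprintf("%s; %s: %s: %s", podMessage, containerName, waitingReason, waitingMessage)
}

func main() {
	// Mirrors the example above: an unpullable head container image.
	fmt.Println(headPodMessage(
		"containers with unready status: [ray-head]",
		"ray-head",
		"ImagePullBackOff",
		`Back-off pulling image "rayproject/roy:2.46.0"`,
	))
}
```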

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim
Member

@spencer-p any reason for not updating worker status too? There's a separate condition for all workers that we can use, I think.

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from 2526828 to c4b9fe3 Compare November 17, 2025 17:51
@kevin85421
Member

We had some earlier discussions: #2387 (comment).

This is a reasonable request because it has been requested several times, but we should be careful and understand: (1) why the K8s ReplicaSet didn't surface the issues, (2) when we should surface the issues, and (3) what we should surface.

@kevin85421
Member

Closes #2387

@rueian
Collaborator

rueian commented Nov 17, 2025

Hi @spencer-p, could you help fix the lint error?

@Future-Outlier Future-Outlier self-assigned this Nov 18, 2025
@spencer-p
Contributor Author

any reason for not updating worker status too?

It was not such obvious low-hanging fruit :)

I'd like to get some agreement on the shape of the change before I go ahead and do the same for the workers, but I'll consider that a goal for this PR.

we should be careful and understand: (1) why the K8s ReplicaSet didn't surface the issues, (2) when we should surface the issues, and (3) what we should surface.

I agree, and I'm a little wary of inserting all this info into the message string, where it is not even that useful for automation. What do you think about inserting the container status reason and the message only in cases where the pod readiness reason is "ContainersNotReady"? E.g.

status:
  conditions:
  - lastTransitionTime: '2025-10-24T21:09:52Z'
    message: 'containers with unready status: [ray-head]; ray-head: Back-off
        pulling image "rayproject/roy:2.46.0": ErrImagePull: rpc error: code =
        NotFound desc = ...'
    reason: ImagePullBackOff
    status: 'False'
    type: HeadPodReady

Maybe it's a subtle point, but I am more comfortable with this. I feel it addresses point (2) by narrowing the scope and point (3) by providing structured info (as well as the human-readable message).

(I'm not sure what you're referring to regarding ReplicaSets (I didn't think KubeRay used ReplicaSets and I don't see any references?))
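
The narrowed proposal above, promoting the container-level reason only when the pod readiness reason is "ContainersNotReady", could be sketched like this. Note that `enrichCondition` and the `Condition` struct are hypothetical stand-ins, not the actual KubeRay types (the real controller works on `metav1.Condition`):

```go
package main

import "fmt"

// Condition is a minimal stand-in for metav1.Condition (Reason/Message only).
type Condition struct {
	Reason  string
	Message string
}

// enrichCondition is a hypothetical sketch: only when the pod readiness
// reason is "ContainersNotReady" do we surface the first failing container's
// reason as the condition reason and append its message.
func enrichCondition(cond Condition, containerReason, containerMessage string, found bool) Condition {
	if cond.Reason == "ContainersNotReady" && found && containerReason != "" {
		cond.Message = fmt.Sprintf("%s; %s", cond.Message, containerMessage)
		cond.Reason = containerReason
	}
	return cond
}

func main() {
	c := enrichCondition(
		Condition{
			Reason:  "ContainersNotReady",
			Message: "containers with unready status: [ray-head]",
		},
		"ImagePullBackOff",
		`ray-head: Back-off pulling image "rayproject/roy:2.46.0"`,
		true,
	)
	fmt.Println(c.Reason)
	fmt.Println(c.Message)
}
```

Any other readiness reason (or no failing container found) leaves the condition untouched, which keeps the scope as narrow as proposed.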

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from c4b9fe3 to 2a9ed03 Compare November 18, 2025 01:20
@kevin85421
Member

What do you think about inserting the container status reason and the message in only cases where the pod readiness reason is "ContainersNotReady"

This is reasonable to me.

(I'm not sure what you're referring to regarding replicasets (i didn't think kuberay used repliasets and I don't see any references?))

Oh, when I am not sure whether some changes are good or not, I check upstream K8s behavior. For example, a ReplicaSet also manages a group of Pods (similar to a RayCluster), but it doesn't surface Pod-level issues in the ReplicaSet's status.

@rueian
Collaborator

rueian commented Nov 19, 2025

Are we going to extend the RayClusterReplicaFailure condition? I think it should be in another PR due to the complexity.

@spencer-p
Contributor Author

Considering this ready for code review. I added the same behavior to the workers as for the head pod (per Andrew's comment): the first container failure for ContainersNotReady is reported.

Showing only the first error should help clue in new users and will avoid cases where hundreds of workers have errors concatenated together.

check upstream K8s behavior

I agree. I noticed Deployments will simply report MinimumReplicasUnavailable even if the pods are crashing or unpullable. I think it's expected that Kubernetes-native users will know to dig deeper, but that's tricky for folks who are new (or more interested in their Ray job than in pod statuses).

Are we going to extend the RayClusterReplicaFailure condition? I think it should be in another PR due to the complexity.

Just saw this after adding the change :). I'm happy to move it to another PR but it seems to be about the same complexity as the head pod change IMO.

@rueian
Collaborator

rueian commented Nov 19, 2025

Hi @spencer-p, can we have a new condition name for workers, such as WorkerPodsReady? Not being ready yet may not indicate a failure.

@spencer-p
Contributor Author

I like WorkerPodsReady.

That opens up some new cases when there are no workers. I am thinking in such cases the condition should be omitted.

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from b8cc92c to 2a9ed03 Compare November 19, 2025 23:53
@spencer-p
Contributor Author

Reverting the last commit so that WorkerPodsReady can have a separate PR as it will be more detailed

@andrewsykim
Member

Just some nits

return headPodReadyCondition
}

func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
Member

nit: named return variables are not needed here

Contributor Author

I named the return variables for documentation, to disambiguate the two strings

func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
	for _, status := range pod.Status.ContainerStatuses {
		if status.State.Waiting != nil {
			return status.State.Waiting.Reason, status.Name + ": " + status.State.Waiting.Message, true
Member

nit: return status.State.Waiting.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Waiting.Message), true

Member

(same for terminated)

Contributor Author

Done
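
Putting the helper together with the review nits, a self-contained sketch might look like the following. The types here are minimal local stand-ins for the `k8s.io/api/core/v1` types, so this is an approximation of the merged code under stated assumptions, not a copy of it:

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types involved; the real implementation
// uses k8s.io/api/core/v1 (Pod, ContainerStatus, ContainerState, etc.).
type ContainerStateWaiting struct{ Reason, Message string }
type ContainerStateTerminated struct{ Reason, Message string }
type ContainerState struct {
	Waiting    *ContainerStateWaiting
	Terminated *ContainerStateTerminated
}
type ContainerStatus struct {
	Name  string
	State ContainerState
}
type PodStatus struct{ ContainerStatuses []ContainerStatus }
type Pod struct{ Status PodStatus }

// firstNotReadyContainerStatus returns the reason and message of the first
// container found in a Waiting or Terminated state. Named returns are kept
// for documentation, to disambiguate the two strings, per the thread above.
func firstNotReadyContainerStatus(pod *Pod) (reason string, message string, ok bool) {
	for _, status := range pod.Status.ContainerStatuses {
		if status.State.Waiting != nil {
			return status.State.Waiting.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Waiting.Message), true
		}
		if status.State.Terminated != nil {
			return status.State.Terminated.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Terminated.Message), true
		}
	}
	return "", "", false
}

func main() {
	pod := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{{
		Name: "ray-head",
		State: ContainerState{Waiting: &ContainerStateWaiting{
			Reason:  "ImagePullBackOff",
			Message: `Back-off pulling image "rayproject/roy:2.46.0"`,
		}},
	}}}}
	reason, message, ok := firstNotReadyContainerStatus(pod)
	fmt.Println(reason, "|", message, "|", ok)
}
```

Returning only the first failing container keeps the condition message bounded, which matters once the same pattern is applied to hundreds of workers.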

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from b1f579b to a15cb60 Compare November 20, 2025 18:37
@andrewsykim andrewsykim merged commit 3bd363d into ray-project:master Nov 20, 2025
27 of 36 checks passed