
Conversation

@spencer-p
Contributor

Why are these changes needed?

Sometimes cluster provisioning fails at runtime and requires intervention, but
it is not apparent from reading the RayCluster status that things will not
progress.

For example, users may forget to add image pull secrets or make a typo in their
container images and would only see ContainersNotReady, which may leave them
waiting for readiness that won't occur.

This change adds container status messages from Waiting and Terminated states to
the head pod condition to aid debuggability in these cases.

Before:

status:
  conditions:
  - lastTransitionTime: '2025-10-24T21:09:52Z'
    message: 'containers with unready status: [ray-head]'
    reason: ContainersNotReady
    status: 'False'
    type: HeadPodReady

After:

status:
  conditions:
  - lastTransitionTime: '2025-10-24T21:09:52Z'
    message: 'containers with unready status: [ray-head]; ray-head:
        ImagePullBackOff: Back-off pulling image "rayproject/roy:2.46.0":
        ErrImagePull: rpc error: code = NotFound desc = ...'
    reason: ContainersNotReady
    status: 'False'
    type: HeadPodReady
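
As a rough illustration of the "After" example above, the enriched condition message is the pod condition message with the first failing container's name, reason, and message appended. This is a hypothetical helper (`headPodMessage` is not the actual KubeRay function name), shown only to make the message shape concrete:

```go
package main

import "fmt"

// headPodMessage is a hypothetical helper illustrating the "After" message
// format: pod-level message, then "<container>: <reason>: <message>".
func headPodMessage(podMessage, containerName, waitingReason, waitingMessage string) string {
	return fmt.Sprintf("%s; %s: %s: %s", podMessage, containerName, waitingReason, waitingMessage)
}

func main() {
	// Mirrors the example above: an unpullable head container image.
	fmt.Println(headPodMessage(
		"containers with unready status: [ray-head]",
		"ray-head",
		"ImagePullBackOff",
		`Back-off pulling image "rayproject/roy:2.46.0"`,
	))
}
```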

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim
Member

@spencer-p any reason for not updating worker status too? There's a separate condition for all workers that we can use, I think.

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from 2526828 to c4b9fe3 Compare November 17, 2025 17:51
@kevin85421
Member

We had some earlier discussions: #2387 (comment).

This is a reasonable request because it has been requested several times, but we should be careful and understand: (1) why the K8s ReplicaSet didn't surface the issues, (2) when we should surface the issues, and (3) what we should surface.

@kevin85421
Member

Closes #2387

@rueian
Collaborator

rueian commented Nov 17, 2025

Hi @spencer-p, could you help fix the lint error?

@Future-Outlier Future-Outlier self-assigned this Nov 18, 2025
@spencer-p
Contributor Author

any reason for not updating worker status too?

It was not such obvious low-hanging fruit :)

I'd like to get some agreement on the shape of the change before I go ahead and do the same for the workers, but I'll consider that a goal for this PR.

we should be careful and understand: (1) why the K8s ReplicaSet didn't surface the issues, (2) when we should surface the issues, and (3) what we should surface.

I agree, and I'm a little wary of inserting all this info into the message string, where it is not even that useful for automation. What do you think about inserting the container status reason and the message only in cases where the pod readiness reason is "ContainersNotReady"? E.g.

status:
  conditions:
  - lastTransitionTime: '2025-10-24T21:09:52Z'
    message: 'containers with unready status: [ray-head]; ray-head: Back-off
        pulling image "rayproject/roy:2.46.0": ErrImagePull: rpc error: code =
        NotFound desc = ...'
    reason: ImagePullBackOff
    status: 'False'
    type: HeadPodReady

Maybe it's a subtle point, but I am more comfortable with this. I feel it addresses point (2) by narrowing the scope and point (3) by providing structured info (as well as the human-readable message).

(I'm not sure what you're referring to regarding ReplicaSets (I didn't think KubeRay used ReplicaSets and I don't see any references?))
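
The narrowed proposal above, promoting the container-level reason only when the pod readiness reason is "ContainersNotReady", could be sketched like this. Note that `enrichCondition` and the `Condition` struct are hypothetical stand-ins, not the actual KubeRay types (the real controller works on `metav1.Condition`):

```go
package main

import "fmt"

// Condition is a minimal stand-in for metav1.Condition (Reason/Message only).
type Condition struct {
	Reason  string
	Message string
}

// enrichCondition is a hypothetical sketch: only when the pod readiness
// reason is "ContainersNotReady" do we surface the first failing container's
// reason as the condition reason and append its message.
func enrichCondition(cond Condition, containerReason, containerMessage string, found bool) Condition {
	if cond.Reason == "ContainersNotReady" && found && containerReason != "" {
		cond.Message = fmt.Sprintf("%s; %s", cond.Message, containerMessage)
		cond.Reason = containerReason
	}
	return cond
}

func main() {
	c := enrichCondition(
		Condition{
			Reason:  "ContainersNotReady",
			Message: "containers with unready status: [ray-head]",
		},
		"ImagePullBackOff",
		`ray-head: Back-off pulling image "rayproject/roy:2.46.0"`,
		true,
	)
	fmt.Println(c.Reason)
	fmt.Println(c.Message)
}
```

Any other readiness reason (or no failing container found) leaves the condition untouched, which keeps the scope as narrow as proposed.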

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from c4b9fe3 to 2a9ed03 Compare November 18, 2025 01:20
@kevin85421
Member

What do you think about inserting the container status reason and the message in only cases where the pod readiness reason is "ContainersNotReady"

This is reasonable to me.

(I'm not sure what you're referring to regarding replicasets (i didn't think kuberay used repliasets and I don't see any references?))

Oh, when I am not sure whether some changes are good or not, I check upstream K8s behavior. For example, a ReplicaSet also manages a group of Pods (similar to a RayCluster), but it doesn't surface Pod-level issues in the ReplicaSet's status.

@rueian
Collaborator

rueian commented Nov 19, 2025

Are we going to extend the RayClusterReplicaFailure condition? I think it should be in another PR due to the complexity.

@spencer-p
Contributor Author

Considering this ready for code review. I added the same behavior to the workers as for the head pod (per Andrew's comment): the first container failure for ContainersNotReady is reported.

Showing only the first error should help clue in new users and will avoid cases where hundreds of workers have errors concatenated together.

check upstream K8s behavior

I agree. I noticed Deployments will simply report MinimumReplicasUnavailable even if the pods are crashing or unpullable. I think it's expected that Kubernetes-native users will know to dig deeper, but that's tricky for folks who are new (or more interested in their Ray job than in pod statuses).

Are we going to extend the RayClusterReplicaFailure condition? I think it should be in another PR due to the complexity.

Just saw this after adding the change :). I'm happy to move it to another PR but it seems to be about the same complexity as the head pod change IMO.

@rueian
Collaborator

rueian commented Nov 19, 2025

Hi @spencer-p, can we have a new condition name for workers, such as WorkerPodsReady? Not being ready yet may not indicate a failure.

@spencer-p
Contributor Author

I like WorkerPodsReady.

That opens up some new cases when there are no workers. I am thinking in such cases the condition should be omitted.

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from b8cc92c to 2a9ed03 Compare November 19, 2025 23:53
@spencer-p
Contributor Author

Reverting the last commit so that WorkerPodsReady can have a separate PR as it will be more detailed

@andrewsykim
Member

Just some nits

return headPodReadyCondition
}

func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
Member

nit: named return variables are not needed here

Contributor Author

I named the return variables for documentation, to disambiguate the two strings

func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
	for _, status := range pod.Status.ContainerStatuses {
		if status.State.Waiting != nil {
			return status.State.Waiting.Reason, status.Name + ": " + status.State.Waiting.Message, true
Member

nit: return status.State.Waiting.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Waiting.Message), true

Member

(same for terminated)

Contributor Author

Done
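
Putting the helper together with the review nits, a self-contained sketch might look like the following. The types here are minimal local stand-ins for the `k8s.io/api/core/v1` types, so this is an approximation of the merged code under stated assumptions, not a copy of it:

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types involved; the real implementation
// uses k8s.io/api/core/v1 (Pod, ContainerStatus, ContainerState, etc.).
type ContainerStateWaiting struct{ Reason, Message string }
type ContainerStateTerminated struct{ Reason, Message string }
type ContainerState struct {
	Waiting    *ContainerStateWaiting
	Terminated *ContainerStateTerminated
}
type ContainerStatus struct {
	Name  string
	State ContainerState
}
type PodStatus struct{ ContainerStatuses []ContainerStatus }
type Pod struct{ Status PodStatus }

// firstNotReadyContainerStatus returns the reason and message of the first
// container found in a Waiting or Terminated state. Named returns are kept
// for documentation, to disambiguate the two strings, per the thread above.
func firstNotReadyContainerStatus(pod *Pod) (reason string, message string, ok bool) {
	for _, status := range pod.Status.ContainerStatuses {
		if status.State.Waiting != nil {
			return status.State.Waiting.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Waiting.Message), true
		}
		if status.State.Terminated != nil {
			return status.State.Terminated.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Terminated.Message), true
		}
	}
	return "", "", false
}

func main() {
	pod := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{{
		Name: "ray-head",
		State: ContainerState{Waiting: &ContainerStateWaiting{
			Reason:  "ImagePullBackOff",
			Message: `Back-off pulling image "rayproject/roy:2.46.0"`,
		}},
	}}}}
	reason, message, ok := firstNotReadyContainerStatus(pod)
	fmt.Println(reason, "|", message, "|", ok)
}
```

Returning only the first failing container keeps the condition message bounded, which matters once the same pattern is applied to hundreds of workers.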

@spencer-p spencer-p force-pushed the raycluster-errimagepull branch from b1f579b to a15cb60 Compare November 20, 2025 18:37
@andrewsykim andrewsykim merged commit 3bd363d into ray-project:master Nov 20, 2025
27 of 36 checks passed