[RayCluster] Status includes head container status message #4196
Conversation
@spencer-p any reason for not updating worker status too? There's a separate condition for all workers that we can use, I think.
(force-pushed 2526828 to c4b9fe3)
We had some earlier discussions: #2387 (comment). This is a reasonable request because it has been requested several times, but we should be careful and understand: (1) why the K8s ReplicaSet didn't surface the issues, (2) when we should surface the issues, and (3) what we should surface.
Closes #2387
Hi @spencer-p, could you help fix the lint error?
It was not as obviously low-hanging fruit :) I'd like to get some agreement on the shape of the change before I go ahead and do the same for the workers, but I'll consider that a goal for this PR.
I agree, and I'm a little wary of inserting all this info in the message string, where it is not even that useful for automation. What do you think about inserting the container status reason and the message only in cases where the pod readiness reason is "ContainersNotReady"? Maybe it's a subtle point, but I am more comfortable with this. I feel it addresses point 2 by narrowing the scope and point 3 by providing structured info (as well as the human-readable message). (I'm not sure what you're referring to regarding ReplicaSets; I didn't think KubeRay used ReplicaSets and I don't see any references?)
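For illustration, a minimal sketch of the proposed narrowing, assuming it sits next to the existing head-pod condition logic; the package placement and the headPodNotReadyDetail name are hypothetical, not part of this PR. Container-level detail is consulted only when the pod's Ready condition carries the ContainersNotReady reason:

    package utils // placement is illustrative, not the PR's actual file

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // headPodNotReadyDetail (hypothetical name) surfaces container-level detail
    // only when the head pod's Ready condition has reason ContainersNotReady.
    func headPodNotReadyDetail(pod *corev1.Pod) (reason string, message string, ok bool) {
        for _, cond := range pod.Status.Conditions {
            if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionFalse &&
                cond.Reason == "ContainersNotReady" {
                // firstNotReadyContainerStatus is the helper added in this PR.
                return firstNotReadyContainerStatus(pod)
            }
        }
        return "", "", false
    }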
(force-pushed c4b9fe3 to 2a9ed03)
This is reasonable to me.
Oh, when I am not sure whether some changes are good or not, I will check upstream K8s behavior. For example, …
Are we going to extend the …
Considering this ready for code review. I added the same behavior as the head pod to the workers (per Andrew's comment), where the first container failure for ContainersNotReady is reported. Showing only the first error should help clue in new users and will avoid cases where hundreds of workers have errors concatenated together.
I agree. I noticed Deployments will simply report MinimumReplicasUnavailable even if the pods are crashing or unpullable. I think it's been expected that Kubernetes-native users will know to dig deeper, but that's tricky for folks that are new (or more interested in their Ray job than pod statuses).
Just saw this after adding the change :). I'm happy to move it to another PR, but it seems to be about the same complexity as the head pod change IMO.
Hi @spencer-p, can we have a new condition name for workers? WorkerPodsReady? Since not being ready yet may not result in a failure.
I like WorkerPodsReady. That opens up some new cases when there are no workers. I am thinking in such cases the condition should be omitted. |
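A rough sketch of what omitting the condition could look like in that follow-up; the WorkerPodsReady type, the helper name, and its arguments are placeholders rather than merged code, and it assumes the imports meta "k8s.io/apimachinery/pkg/api/meta" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1":

    // Hypothetical sketch for the follow-up PR, not this one.
    func setWorkerPodsReadyCondition(conditions *[]metav1.Condition, desiredWorkers int32, ready bool, reason, message string) {
        if desiredWorkers == 0 {
            // No workers requested: omit the condition from status entirely.
            meta.RemoveStatusCondition(conditions, "WorkerPodsReady")
            return
        }
        status := metav1.ConditionFalse
        if ready {
            status = metav1.ConditionTrue
        }
        meta.SetStatusCondition(conditions, metav1.Condition{
            Type:    "WorkerPodsReady",
            Status:  status,
            Reason:  reason,
            Message: message,
        })
    }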
(force-pushed b8cc92c to 2a9ed03)
Reverting the last commit so that WorkerPodsReady can have a separate PR as it will be more detailed |
andrewsykim left a comment:
Just some nits
        return headPodReadyCondition
    }

    func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
nit: named return variables are not needed here
I named the return variables for documentation, to disambiguate the two strings
    func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
        for _, status := range pod.Status.ContainerStatuses {
            if status.State.Waiting != nil {
                return status.State.Waiting.Reason, status.Name + ": " + status.State.Waiting.Message, true
nit: return status.State.Waiting.Reason, fmt.Sprintf("%s: %s", status.Name, status.State.Waiting.Message), true
(same for terminated)
Done
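For readers following along, a sketch of the helper with both nits applied, handling Terminated the same way as Waiting; this approximates rather than reproduces the merged code and assumes the fmt and corev1 ("k8s.io/api/core/v1") imports:

    // Returns the reason and "container: message" detail of the first container
    // found in a Waiting or Terminated state, or ok=false if none is found.
    func firstNotReadyContainerStatus(pod *corev1.Pod) (reason string, message string, ok bool) {
        for _, status := range pod.Status.ContainerStatuses {
            if status.State.Waiting != nil {
                return status.State.Waiting.Reason,
                    fmt.Sprintf("%s: %s", status.Name, status.State.Waiting.Message), true
            }
            if status.State.Terminated != nil {
                return status.State.Terminated.Reason,
                    fmt.Sprintf("%s: %s", status.Name, status.State.Terminated.Message), true
            }
        }
        return "", "", false
    }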
Signed-off-by: Spencer Peterson <[email protected]>
(force-pushed b1f579b to a15cb60)
Why are these changes needed?
Sometimes cluster provisioning fails at runtime and requires intervention, but it is not apparent from reading the RayCluster status that things will not progress.

For example, users may forget to add image pull secrets or make a typo in their container image, and would only see ContainersNotReady, which may leave them waiting for readiness that will never occur.

This change adds container status messages from the Waiting and Terminated states to the head pod condition to aid debuggability in these cases.
Before:
After:
Related issue number
Checks