Skip to content

feat: add worker-0 fast-fail and host maintenance retry to clustered plugin#7418

Open
AdilFayyaz wants to merge 1 commit into
adil/jobsets-emission-happy-pathfrom
adil/jobsets-failure-paths
Open

feat: add worker-0 fast-fail and host maintenance retry to clustered plugin#7418
AdilFayyaz wants to merge 1 commit into
adil/jobsets-emission-happy-pathfrom
adil/jobsets-failure-paths

Conversation

@AdilFayyaz
Copy link
Copy Markdown
Contributor

@AdilFayyaz AdilFayyaz commented May 22, 2026

Why are the changes needed?

The PR (#7417) happy-path plugin had no way to surface pod-level failures early or distinguish infrastructure evictions from user code failures, causing wasted retry budget on maintenance events and slow failure detection.

What changes were proposed in this pull request?

  • Adds maybeFastFailWorker0: when ReplicatedJobsStatus[workers].Failed > 0, calls flytek8s.DemystifyFailedOrPendingPod on the rank-0 pod (<jobset>-workers-0-0) to surface failures before the JobSet controller sets JobSetFailed
  • Adds maybeSystemRetryOnMaintenance: on JobSetFailed, if RestartOnHostMaintenance=true, inspects the rank-0 pod and reclassifies system-caused evictions as PhaseInfoSystemRetryableFailureWithCleanup — retrying without charging the user's max_restarts budget
  • GetTaskPhase now reads ClusteredTaskSpec to check FailurePolicy.RestartOnHostMaintenance flag; both detection paths are best-effort (fall through to normal failure if the pod is already cleaned up)

How was this patch tested?

  • Run: go test ./flyteplugins/go/tasks/plugins/k8s/clustered/... -v — all 15 tests pass
  • TestGetTaskPhase_FastFail_NoJobsFailed: confirms no pod lookup when ReplicatedJobsStatus[workers].Failed == 0 → returns PhaseRunning
  • TestGetTaskPhase_MaintenanceRetry_FlagFalse: confirms JobSetFailed returns PhaseRetryableFailure when RestartOnHostMaintenance=false, no pod lookup attempted
  • Integration coverage (actual eviction detection via DemystifyFailedOrPendingPod) is deferred to a later PR covering E2E tests

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Stack

If you do use git town to manage PR Stacks, the stack relevant to this PR
will show below. Otherwise, you can ignore this section.

Docs link

@AdilFayyaz AdilFayyaz self-assigned this May 22, 2026
@AdilFayyaz AdilFayyaz added the added Merged changes that add new functionality label May 22, 2026
@AdilFayyaz AdilFayyaz changed the title add: failure cases feat: add worker-0 fast-fail and host maintenance retry to clustered plugin May 22, 2026
@AdilFayyaz AdilFayyaz requested a review from pingsutw May 22, 2026 19:28
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz force-pushed the adil/jobsets-failure-paths branch from d797483 to 968df9f Compare May 22, 2026 19:33
// Read spec for failure-policy flags (restart_on_host_maintenance).
var spec clusteredpb.ClusteredTaskSpec
if taskTemplate, err := pluginContext.TaskReader().Read(ctx); err == nil && taskTemplate != nil {
_ = utils.UnmarshalStruct(taskTemplate.GetCustom(), &spec) //nolint:staticcheck
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we raise an error if it fails to unmarshal?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

added Merged changes that add new functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants