Skip to content

feat(spur): close exit-code reporting gap — exit:signal, DerivedExitCode, reasons#274

Merged
shiv-tyagi merged 20 commits into
ROCm:mainfrom
yansun1996:parity/exit-code-signal
Jun 15, 2026
Merged

feat(spur): close exit-code reporting gap — exit:signal, DerivedExitCode, reasons#274
shiv-tyagi merged 20 commits into
ROCm:mainfrom
yansun1996:parity/exit-code-signal

Conversation

@yansun1996

@yansun1996 yansun1996 commented Jun 11, 2026

Copy link
Copy Markdown
Member

Summary

Closes #269. Brings Spur's job exit-code reporting to Slurm parity, end to end: exit:signal encoding, DerivedExitCode, and NonZeroExitCode / RaisedSignal:N(name) reasons, plus an expanded PendingReason vocabulary.

Previously Spur reported a bare lossy ExitCode (signal collapsed to -1 / 128+sig), had no DerivedExitCode field, and never set a completion reason — silently corrupting outcome data for anything scraping scontrol show job / sacct.

What changed

Exit status is threaded from the agent's process wait status through the report RPC, WAL, core Job, display RPC, and CLI:

  • Agent (spurd) decodes the raw WaitStatus into (exit_code, signal) instead of collapsing it; WIFEXITED -> code:0, WIFSIGNALED -> 0:sig.
  • New exit_signal and derived_exit_code fields on the core Job and proto JobInfo; JobComplete / JobNodeComplete WAL ops carry the signal.
  • Controller sets the failure reason from the batch/primary node's wait status: NonZeroExitCode on non-zero exit, RaisedSignal on signal death.
  • DerivedExitCode is the running max over srun step exit codes (Slurm semantics), maintained live and durably via a new JobStepComplete WAL op; a job with no srun steps reports 0:0.
  • CLI renders ExitCode=code:signal, DerivedExitCode=code:signal, and composes the dynamic RaisedSignal:N(name) string at display time.
  • Expanded PendingReason toward the Slurm vocabulary (NonZeroExitCode, RaisedSignal, OutOfMemory, JobLaunchFailure, ...).

Two adjacent fixes uncovered during live verification are included: jobs now start with default signal dispositions (so kill -INT $$ is reported as RaisedSignal:2 rather than being absorbed by spurd's inherited SIG_IGN), and the Spur CLI bin dir is added to the job PATH (derived from the agent's own location) so srun resolves inside batch scripts.

Exact Slurm semantics implemented

ExitCode is the batch/primary process's raw wait status, not a max. A step killed by a signal surfaces to the batch shell as exit 128+sig, so it appears as NonZeroExitCode with ExitCode=128+sig:0, not RaisedSignal. RaisedSignal only applies when the batch process itself is signaled. DerivedExitCode is the max exit code across steps.

Testing

Unit and integration coverage across spur-core, spurctld, spurd, and spur-cli (derived-completion rules, signal decoding, reason emission, RPC-level signaled-completion recovery, WAL signal round-trip, step running-max). Full suite passes; cargo fmt --check --all and clippy clean.

Verified live against Slurm 25.11.6 on a single-node testbed: clean exit, single non-zero, multi-step srun ordering (ExitCode=last, DerivedExitCode=max), no-srun (DerivedExitCode=0:0), and self-signal cases (SIGINT/SIGTERM/SIGABRT -> RaisedSignal:N(name)), including the live running-max visible in scontrol show job mid-run.

Out of scope (follow-ups)

sacct step history in spurdbd (live DerivedExitCode works via the controller; persisting per-step rows to the accounting DB is separate — tracked in #297). k8s backend signal reporting (currently signal: 0; the controller-side machinery already applies). A separately filed controller robustness issue (#273, openraft snapshot/log compaction) is unrelated to this change.

Design docs

Design and implementation plan included under docs/superpowers/.

Live verification vs Slurm 25.11.6

Ran an expanded matrix on a Slurm 25.11.6 cluster and a single-node Spur cluster, submitting sequentially (each job to a terminal state before the next) to avoid cluster-saturation races. Format per cell: State / ExitCode / DerivedExitCode / Reason.

# Case Slurm Spur Match
1 exit 0 COMPLETED / 0:0 / 0:0 / None COMPLETED / 0:0 / 0:0 / None
2 exit 13 FAILED / 13:0 / 0:0 / NonZeroExitCode FAILED / 13:0 / 0:0 / NonZeroExitCode
3 exit 200 FAILED / 200:0 / 0:0 / NonZeroExitCode FAILED / 200:0 / 0:0 / NonZeroExitCode
4 exit 137 (plain) FAILED / 137:0 / 0:0 / NonZeroExitCode FAILED / 137:0 / 0:0 / NonZeroExitCode
5 srun 1,4,9 FAILED / 9:0 / 9:0 / NonZeroExitCode FAILED / 9:0 / 9:0 / NonZeroExitCode
6 srun 9,4,1 FAILED / 1:0 / 9:0 / NonZeroExitCode FAILED / 1:0 / 9:0 / NonZeroExitCode
7 srun 0,8; batch exit 0 COMPLETED / 0:0 / 8:0 / None COMPLETED / 0:0 / 8:0 / None
8 srun 0,0; exit 0 COMPLETED / 0:0 / 0:0 / None COMPLETED / 0:0 / 0:0 / None
9 batch SIGINT FAILED / 0:2 / 0:0 / RaisedSignal:2(Interrupt) FAILED / 0:2 / 0:0 / RaisedSignal:2(Interrupt)
10 batch SIGTERM FAILED / 0:15 / 0:0 / RaisedSignal:15(Terminated) FAILED / 0:15 / 0:0 / RaisedSignal:15(Terminated)
11 batch SIGKILL FAILED / 0:9 / 0:0 / RaisedSignal:9(Killed) FAILED / 0:9 / 0:0 / RaisedSignal:9(Killed)
12 batch SIGSEGV FAILED / 0:11 / 0:0 / RaisedSignal:11(Segmentation_fault) FAILED / 0:11 / 0:0 / RaisedSignal:11(Segmentation_fault)
13 batch SIGABRT FAILED / 0:6 / 0:0 / RaisedSignal:6(Aborted) FAILED / 0:6 / 0:0 / RaisedSignal:6(Aborted)

Key behaviors confirmed identical: ExitCode follows the batch script's last command (case 5 = 9:0 vs case 6 = 1:0 — same steps, reordered), DerivedExitCode is the max over srun step exit codes (cases 5–7), a plain exit 137 stays 137:0 (not mis-read as signal 9), and all batch signals report 0:sig with RaisedSignal:N(name).

Note on DerivedExitCode signal half

Spur tracks DerivedExitCode as the running max over srun step exit codes (the exit half), which matches Slurm in every stable case above. Slurm additionally folds a signaled step's signal into the DerivedExitCode signal half in some cases, but that behavior is order-dependent and inconsistent in Slurm itself (e.g. a signaled step followed by an exiting step yields 0:0; with shell control flow it instead surfaces as ExitCode=0:sig). Replicating it would mean matching undefined Slurm internals, so Spur deliberately does not — the exit-half parity is what consumers rely on.

@codecov-commenter

codecov-commenter commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.12134% with 52 lines in your changes missing coverage. Please review.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #274      +/-   ##
==========================================
+ Coverage   63.80%   64.59%   +0.79%     
==========================================
  Files         117      119       +2     
  Lines       30546    31600    +1054     
==========================================
+ Hits        19488    20409     +921     
- Misses      11058    11191     +133     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yansun1996 yansun1996 marked this pull request as ready for review June 11, 2026 20:11
Copilot AI review requested due to automatic review settings June 11, 2026 20:11

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR closes the Slurm-parity gap in Spur’s job outcome reporting by threading lossless exit:signal status end-to-end, introducing DerivedExitCode, and emitting completion reasons (NonZeroExitCode, RaisedSignal) so scontrol show job / sacct consumers get correct, non-lossy results.

Changes:

  • Extend proto/RPC/WAL/core Job to carry exit signal and derived exit code information, and propagate it through controller ↔ agent reporting.
  • Add durable step-completion tracking (JobStepComplete) to maintain a running-max DerivedExitCode (Slurm semantics) and render it in the CLI.
  • Improve signal fidelity by resetting inherited signal dispositions before exec, and ensure srun resolves inside batch scripts by injecting the agent bin dir into PATH.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
proto/slurm.proto Add signal to job status reports; add exit_signal + derived_exit_code to JobInfo; add step_id to RunStepRequest.
docs/superpowers/specs/2026-06-11-exit-code-signal-design.md Design spec for exit:signal, DerivedExitCode, and reason vocabulary parity.
docs/superpowers/plans/2026-06-11-exit-code-signal.md Implementation plan covering core/WAL/proto/controller/agent/CLI changes and testing strategy.
crates/spurdbd/src/server.rs Populate new JobInfo fields for job history responses (currently defaulted).
crates/spurd/src/executor.rs Decode wait status into (exit_code, signal) and reset inherited signal dispositions before exec.
crates/spurd/src/agent_server.rs Report both exit_code and signal to the controller; inject agent bin dir into job PATH.
crates/spurctld/src/server.rs Accept/report signal, record step completion for DerivedExitCode durability, and expose new fields via proto.
crates/spurctld/src/scheduler_loop.rs Update derived-completion call signature to include primary node selection.
crates/spurctld/src/cluster.rs Thread signal through node completion; add durable JobStepComplete WAL op and step-based DerivedExitCode tracking.
crates/spur-k8s/src/job_controller.rs Populate new signal field in job status reports (currently 0 for k8s backend).
crates/spur-core/src/wal.rs Add signal fields to WAL job completion ops and add JobStepComplete op; add WAL migration test.
crates/spur-core/src/step.rs Introduce STEP_RESERVED_MIN to distinguish reserved vs user step IDs.
crates/spur-core/src/job.rs Add NodeCompletion, exit_signal, derived_exit_code; update derived_completion signature/logic; expand reason vocabulary.
crates/spur-cli/src/srun.rs Capture step_id from CreateJobStep and send it in RunStepRequest for DerivedExitCode accounting.
crates/spur-cli/src/scontrol.rs Render ExitCode=code:signal, DerivedExitCode, and compose RaisedSignal:N(name) at display time; add unit tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/spur-core/src/job.rs Outdated
Comment thread crates/spur-core/src/job.rs Outdated
Comment thread crates/spur-core/src/job.rs Outdated
Comment thread crates/spurd/src/executor.rs
Comment thread proto/slurm.proto Outdated
Comment thread crates/spur-core/src/wal.rs Outdated
yansun1996 and others added 16 commits June 11, 2026 20:26
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
After rebasing onto main (which merged ROCm#271 dep-engine and ROCm#262 GPU CDI),
the exit-code test sites needed: JobStart now carries per_node_alloc and
ResourceAllocations (was ResourceSet), and the dep-engine set_terminal
helper's JobComplete needs the signal field this branch added.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
spurd runs in the background with SIGINT/SIGQUIT/SIGHUP set to SIG_IGN, and
job processes inherited that ignore mask — so a job's own `kill -INT $$` was
a no-op and the job exited 0 instead of being reported as RaisedSignal:2.
Reset these signals (plus SIGPIPE) to SIG_DFL in pre_exec so jobs start with
default handlers, matching Slurm's signal reporting.

Verified on hardware: self-SIGINT now yields ExitCode=0:2 Reason=RaisedSignal:2(Interrupt),
matching Slurm 25.11.6.
…ments

Prepend the agent's own bin dir (current_exe parent) to the job PATH so the
Spur CLI symlinks (srun/sbatch/...) resolve inside batch scripts — previously
bare `srun` failed with 127. Deployment-independent; no user config regardless
of install location. Verified on hardware: bare `srun` now resolves and creates
a step.

Also condense the duplicated signal-reporting explanation (was repeated across
agent_server.rs and two spurctld tests) to a single concise note.
DerivedExitCode was taken from node completions, so a no-srun job wrongly
reported the batch exit and multi-step jobs lost per-step codes. Match Slurm:
DerivedExitCode is the running max over srun step exit codes, maintained live
and durably.

- New JobStepComplete WAL op records each srun step's exit code via Raft, so
  the running max survives restart/replay.
- run_step proposes it (srun now passes the step_id from create_job_step);
  the apply handler updates the step and the job's derived_exit_code max,
  excluding the reserved batch step (STEP_RESERVED_MIN).
- Job finalize preserves the step-accumulated derived_exit_code instead of
  recomputing from nodes; a job with no srun steps stays 0:0.

Verified on hardware vs Slurm 25.11.6: steps 7,3,2 -> ExitCode=2:0
DerivedExitCode=7:0; reversed -> 7:0/7:0; no-srun exit 6 -> 6:0/0:0; and the
running max is visible live in scontrol while the job is RUNNING.
@yansun1996 yansun1996 force-pushed the parity/exit-code-signal branch from 6cfc400 to 0335a28 Compare June 11, 2026 20:26
- executor: reset job signal dispositions with sigaction (async-signal-safe)
  instead of signal() in pre_exec, which runs post-fork in a threaded process.
- job.rs: correct derived_exit_code field/docstring — it is the running max
  over srun steps (via JobStepComplete), not node completions; the node-based
  4th return value of derived_completion is legacy and must not feed it.
- job.rs: in the missing-primary fallback, rank a signaled node above a plain
  non-zero exit so a signal is never masked by a higher exit code.
- proto: clarify derived_exit_code is the running max over srun steps.
- wal test: assert the round-tripped JobNodeComplete fields instead of
  discarding the value, so a serialization regression would be caught.

@shiv-tyagi shiv-tyagi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work. The signal threading end-to-end is clean, the Slurm parity matrix is thorough, and the adjacent fixes (signal disposition reset, PATH injection) are well-motivated.

Comment thread crates/spur-core/src/job.rs Outdated
Comment thread crates/spur-cli/src/scontrol.rs Outdated
Comment thread crates/spur-cli/src/scontrol.rs Outdated
Comment thread crates/spur-core/src/wal.rs Outdated
Comment thread crates/spur-core/src/job.rs Outdated
Comment thread proto/slurm.proto Outdated
…ering)

Drop pre-1.0 migration scaffolding and dead code surfaced in review:

- Job.exit_signal / derived_exit_code: Option<i32> -> i32 (no deployed WAL
  format to protect); removes unwrap_or(0)/Some() gymnastics at every site.
- Drop #[serde(default)] from NodeCompletion fields and JobNodeComplete.signal;
  replace the fabricated old-format WAL test with a clean round-trip assertion.
- Remove the dead JobComplete.signal WAL field (was hardcoded 0, ignored on
  apply) across the variant, apply handler, and all construction sites.
- derived_completion now returns a 3-tuple; the 4th node-based max was unused
  in production and risked being mistaken for the job's DerivedExitCode.

Share completion rendering across the CLIs via a new exit_fmt module so the
surfaces cannot drift:

- squeue %r now composes RaisedSignal:N(name) instead of the bare reason.
- sacct drops the lossy negate-trick encoding for the shared code:signal form
  (signal half is 0 until spurdbd persists it — separate follow-up).
- render_reason matches PendingReason::RaisedSignal.display() rather than a
  hardcoded literal.
- Signal names use underscores (Segmentation_fault, Broken_pipe) to match
  Slurm 25.11.6, verified live.

Reorder JobInfo proto fields so exit_signal/derived_exit_code (33/34) follow
stdout_path/stderr_path (31/32).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@yansun1996

Copy link
Copy Markdown
Member Author

Thanks for the review @shiv-tyagi — all six addressed in cd64a89 (replies inline). Net −33 LOC; full suite + clippy + fmt clean.

Re-verified live against Slurm 25.11.6 (single-node Spur vs Slurm cluster), re-running the full exit-code matrix — State / ExitCode / DerivedExitCode / Reason match across all 13 cases, and squeue %r now composes the signal suffix.

One finding worth flagging: live testing showed Slurm renders multi-word signal names with underscores — RaisedSignal:11(Segmentation_fault) and RaisedSignal:13(Broken_pipe) — whereas the original signal_name table (and matrix row 12 in the PR description, marked ✅) used spaces. That row was inaccurate; the space form was never a real match. Fixed signal_name to use underscores and re-verified live, so it's now truly identical. I'll correct row 12 in the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

parity(slurm): close exit-code reporting gap — add exit:signal encoding, DerivedExitCode, and NonZeroExitCode/RaisedSignal reasons

4 participants