feat(spur): close exit-code reporting gap — exit:signal, DerivedExitCode, reasons#274
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #274 +/- ##
==========================================
+ Coverage 63.80% 64.59% +0.79%
==========================================
Files 117 119 +2
Lines 30546 31600 +1054
==========================================
+ Hits 19488 20409 +921
- Misses 11058 11191 +133 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR closes the Slurm-parity gap in Spur’s job outcome reporting by threading lossless exit:signal status end-to-end, introducing DerivedExitCode, and emitting completion reasons (NonZeroExitCode, RaisedSignal) so scontrol show job / sacct consumers get correct, non-lossy results.
Changes:
- Extend proto/RPC/WAL/core
Jobto carry exit signal and derived exit code information, and propagate it through controller ↔ agent reporting. - Add durable step-completion tracking (
JobStepComplete) to maintain a running-maxDerivedExitCode(Slurm semantics) and render it in the CLI. - Improve signal fidelity by resetting inherited signal dispositions before
exec, and ensuresrunresolves inside batch scripts by injecting the agent bin dir intoPATH.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| proto/slurm.proto | Add signal to job status reports; add exit_signal + derived_exit_code to JobInfo; add step_id to RunStepRequest. |
| docs/superpowers/specs/2026-06-11-exit-code-signal-design.md | Design spec for exit:signal, DerivedExitCode, and reason vocabulary parity. |
| docs/superpowers/plans/2026-06-11-exit-code-signal.md | Implementation plan covering core/WAL/proto/controller/agent/CLI changes and testing strategy. |
| crates/spurdbd/src/server.rs | Populate new JobInfo fields for job history responses (currently defaulted). |
| crates/spurd/src/executor.rs | Decode wait status into (exit_code, signal) and reset inherited signal dispositions before exec. |
| crates/spurd/src/agent_server.rs | Report both exit_code and signal to the controller; inject agent bin dir into job PATH. |
| crates/spurctld/src/server.rs | Accept/report signal, record step completion for DerivedExitCode durability, and expose new fields via proto. |
| crates/spurctld/src/scheduler_loop.rs | Update derived-completion call signature to include primary node selection. |
| crates/spurctld/src/cluster.rs | Thread signal through node completion; add durable JobStepComplete WAL op and step-based DerivedExitCode tracking. |
| crates/spur-k8s/src/job_controller.rs | Populate new signal field in job status reports (currently 0 for k8s backend). |
| crates/spur-core/src/wal.rs | Add signal fields to WAL job completion ops and add JobStepComplete op; add WAL migration test. |
| crates/spur-core/src/step.rs | Introduce STEP_RESERVED_MIN to distinguish reserved vs user step IDs. |
| crates/spur-core/src/job.rs | Add NodeCompletion, exit_signal, derived_exit_code; update derived_completion signature/logic; expand reason vocabulary. |
| crates/spur-cli/src/srun.rs | Capture step_id from CreateJobStep and send it in RunStepRequest for DerivedExitCode accounting. |
| crates/spur-cli/src/scontrol.rs | Render ExitCode=code:signal, DerivedExitCode, and compose RaisedSignal:N(name) at display time; add unit tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…, Slurm-parity derived_completion
…pletion; doc fallback
…, reason emission
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
After rebasing onto main (which merged ROCm#271 dep-engine and ROCm#262 GPU CDI), the exit-code test sites needed: JobStart now carries per_node_alloc and ResourceAllocations (was ResourceSet), and the dep-engine set_terminal helper's JobComplete needs the signal field this branch added. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
spurd runs in the background with SIGINT/SIGQUIT/SIGHUP set to SIG_IGN, and job processes inherited that ignore mask — so a job's own `kill -INT $$` was a no-op and the job exited 0 instead of being reported as RaisedSignal:2. Reset these signals (plus SIGPIPE) to SIG_DFL in pre_exec so jobs start with default handlers, matching Slurm's signal reporting. Verified on hardware: self-SIGINT now yields ExitCode=0:2 Reason=RaisedSignal:2(Interrupt), matching Slurm 25.11.6.
…ments Prepend the agent's own bin dir (current_exe parent) to the job PATH so the Spur CLI symlinks (srun/sbatch/...) resolve inside batch scripts — previously bare `srun` failed with 127. Deployment-independent; no user config regardless of install location. Verified on hardware: bare `srun` now resolves and creates a step. Also condense the duplicated signal-reporting explanation (was repeated across agent_server.rs and two spurctld tests) to a single concise note.
DerivedExitCode was taken from node completions, so a no-srun job wrongly reported the batch exit and multi-step jobs lost per-step codes. Match Slurm: DerivedExitCode is the running max over srun step exit codes, maintained live and durably. - New JobStepComplete WAL op records each srun step's exit code via Raft, so the running max survives restart/replay. - run_step proposes it (srun now passes the step_id from create_job_step); the apply handler updates the step and the job's derived_exit_code max, excluding the reserved batch step (STEP_RESERVED_MIN). - Job finalize preserves the step-accumulated derived_exit_code instead of recomputing from nodes; a job with no srun steps stays 0:0. Verified on hardware vs Slurm 25.11.6: steps 7,3,2 -> ExitCode=2:0 DerivedExitCode=7:0; reversed -> 7:0/7:0; no-srun exit 6 -> 6:0/0:0; and the running max is visible live in scontrol while the job is RUNNING.
6cfc400 to
0335a28
Compare
- executor: reset job signal dispositions with sigaction (async-signal-safe) instead of signal() in pre_exec, which runs post-fork in a threaded process. - job.rs: correct derived_exit_code field/docstring — it is the running max over srun steps (via JobStepComplete), not node completions; the node-based 4th return value of derived_completion is legacy and must not feed it. - job.rs: in the missing-primary fallback, rank a signaled node above a plain non-zero exit so a signal is never masked by a higher exit code. - proto: clarify derived_exit_code is the running max over srun steps. - wal test: assert the round-tripped JobNodeComplete fields instead of discarding the value, so a serialization regression would be caught.
…ering) Drop pre-1.0 migration scaffolding and dead code surfaced in review: - Job.exit_signal / derived_exit_code: Option<i32> -> i32 (no deployed WAL format to protect); removes unwrap_or(0)/Some() gymnastics at every site. - Drop #[serde(default)] from NodeCompletion fields and JobNodeComplete.signal; replace the fabricated old-format WAL test with a clean round-trip assertion. - Remove the dead JobComplete.signal WAL field (was hardcoded 0, ignored on apply) across the variant, apply handler, and all construction sites. - derived_completion now returns a 3-tuple; the 4th node-based max was unused in production and risked being mistaken for the job's DerivedExitCode. Share completion rendering across the CLIs via a new exit_fmt module so the surfaces cannot drift: - squeue %r now composes RaisedSignal:N(name) instead of the bare reason. - sacct drops the lossy negate-trick encoding for the shared code:signal form (signal half is 0 until spurdbd persists it — separate follow-up). - render_reason matches PendingReason::RaisedSignal.display() rather than a hardcoded literal. - Signal names use underscores (Segmentation_fault, Broken_pipe) to match Slurm 25.11.6, verified live. Reorder JobInfo proto fields so exit_signal/derived_exit_code (33/34) follow stdout_path/stderr_path (31/32). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
|
Thanks for the review @shiv-tyagi — all six addressed in cd64a89 (replies inline). Net −33 LOC; full suite + clippy + fmt clean. Re-verified live against Slurm 25.11.6 (single-node Spur vs Slurm cluster), re-running the full exit-code matrix — State / ExitCode / DerivedExitCode / Reason match across all 13 cases, and squeue One finding worth flagging: live testing showed Slurm renders multi-word signal names with underscores — |
Summary
Closes #269. Brings Spur's job exit-code reporting to Slurm parity, end to end:
exit:signalencoding,DerivedExitCode, andNonZeroExitCode/RaisedSignal:N(name)reasons, plus an expandedPendingReasonvocabulary.Previously Spur reported a bare lossy
ExitCode(signal collapsed to-1/128+sig), had noDerivedExitCodefield, and never set a completion reason — silently corrupting outcome data for anything scrapingscontrol show job/sacct.What changed
Exit status is threaded from the agent's process wait status through the report RPC, WAL, core
Job, display RPC, and CLI:spurd) decodes the rawWaitStatusinto(exit_code, signal)instead of collapsing it;WIFEXITED -> code:0,WIFSIGNALED -> 0:sig.exit_signalandderived_exit_codefields on the coreJoband protoJobInfo;JobComplete/JobNodeCompleteWAL ops carry the signal.NonZeroExitCodeon non-zero exit,RaisedSignalon signal death.DerivedExitCodeis the running max oversrunstep exit codes (Slurm semantics), maintained live and durably via a newJobStepCompleteWAL op; a job with no srun steps reports0:0.ExitCode=code:signal,DerivedExitCode=code:signal, and composes the dynamicRaisedSignal:N(name)string at display time.PendingReasontoward the Slurm vocabulary (NonZeroExitCode,RaisedSignal,OutOfMemory,JobLaunchFailure, ...).Two adjacent fixes uncovered during live verification are included: jobs now start with default signal dispositions (so
kill -INT $$is reported asRaisedSignal:2rather than being absorbed by spurd's inheritedSIG_IGN), and the Spur CLI bin dir is added to the jobPATH(derived from the agent's own location) sosrunresolves inside batch scripts.Exact Slurm semantics implemented
ExitCodeis the batch/primary process's raw wait status, not a max. A step killed by a signal surfaces to the batch shell as exit128+sig, so it appears asNonZeroExitCodewithExitCode=128+sig:0, notRaisedSignal.RaisedSignalonly applies when the batch process itself is signaled.DerivedExitCodeis the max exit code across steps.Testing
Unit and integration coverage across
spur-core,spurctld,spurd, andspur-cli(derived-completion rules, signal decoding, reason emission, RPC-level signaled-completion recovery, WAL signal round-trip, step running-max). Full suite passes;cargo fmt --check --allandclippyclean.Verified live against Slurm 25.11.6 on a single-node testbed: clean exit, single non-zero, multi-step
srunordering (ExitCode=last,DerivedExitCode=max), no-srun (DerivedExitCode=0:0), and self-signal cases (SIGINT/SIGTERM/SIGABRT->RaisedSignal:N(name)), including the live running-max visible inscontrol show jobmid-run.Out of scope (follow-ups)
sacctstep history inspurdbd(liveDerivedExitCodeworks via the controller; persisting per-step rows to the accounting DB is separate — tracked in #297). k8s backend signal reporting (currentlysignal: 0; the controller-side machinery already applies). A separately filed controller robustness issue (#273, openraft snapshot/log compaction) is unrelated to this change.Design docs
Design and implementation plan included under
docs/superpowers/.Live verification vs Slurm 25.11.6
Ran an expanded matrix on a Slurm 25.11.6 cluster and a single-node Spur cluster, submitting sequentially (each job to a terminal state before the next) to avoid cluster-saturation races. Format per cell:
State / ExitCode / DerivedExitCode / Reason.exit 0exit 13exit 200exit 137(plain)Key behaviors confirmed identical:
ExitCodefollows the batch script's last command (case 5 = 9:0 vs case 6 = 1:0 — same steps, reordered),DerivedExitCodeis the max over srun step exit codes (cases 5–7), a plainexit 137stays137:0(not mis-read as signal 9), and all batch signals report0:sigwithRaisedSignal:N(name).Note on DerivedExitCode signal half
Spur tracks
DerivedExitCodeas the running max over srun step exit codes (the exit half), which matches Slurm in every stable case above. Slurm additionally folds a signaled step's signal into the DerivedExitCode signal half in some cases, but that behavior is order-dependent and inconsistent in Slurm itself (e.g. a signaled step followed by an exiting step yields0:0; with shell control flow it instead surfaces asExitCode=0:sig). Replicating it would mean matching undefined Slurm internals, so Spur deliberately does not — the exit-half parity is what consumers rely on.