feat(spur): add job suspend/resume functionality by yansun1996 · Pull Request #275 · ROCm/spur

yansun1996 · 2026-06-12T03:33:26Z

Closes #270.

Summary

Adds Slurm-parity job suspend/resume to Spur, end to end: SuspendJob/ResumeJob controller RPCs, SIGSTOP/SIGCONT agent dispatch, spur scontrol suspend|resume (+ scontrol symlink), and full suspended-time accounting (excluded from both run-time and time-limit enforcement). A suspended job retains its node allocation (plain scontrol suspend semantics).

Previously scontrol suspend <id> / resume <id> were unrecognized subcommands, there were no suspend/resume RPCs, and the agent never issued SIGSTOP/SIGCONT — only the JOB_SUSPENDED enum value existed, with nothing driving a job into or out of it.

What changed

proto (proto/slurm.proto): SuspendJob/ResumeJob controller RPCs + request messages; AgentSuspendJob agent RPC (resume bool selects SIGSTOP vs SIGCONT).
core (spur-core): Job.suspended_at / suspended_secs fields; run_time() excludes suspended time (clamped ≥ 0); new effective_deadline(). Two timestamped WAL ops JobSuspend/JobResume for replay-deterministic accounting.
controller (spurctld): suspend_job/resume_job cluster methods (state-guarded, propose through Raft, allocation retained — no dealloc) and RPC handlers (leader-forward, failed_precondition on bad state, fan-out dispatch to every allocated agent). Time-limit enforcer now uses effective_deadline, so a resumed job regains the budget it lost while suspended; suspended jobs are excluded from the timeout scan.
agent (spurd): suspend_job handler issues SIGSTOP/SIGCONT via the existing kill_signal path.
CLI (spur-cli): scontrol suspend|resume <id> subcommands; suspended jobs are kept visible in the default squeue view (ST=S).
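The suspended-time accounting described in the core and controller items can be sketched as follows. This is a minimal Python model, not the spur-core implementation: the field names `suspended_at` and `suspended_secs` and the method names `run_time` and `effective_deadline` come from the description above, while the `Job` layout, `start_time`, `time_limit_secs`, and the `suspend`/`resume` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical stand-in for the job record; times are Unix seconds."""
    start_time: int
    time_limit_secs: int
    suspended_at: Optional[int] = None  # set only while the job is suspended
    suspended_secs: int = 0             # total length of completed suspensions

    def suspend(self, now: int) -> None:
        if self.suspended_at is None:
            self.suspended_at = now

    def resume(self, now: int) -> None:
        if self.suspended_at is not None:
            self.suspended_secs += max(0, now - self.suspended_at)
            self.suspended_at = None

    def run_time(self, now: int) -> int:
        # While suspended, the clock is frozen at the suspension instant.
        end = self.suspended_at if self.suspended_at is not None else now
        # Clamp at zero so clock skew can never yield a negative run time.
        return max(0, end - self.start_time - self.suspended_secs)

    def effective_deadline(self) -> int:
        # Every completed suspension pushes the deadline out by its length.
        return self.start_time + self.time_limit_secs + self.suspended_secs


job = Job(start_time=1000, time_limit_secs=20)
job.suspend(now=1005)
print(job.run_time(now=1019))    # 5: frozen while suspended
job.resume(now=1019)
print(job.run_time(now=1021))    # 7: the 14 suspended seconds are excluded
print(job.effective_deadline())  # 1034: original deadline 1020 plus 14
```

In this model the deadline only moves on resume, which works as long as suspended jobs are skipped by the timeout scan, as the controller item states.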

Two adjacent fixes uncovered during verification (included)

Whole-process-group signalling. Managed (non-container) jobs were spawned in spurd's own process group and signalled by tracked PID only, so SIGSTOP froze the batch shell but not its children (e.g. an inner sleep), and cancel left orphaned processes reparented to init. Jobs are now spawned as their own process-group leader (process_group(0)) and signalled by group (negative pid). This also fixes the pre-existing cancel-orphan bug. (Container/Forked jobs already handled this via kill_process_tree.)
Finalize-on-death from SUSPENDED. Making SUSPENDED reachable exposed a state-machine gap: a suspended job whose process died out-of-band (OOM, external kill, node loss) hit a rejected Suspended → terminal transition and was stranded in SUSPENDED. The state machine now permits Suspended → {Completed, Failed, Timeout, NodeFail}.
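The whole-process-group fix can be illustrated with a short Unix-only Python sketch. It is an analogue, not the spurd code: the PR uses Rust's `process_group(0)`, while this sketch uses `start_new_session=True` as a portable Python stand-in (it calls `setsid()`, which also makes the child a process-group leader). The helper names `spawn_job` and `signal_job` are hypothetical.

```python
import os
import signal
import subprocess
import time


def spawn_job(script: str) -> subprocess.Popen:
    # setsid() in the child makes it the leader of a fresh process group,
    # so its pgid equals its pid and differs from the parent's group.
    return subprocess.Popen(["sh", "-c", script], start_new_session=True)


def signal_job(proc: subprocess.Popen, sig: int) -> None:
    # Signal the whole group (the shell and every child it started),
    # equivalent to kill(-pid, sig) with a negative pid.
    os.killpg(proc.pid, sig)


if __name__ == "__main__":
    job = spawn_job("sleep 30 & wait")
    time.sleep(0.2)                  # give the shell time to start sleep
    signal_job(job, signal.SIGSTOP)  # freezes the shell and the inner sleep
    signal_job(job, signal.SIGCONT)  # thaws both
    signal_job(job, signal.SIGTERM)  # cancel: no orphaned sleep is left
    job.wait()
```

Signalling only `job.pid` with `os.kill` would reproduce the original bug: the shell stops or dies while the inner `sleep` keeps running.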

Out of scope (documented non-goals)

Authorization: the RPC user field is advisory (carried for parity/forward-compat), matching how cancel_job currently treats it. Slurm requires privilege for suspend/resume; Spur does not enforce it yet — a separate cross-cutting effort.
k8s backend: VirtualAgent::suspend_job is a documented no-op (controller-side state change still applies); pod-level SIGSTOP/SIGCONT is not modeled.

Testing

Unit + integration coverage across spur-core, spurctld, spurd, and spur-cli (state-machine guards, run-time/deadline accounting, WAL round-trip + replay determinism, cluster-method guards, allocation retention, real-process SIGSTOP/SIGCONT). New pytest e2e suite tests/e2e/native_host/test_suspend_resume.py (12 scenarios). Full workspace cargo test passes; cargo fmt --check and clippy --all-targets --all-features clean.

Live parity verification vs Slurm 25.11.6

Ran each scenario on a single-node Spur cluster and a Slurm 25.11.6 cluster and compared. 17/17 behaviors match.

| # | Scenario | Slurm 25.11.6 | Spur | Match |
|---|---|---|---|---|
| 1 | Suspend running job | `ST=S` | `ST=S` | ✅ |
| 2 | Resume suspended job | `ST=R` | `ST=R` | ✅ |
| 3 | `scontrol show` while suspended | `JobState=SUSPENDED` | `JobState=SUSPENDED` | ✅ |
| 4 | `squeue` ST code (suspended) | `S` | `S` | ✅ |
| 5 | Allocation retained (sinfo) | node alloc/mix | node alloc/mix | ✅ |
| 6 | Process tree freeze (sleep child) | `T` (stopped) | `T` (stopped) | ✅ |
| 7 | Process tree thaw on resume | running | running | ✅ |
| 8 | Run-time excludes suspended interval | Δ0s over 14s suspended | Δ0s over 14s suspended | ✅ |
| 9 | Time-limit not consumed while suspended | still `S` after 30s (20s limit) | still SUSPENDED | ✅ |
| 10 | Cancel suspended → terminal | terminal | `CANCELLED` | ✅ |
| 11 | Cancel leaves no orphan processes | 0 orphans | 0 orphans | ✅ |
| 12 | Multiple suspend/resume cycles (3×) | OK | OK | ✅ |
| 13 | Suspend a PENDING job | rejected, stays `PD` | rejected, stays `PENDING` | ✅ |
| 14 | Suspend a terminal job | rejected | rejected | ✅ |
| 15 | Resume a RUNNING job | rejected | rejected | ✅ |
| 16 | Suspend unknown job id | error | error | ✅ |
| 17 | Double-suspend | stays `S` | stays SUSPENDED | ✅ |

The headline parity claims — run-time exclusion (8), time-limit freeze (9), whole-tree freeze (6) — are identical to real Slurm. Item 4 (suspended jobs visible in default squeue) was a gap found during this comparison and fixed in this PR.
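The rejection scenarios in the table (13 to 17) boil down to two state guards, which a small Python sketch can make concrete. This models only the observable outcomes listed above, not the spurctld cluster methods: the `jobs` dictionary, the exception type, and the function signatures are hypothetical. Treating a double-suspend as a rejection that leaves the state unchanged is also an assumption; the table only says the job stays suspended.

```python
class TransitionRejected(Exception):
    """Raised when a suspend/resume request hits a job in the wrong state."""


def suspend_job(jobs: dict, job_id: int) -> None:
    state = jobs.get(job_id)
    if state is None:
        raise KeyError(f"unknown job id {job_id}")
    if state != "RUNNING":  # PENDING, SUSPENDED and terminal states all fail
        raise TransitionRejected(f"cannot suspend job {job_id} in {state}")
    jobs[job_id] = "SUSPENDED"


def resume_job(jobs: dict, job_id: int) -> None:
    state = jobs.get(job_id)
    if state is None:
        raise KeyError(f"unknown job id {job_id}")
    if state != "SUSPENDED":
        raise TransitionRejected(f"cannot resume job {job_id} in {state}")
    jobs[job_id] = "RUNNING"


jobs = {1: "RUNNING", 2: "PENDING", 3: "COMPLETED"}
suspend_job(jobs, 1)                 # scenario 1: RUNNING -> SUSPENDED
for job_id in (1, 2, 3):             # scenarios 17, 13, 14
    try:
        suspend_job(jobs, job_id)
    except TransitionRejected:
        pass                         # rejected; state is left untouched
print(jobs)  # {1: 'SUSPENDED', 2: 'PENDING', 3: 'COMPLETED'}
```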

Known cosmetic difference (not behavioral)

Rejection message wording differs (Spur emits a generic "suspend failed"; Slurm gives specifics like "Job is not suspended"). All rejections produce the correct outcome and a non-zero exit code; only the human-readable text differs.

🤖 Generated with Claude Code

codecov-commenter · 2026-06-12T03:46:07Z

Codecov Report

❌ Patch coverage is 86.03448% with 81 lines in your changes missing coverage. Please review.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #275      +/-   ##
==========================================
+ Coverage   64.61%   64.98%   +0.38%     
==========================================
  Files         119      120       +1     
  Lines       31600    32174     +574     
==========================================
+ Hits        20416    20908     +492     
- Misses      11184    11266      +82


Copilot

Pull request overview

This PR adds end-to-end job suspend/resume support to Spur with Slurm-like semantics (controller RPCs, agent SIGSTOP/SIGCONT dispatch, CLI scontrol suspend|resume, and suspended-time accounting that pauses runtime/time-limit enforcement).

Changes:

Added controller/agent RPCs and CLI commands to suspend/resume jobs, including controller fan-out to all allocated agents.
Implemented suspended-time bookkeeping (suspended_at / suspended_secs), updated runtime/deadline calculations, and added WAL ops for replay-deterministic accounting.
Fixed managed-job process handling by creating a per-job process group and signaling the whole group to correctly freeze/thaw/cancel the full process tree.
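Why timestamped WAL ops make the accounting replay-deterministic can be shown with a small Python sketch. The op names `JobSuspend` and `JobResume` come from the PR description; the field names `job_id` and `at` and the `replay` function are assumptions for illustration, not the spur-core types. The point is that apply logic reads the time carried in the op rather than the wall clock, so replaying the same log always rebuilds the same totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSuspend:
    job_id: int
    at: int  # timestamp captured when the op was proposed (assumed field)


@dataclass(frozen=True)
class JobResume:
    job_id: int
    at: int


def replay(ops) -> dict:
    """Rebuild per-job suspended seconds purely from the logged timestamps."""
    suspended_at = {}
    suspended_secs = {}
    for op in ops:
        if isinstance(op, JobSuspend):
            suspended_at.setdefault(op.job_id, op.at)
        elif isinstance(op, JobResume):
            started = suspended_at.pop(op.job_id, None)
            if started is not None:
                total = suspended_secs.get(op.job_id, 0)
                suspended_secs[op.job_id] = total + max(0, op.at - started)
    return suspended_secs


log = [JobSuspend(7, 100), JobResume(7, 114),
       JobSuspend(7, 200), JobResume(7, 203)]
print(replay(log))  # {7: 17}
```

Had the apply step called a clock instead of reading `at`, a replay after a controller restart would compute a different suspended total than the original run.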

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

File	Description
tests/e2e/native_host/test_suspend_resume.py	New native-host E2E suspend/resume scenarios (state, signaling, accounting, persistence, multi-node dispatch).
tests/e2e/native_host/cluster.py	Adds controller restart helper for E2E persistence coverage.
proto/slurm.proto	Adds SuspendJob/ResumeJob controller RPCs and agent SuspendJob RPC message.
crates/spurd/src/executor.rs	Spawns managed jobs in their own process group; signals process group for managed jobs.
crates/spurd/src/agent_server.rs	Implements agent SuspendJob RPC and unit test for SIGSTOP/SIGCONT behavior.
crates/spurctld/src/server.rs	Adds SuspendJob/ResumeJob RPC handlers with leader-forwarding and agent fan-out.
crates/spurctld/src/scheduler_loop.rs	Adds suspend/resume dispatch helper; time-limit enforcement uses effective deadlines.
crates/spurctld/src/cluster.rs	Adds cluster suspend/resume methods + WAL apply logic + unit tests for accounting/guards.
crates/spur-tests/src/t60_suspend.rs	Adds core state-machine/accounting tests for suspend/resume.
crates/spur-tests/src/lib.rs	Registers the new T60 test module.
crates/spur-k8s/src/agent.rs	Adds a no-op suspend/resume implementation for the k8s backend.
crates/spur-core/src/wal.rs	Adds WAL variants for suspend/resume plus serde round-trip tests.
crates/spur-core/src/job.rs	Adds suspended-time fields and updates run_time()/effective_deadline() + transition rules.
crates/spur-cli/src/squeue.rs	Includes SUSPENDED in default `squeue` view.
crates/spur-cli/src/scontrol.rs	Adds `scontrol suspend


shiv-tyagi

Nice work. A few small comments. Good to merge once you reply to those (or push a fix).

Spec for SuspendJob/ResumeJob RPCs, SIGSTOP/SIGCONT dispatch, scontrol suspend|resume, and full suspended-time accounting (run_time + time-limit exclusion via dedicated timestamped WAL ops). Retains allocation; user field advisory per existing convention.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>