feat(spur): add job suspend/resume functionality#275
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #275 +/- ##
==========================================
+ Coverage 64.61% 64.98% +0.38%
==========================================
Files 119 120 +1
Lines 31600 32174 +574
==========================================
+ Hits 20416 20908 +492
- Misses 11184 11266 +82 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds end-to-end job suspend/resume support to Spur with Slurm-like semantics (controller RPCs, agent SIGSTOP/SIGCONT dispatch, CLI scontrol suspend|resume, and suspended-time accounting that pauses runtime/time-limit enforcement).
Changes:
- Added controller/agent RPCs and CLI commands to suspend/resume jobs, including controller fan-out to all allocated agents.
- Implemented suspended-time bookkeeping (
suspended_at/suspended_secs), updated runtime/deadline calculations, and added WAL ops for replay-deterministic accounting. - Fixed managed-job process handling by creating a per-job process group and signaling the whole group to correctly freeze/thaw/cancel the full process tree.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/native_host/test_suspend_resume.py | New native-host E2E suspend/resume scenarios (state, signaling, accounting, persistence, multi-node dispatch). |
| tests/e2e/native_host/cluster.py | Adds controller restart helper for E2E persistence coverage. |
| proto/slurm.proto | Adds SuspendJob/ResumeJob controller RPCs and agent SuspendJob RPC message. |
| crates/spurd/src/executor.rs | Spawns managed jobs in their own process group; signals process group for managed jobs. |
| crates/spurd/src/agent_server.rs | Implements agent SuspendJob RPC and unit test for SIGSTOP/SIGCONT behavior. |
| crates/spurctld/src/server.rs | Adds SuspendJob/ResumeJob RPC handlers with leader-forwarding and agent fan-out. |
| crates/spurctld/src/scheduler_loop.rs | Adds suspend/resume dispatch helper; time-limit enforcement uses effective deadlines. |
| crates/spurctld/src/cluster.rs | Adds cluster suspend/resume methods + WAL apply logic + unit tests for accounting/guards. |
| crates/spur-tests/src/t60_suspend.rs | Adds core state-machine/accounting tests for suspend/resume. |
| crates/spur-tests/src/lib.rs | Registers the new T60 test module. |
| crates/spur-k8s/src/agent.rs | Adds a no-op suspend/resume implementation for the k8s backend. |
| crates/spur-core/src/wal.rs | Adds WAL variants for suspend/resume plus serde round-trip tests. |
| crates/spur-core/src/job.rs | Adds suspended-time fields and updates run_time()/effective_deadline() + transition rules. |
| crates/spur-cli/src/squeue.rs | Includes SUSPENDED in default squeue view. |
| crates/spur-cli/src/scontrol.rs | Adds `scontrol suspend |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
shiv-tyagi
left a comment
There was a problem hiding this comment.
Nice work. Few small comments. Good to merge once you post a reply to those (or push a fix).
Spec for SuspendJob/ResumeJob RPCs, SIGSTOP/SIGCONT dispatch, scontrol suspend|resume, and full suspended-time accounting (run_time + time-limit exclusion via dedicated timestamped WAL ops). Retains allocation; user field advisory per existing convention. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Mirror the suspend apply arm so suspended-time accounting only mutates on a successful Suspended->Running transition, avoiding an inconsistent record on the (currently unreachable) failure branch. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Replace 200ms fixed sleeps with a bounded poll loop (faster, deterministic) and drop a redundant cfg(test) attribute. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The k8s VirtualAgent implements the SlurmAgent trait, which now requires suspend_job. Pod-level SIGSTOP/SIGCONT is out of scope for the k8s backend (the controller-side state change still applies), so accept it as a no-op. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Making Suspended reachable exposed a state-machine gap: a suspended job whose
process dies out-of-band (OOM, external kill, node loss) hit a rejected
Suspended->terminal transition in the completion apply path and stranded in
SUSPENDED forever. Permit Suspended -> {Completed,Failed,Timeout,NodeFail},
mirroring Slurm finalizing a suspended job that exits. Covered by t60_7.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Managed (non-container) jobs were spawned in spurd's own process group and kill_signal targeted only the tracked batch PID, so SIGSTOP/SIGCONT (suspend) and SIGTERM/SIGKILL (cancel) froze/killed the batch shell but not its children (e.g. an inner `sleep`) — leaving the workload running and, on cancel, orphaned processes reparented to init. Spawn managed jobs as their own process-group leader (process_group(0)) and signal the group (negative pid) so the whole job tree is reached. Container (Forked) jobs already handled this via kill_process_tree. Test harness spawns updated to match the real launch (own process group). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
State-machine guards (suspend from cancelled/failed/completing, resume to terminal), accounting (cancel-while-suspended folds open window, multi-cycle accumulation), and cluster-method guards (suspend/resume reject pending/ running/unknown, double-suspend/double-resume rejected, allocation retained while suspended, enforcer scan excludes suspended jobs). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Twelve scenarios covering the suspend/resume feature end-to-end against real daemons: state transitions, scontrol display, multi-cycle, whole-process-tree freeze/thaw, cancel-leaves-no-orphans, run-time exclusion, time-limit not consumed while suspended, allocation retention, CLI guards (suspend pending / unknown id), controller-restart persistence, and multi-node dispatch fan-out. Adds SpurCluster.restart_controller() so the persistence test can bounce the controller without requiring nodes to be idle (a suspended job keeps its allocation). Verified live: 11 passed, multi-node skips on single-node beds. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Slurm's default squeue view includes SUSPENDED jobs; Spur's default state filter (PD/R/CG) hid them, so a suspended job vanished from squeue. Add JobSuspended to the default set. Found via live parity testing vs Slurm 25.11.6. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
These superpowers spec/plan files are local working artifacts (and contained lab-specific host/credential references); they do not belong in the upstream tree. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
CI runs many tests on a shared cluster, where the original checks broke: - `pgrep -f 'sleep <secs>'` self-matched the very shell running the check (its command line contains the literal pattern), yielding a phantom extra process. Filter `pgrep -x sleep` (exact binary) by reading each cmdline instead, so the matcher never self-matches. - a failing process-state assertion aborted before scancel, leaving a SUSPENDED job holding the node and wedging later tests. Every test now cancels its job in a finally block. - multi-node dispatch assertion matched the agent hostname, which CI masks; assert the suspend-dispatch count equals the node count instead, stripping ANSI color codes from the controller log first. Verified: 11 passed, 1 skipped (multi-node) on a single-node bed. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The agent spawns the batch sleep child a moment after the job flips to RUNNING, and signal delivery + state change is not instant. The tag-scoped process check correctly returned empty in that window (the old pgrep -x sleep masked it by matching foreign sleeps), so the first check raced and failed in CI. Poll for the sleep to appear, and poll for the expected stopped/running state after suspend/resume, instead of asserting immediately. Also loosen the cmdline match to tolerate a missing trailing space. Verified: 11 passed, 1 skipped on a single-node bed. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
- Clear suspended_at when a Suspended job transitions directly to a terminal state (e.g. cancelled while suspended), folding the final suspended interval into suspended_secs. Keeps the documented invariant "suspended_at.is_some() == currently suspended" and the run_time/accounting math correct. - suspend_job / resume_job now return NOT_FOUND for unknown job ids (consistent with get_job) instead of failed_precondition; resume snapshots the job up-front (allocation is retained across resume, so allocated_nodes is valid for agent dispatch). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address review (shiv-tyagi): a suspended job whose task reports completion out-of-band now transitions through Completing (Slurm JOB_COMPLETING) instead of jumping straight to a terminal state. Adds the (Suspended, Completing) edge and widens the JobNodeComplete guard to Running|Suspended; derived_completion only yields Completed/Failed, both valid from Completing, so no strand. Also trim verbose comments and drop the issue-number reference in the suspend/resume e2e docstring. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
One merge conflict. Good to go once resolved. |
5739bc3 to
be63ffd
Compare
There was a problem hiding this comment.
Approved. Will merge as CI and E2Es go green.
Thanks a lot @yansun1996.
one e2e failure. Can you try running against a local cluster to verify if it is legit on an infra issue? |
test_whole_process_tree_freezes_and_thaws inspected only cluster.nodes[0], but a -N 1 job can land on any node of a multi-node cluster (CI runs several), so it intermittently found no sleep process and failed. The sleep-state helpers now search every node; the unique sleep_secs tag keeps the match to this job. Same fix applied to the cancel-orphan check (a latent nodes[0]-only false-pass). The multi-node dispatch test keeps its intentional primary-node assertion via a single-element node list. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Merged, thanks! |
Closes #270.
Summary
Adds Slurm-parity job suspend/resume to Spur, end to end:
SuspendJob/ResumeJobcontroller RPCs, SIGSTOP/SIGCONT agent dispatch,spur scontrol suspend|resume(+scontrolsymlink), and full suspended-time accounting (excluded from both run-time and time-limit enforcement). A suspended job retains its node allocation (plainscontrol suspendsemantics).Previously
scontrol suspend <id>/resume <id>were unrecognized subcommands, there were no suspend/resume RPCs, and the agent never issued SIGSTOP/SIGCONT — only theJOB_SUSPENDEDenum value existed, with nothing driving a job into or out of it.What changed
proto/slurm.proto):SuspendJob/ResumeJobcontroller RPCs + request messages;AgentSuspendJobagent RPC (resumebool selects SIGSTOP vs SIGCONT).spur-core):Job.suspended_at/suspended_secsfields;run_time()excludes suspended time (clamped ≥ 0); neweffective_deadline(). Two timestamped WAL opsJobSuspend/JobResumefor replay-deterministic accounting.spurctld):suspend_job/resume_jobcluster methods (state-guarded, propose through Raft, allocation retained — no dealloc) and RPC handlers (leader-forward,failed_preconditionon bad state, fan-out dispatch to every allocated agent). Time-limit enforcer now useseffective_deadline, so a resumed job regains the budget it lost while suspended; suspended jobs are excluded from the timeout scan.spurd):suspend_jobhandler issues SIGSTOP/SIGCONT via the existingkill_signalpath.spur-cli):scontrol suspend|resume <id>subcommands; suspended jobs are kept visible in the defaultsqueueview (ST=S).Two adjacent fixes uncovered during verification (included)
sleep), and cancel left orphaned processes reparented to init. Jobs are now spawned as their own process-group leader (process_group(0)) and signalled by group (negative pid). This also fixes the pre-existing cancel-orphan bug. (Container/Forkedjobs already handled this viakill_process_tree.)SUSPENDEDreachable exposed a state-machine gap: a suspended job whose process died out-of-band (OOM, external kill, node loss) hit a rejectedSuspended → terminaltransition and stranded in SUSPENDED. Now permitsSuspended → {Completed, Failed, Timeout, NodeFail}.Out of scope (documented non-goals)
userfield is advisory (carried for parity/forward-compat), matching howcancel_jobcurrently treats it. Slurm requires privilege for suspend/resume; Spur does not enforce it yet — a separate cross-cutting effort.VirtualAgent::suspend_jobis a documented no-op (controller-side state change still applies); pod-level SIGSTOP/SIGCONT is not modeled.Testing
Unit + integration coverage across
spur-core,spurctld,spurd, andspur-cli(state-machine guards, run-time/deadline accounting, WAL round-trip + replay determinism, cluster-method guards, allocation retention, real-process SIGSTOP/SIGCONT). New pytest e2e suitetests/e2e/native_host/test_suspend_resume.py(12 scenarios). Full workspacecargo testpasses;cargo fmt --checkandclippy --all-targets --all-featuresclean.Live parity verification vs Slurm 25.11.6
Ran each scenario on a single-node Spur cluster and a Slurm 25.11.6 cluster and compared. 17/17 behaviors match.
ST=SST=SST=RST=Rscontrol showwhile suspendedJobState=SUSPENDEDJobState=SUSPENDEDsqueueST code (suspended)SST(stopped)T(stopped)Safter 30s (20s limit)CANCELLEDPDPENDINGSThe headline parity claims — run-time exclusion (8), time-limit freeze (9), whole-tree freeze (6) — are identical to real Slurm. Item 4 (suspended jobs visible in default
squeue) was a gap found during this comparison and fixed in this PR.Known cosmetic difference (not behavioral)
Rejection message wording differs (Spur emits a generic "suspend failed"; Slurm gives specifics like "Job is not suspended"). All rejections produce the correct outcome and a non-zero exit code; only the human-readable text differs.
🤖 Generated with Claude Code