Skip to content

feat(spur): add job suspend/resume functionality#275

Merged
shiv-tyagi merged 30 commits into
ROCm:mainfrom
yansun1996:parity/suspend-resume
Jun 15, 2026
Merged

feat(spur): add job suspend/resume functionality#275
shiv-tyagi merged 30 commits into
ROCm:mainfrom
yansun1996:parity/suspend-resume

Conversation

@yansun1996

@yansun1996 yansun1996 commented Jun 12, 2026

Copy link
Copy Markdown
Member

Closes #270.

Summary

Adds Slurm-parity job suspend/resume to Spur, end to end: SuspendJob/ResumeJob controller RPCs, SIGSTOP/SIGCONT agent dispatch, spur scontrol suspend|resume (+ scontrol symlink), and full suspended-time accounting (excluded from both run-time and time-limit enforcement). A suspended job retains its node allocation (plain scontrol suspend semantics).

Previously scontrol suspend <id> / resume <id> were unrecognized subcommands, there were no suspend/resume RPCs, and the agent never issued SIGSTOP/SIGCONT — only the JOB_SUSPENDED enum value existed, with nothing driving a job into or out of it.

What changed

  • proto (proto/slurm.proto): SuspendJob/ResumeJob controller RPCs + request messages; AgentSuspendJob agent RPC (resume bool selects SIGSTOP vs SIGCONT).
  • core (spur-core): Job.suspended_at / suspended_secs fields; run_time() excludes suspended time (clamped ≥ 0); new effective_deadline(). Two timestamped WAL ops JobSuspend/JobResume for replay-deterministic accounting.
  • controller (spurctld): suspend_job/resume_job cluster methods (state-guarded, propose through Raft, allocation retained — no dealloc) and RPC handlers (leader-forward, failed_precondition on bad state, fan-out dispatch to every allocated agent). Time-limit enforcer now uses effective_deadline, so a resumed job regains the budget it lost while suspended; suspended jobs are excluded from the timeout scan.
  • agent (spurd): suspend_job handler issues SIGSTOP/SIGCONT via the existing kill_signal path.
  • CLI (spur-cli): scontrol suspend|resume <id> subcommands; suspended jobs are kept visible in the default squeue view (ST=S).

Two adjacent fixes uncovered during verification (included)

  1. Whole-process-group signalling. Managed (non-container) jobs were spawned in spurd's own process group and signalled by tracked PID only, so SIGSTOP froze the batch shell but not its children (e.g. an inner sleep), and cancel left orphaned processes reparented to init. Jobs are now spawned as their own process-group leader (process_group(0)) and signalled by group (negative pid). This also fixes the pre-existing cancel-orphan bug. (Container/Forked jobs already handled this via kill_process_tree.)
  2. Finalize-on-death from SUSPENDED. Making SUSPENDED reachable exposed a state-machine gap: a suspended job whose process died out-of-band (OOM, external kill, node loss) hit a rejected Suspended → terminal transition and stranded in SUSPENDED. Now permits Suspended → {Completed, Failed, Timeout, NodeFail}.

Out of scope (documented non-goals)

  • Authorization: the RPC user field is advisory (carried for parity/forward-compat), matching how cancel_job currently treats it. Slurm requires privilege for suspend/resume; Spur does not enforce it yet — a separate cross-cutting effort.
  • k8s backend: VirtualAgent::suspend_job is a documented no-op (controller-side state change still applies); pod-level SIGSTOP/SIGCONT is not modeled.

Testing

Unit + integration coverage across spur-core, spurctld, spurd, and spur-cli (state-machine guards, run-time/deadline accounting, WAL round-trip + replay determinism, cluster-method guards, allocation retention, real-process SIGSTOP/SIGCONT). New pytest e2e suite tests/e2e/native_host/test_suspend_resume.py (12 scenarios). Full workspace cargo test passes; cargo fmt --check and clippy --all-targets --all-features clean.

Live parity verification vs Slurm 25.11.6

Ran each scenario on a single-node Spur cluster and a Slurm 25.11.6 cluster and compared. 17/17 behaviors match.

# Scenario Slurm 25.11.6 Spur Match
1 Suspend running job ST=S ST=S
2 Resume suspended job ST=R ST=R
3 scontrol show while suspended JobState=SUSPENDED JobState=SUSPENDED
4 squeue ST code (suspended) S S
5 Allocation retained (sinfo) node alloc/mix node alloc/mix
6 Process tree freeze (sleep child) T (stopped) T (stopped)
7 Process tree thaw on resume running running
8 Run-time excludes suspended interval Δ0s over 14s suspended Δ0s over 14s suspended
9 Time-limit not consumed while suspended still S after 30s (20s limit) still SUSPENDED
10 Cancel suspended → terminal terminal CANCELLED
11 Cancel leaves no orphan processes 0 orphans 0 orphans
12 Multiple suspend/resume cycles (3×) OK OK
13 Suspend a PENDING job rejected, stays PD rejected, stays PENDING
14 Suspend a terminal job rejected rejected
15 Resume a RUNNING job rejected rejected
16 Suspend unknown job id error error
17 Double-suspend stays S stays SUSPENDED

The headline parity claims — run-time exclusion (8), time-limit freeze (9), whole-tree freeze (6) — are identical to real Slurm. Item 4 (suspended jobs visible in default squeue) was a gap found during this comparison and fixed in this PR.

Known cosmetic difference (not behavioral)

Rejection message wording differs (Spur emits a generic "suspend failed"; Slurm gives specifics like "Job is not suspended"). All rejections produce the correct outcome and a non-zero exit code; only the human-readable text differs.

🤖 Generated with Claude Code

@codecov-commenter

codecov-commenter commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.03448% with 81 lines in your changes missing coverage. Please review.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #275      +/-   ##
==========================================
+ Coverage   64.61%   64.98%   +0.38%     
==========================================
  Files         119      120       +1     
  Lines       31600    32174     +574     
==========================================
+ Hits        20416    20908     +492     
- Misses      11184    11266      +82     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yansun1996 yansun1996 marked this pull request as ready for review June 12, 2026 16:43
@yansun1996 yansun1996 requested a review from powderluv as a code owner June 12, 2026 16:43
Copilot AI review requested due to automatic review settings June 12, 2026 16:43

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end job suspend/resume support to Spur with Slurm-like semantics (controller RPCs, agent SIGSTOP/SIGCONT dispatch, CLI scontrol suspend|resume, and suspended-time accounting that pauses runtime/time-limit enforcement).

Changes:

  • Added controller/agent RPCs and CLI commands to suspend/resume jobs, including controller fan-out to all allocated agents.
  • Implemented suspended-time bookkeeping (suspended_at / suspended_secs), updated runtime/deadline calculations, and added WAL ops for replay-deterministic accounting.
  • Fixed managed-job process handling by creating a per-job process group and signaling the whole group to correctly freeze/thaw/cancel the full process tree.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/e2e/native_host/test_suspend_resume.py New native-host E2E suspend/resume scenarios (state, signaling, accounting, persistence, multi-node dispatch).
tests/e2e/native_host/cluster.py Adds controller restart helper for E2E persistence coverage.
proto/slurm.proto Adds SuspendJob/ResumeJob controller RPCs and agent SuspendJob RPC message.
crates/spurd/src/executor.rs Spawns managed jobs in their own process group; signals process group for managed jobs.
crates/spurd/src/agent_server.rs Implements agent SuspendJob RPC and unit test for SIGSTOP/SIGCONT behavior.
crates/spurctld/src/server.rs Adds SuspendJob/ResumeJob RPC handlers with leader-forwarding and agent fan-out.
crates/spurctld/src/scheduler_loop.rs Adds suspend/resume dispatch helper; time-limit enforcement uses effective deadlines.
crates/spurctld/src/cluster.rs Adds cluster suspend/resume methods + WAL apply logic + unit tests for accounting/guards.
crates/spur-tests/src/t60_suspend.rs Adds core state-machine/accounting tests for suspend/resume.
crates/spur-tests/src/lib.rs Registers the new T60 test module.
crates/spur-k8s/src/agent.rs Adds a no-op suspend/resume implementation for the k8s backend.
crates/spur-core/src/wal.rs Adds WAL variants for suspend/resume plus serde round-trip tests.
crates/spur-core/src/job.rs Adds suspended-time fields and updates run_time()/effective_deadline() + transition rules.
crates/spur-cli/src/squeue.rs Includes SUSPENDED in default squeue view.
crates/spur-cli/src/scontrol.rs Adds `scontrol suspend

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/spur-core/src/job.rs
Comment thread crates/spurctld/src/server.rs
Comment thread crates/spurctld/src/server.rs
@shiv-tyagi shiv-tyagi changed the title feat(spur): job suspend/resume (Slurm parity, #270) feat(spur): add job suspend/resume functionality Jun 15, 2026

@shiv-tyagi shiv-tyagi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Few small comments. Good to merge once you post a reply to those (or push a fix).

Comment thread crates/spur-core/src/job.rs
Comment thread tests/e2e/native_host/test_suspend_resume.py Outdated
yansun1996 and others added 20 commits June 15, 2026 09:43
Spec for SuspendJob/ResumeJob RPCs, SIGSTOP/SIGCONT dispatch, scontrol
suspend|resume, and full suspended-time accounting (run_time + time-limit
exclusion via dedicated timestamped WAL ops). Retains allocation; user field
advisory per existing convention.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Mirror the suspend apply arm so suspended-time accounting only mutates on a
successful Suspended->Running transition, avoiding an inconsistent record on
the (currently unreachable) failure branch.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Replace 200ms fixed sleeps with a bounded poll loop (faster, deterministic)
and drop a redundant cfg(test) attribute.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The k8s VirtualAgent implements the SlurmAgent trait, which now requires
suspend_job. Pod-level SIGSTOP/SIGCONT is out of scope for the k8s backend
(the controller-side state change still applies), so accept it as a no-op.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Making Suspended reachable exposed a state-machine gap: a suspended job whose
process dies out-of-band (OOM, external kill, node loss) hit a rejected
Suspended->terminal transition in the completion apply path and stranded in
SUSPENDED forever. Permit Suspended -> {Completed,Failed,Timeout,NodeFail},
mirroring Slurm finalizing a suspended job that exits. Covered by t60_7.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Managed (non-container) jobs were spawned in spurd's own process group and
kill_signal targeted only the tracked batch PID, so SIGSTOP/SIGCONT (suspend)
and SIGTERM/SIGKILL (cancel) froze/killed the batch shell but not its children
(e.g. an inner `sleep`) — leaving the workload running and, on cancel, orphaned
processes reparented to init.

Spawn managed jobs as their own process-group leader (process_group(0)) and
signal the group (negative pid) so the whole job tree is reached. Container
(Forked) jobs already handled this via kill_process_tree. Test harness spawns
updated to match the real launch (own process group).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
State-machine guards (suspend from cancelled/failed/completing, resume to
terminal), accounting (cancel-while-suspended folds open window, multi-cycle
accumulation), and cluster-method guards (suspend/resume reject pending/
running/unknown, double-suspend/double-resume rejected, allocation retained
while suspended, enforcer scan excludes suspended jobs).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
yansun1996 and others added 9 commits June 15, 2026 09:43
Twelve scenarios covering the suspend/resume feature end-to-end against real
daemons: state transitions, scontrol display, multi-cycle, whole-process-tree
freeze/thaw, cancel-leaves-no-orphans, run-time exclusion, time-limit not
consumed while suspended, allocation retention, CLI guards (suspend pending /
unknown id), controller-restart persistence, and multi-node dispatch fan-out.

Adds SpurCluster.restart_controller() so the persistence test can bounce the
controller without requiring nodes to be idle (a suspended job keeps its
allocation). Verified live: 11 passed, multi-node skips on single-node beds.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Slurm's default squeue view includes SUSPENDED jobs; Spur's default state
filter (PD/R/CG) hid them, so a suspended job vanished from squeue. Add
JobSuspended to the default set. Found via live parity testing vs Slurm 25.11.6.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
These superpowers spec/plan files are local working artifacts (and contained
lab-specific host/credential references); they do not belong in the upstream
tree.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
CI runs many tests on a shared cluster, where the original checks broke:
- `pgrep -f 'sleep <secs>'` self-matched the very shell running the check
  (its command line contains the literal pattern), yielding a phantom extra
  process. Filter `pgrep -x sleep` (exact binary) by reading each cmdline
  instead, so the matcher never self-matches.
- a failing process-state assertion aborted before scancel, leaving a
  SUSPENDED job holding the node and wedging later tests. Every test now
  cancels its job in a finally block.
- multi-node dispatch assertion matched the agent hostname, which CI masks;
  assert the suspend-dispatch count equals the node count instead, stripping
  ANSI color codes from the controller log first.

Verified: 11 passed, 1 skipped (multi-node) on a single-node bed.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The agent spawns the batch sleep child a moment after the job flips to RUNNING,
and signal delivery + state change is not instant. The tag-scoped process check
correctly returned empty in that window (the old pgrep -x sleep masked it by
matching foreign sleeps), so the first check raced and failed in CI. Poll for
the sleep to appear, and poll for the expected stopped/running state after
suspend/resume, instead of asserting immediately. Also loosen the cmdline match
to tolerate a missing trailing space.

Verified: 11 passed, 1 skipped on a single-node bed.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
- Clear suspended_at when a Suspended job transitions directly to a terminal
  state (e.g. cancelled while suspended), folding the final suspended interval
  into suspended_secs. Keeps the documented invariant "suspended_at.is_some()
  == currently suspended" and the run_time/accounting math correct.
- suspend_job / resume_job now return NOT_FOUND for unknown job ids (consistent
  with get_job) instead of failed_precondition; resume snapshots the job
  up-front (allocation is retained across resume, so allocated_nodes is valid
  for agent dispatch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address review (shiv-tyagi): a suspended job whose task reports completion
out-of-band now transitions through Completing (Slurm JOB_COMPLETING) instead
of jumping straight to a terminal state. Adds the (Suspended, Completing) edge
and widens the JobNodeComplete guard to Running|Suspended; derived_completion
only yields Completed/Failed, both valid from Completing, so no strand.

Also trim verbose comments and drop the issue-number reference in the
suspend/resume e2e docstring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@shiv-tyagi

shiv-tyagi commented Jun 15, 2026

Copy link
Copy Markdown
Member

One merge conflict.

Good to go once resolved.

@yansun1996 yansun1996 force-pushed the parity/suspend-resume branch from 5739bc3 to be63ffd Compare June 15, 2026 09:47
shiv-tyagi
shiv-tyagi previously approved these changes Jun 15, 2026

@shiv-tyagi shiv-tyagi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Will merge as CI and E2Es go green.

Thanks a lot @yansun1996.

@shiv-tyagi shiv-tyagi added the merge-on-ci-pass Issue has been reviewed and approved, and can be merged as soon as the CI goes green. label Jun 15, 2026
@shiv-tyagi

Copy link
Copy Markdown
Member

=================================== FAILURES ===================================
____ TestSuspendProcessSemantics.test_whole_process_tree_freezes_and_thaws _____

self = <test_suspend_resume.TestSuspendProcessSemantics object at 0x7ba194395fd0>
cluster = <cluster.SpurCluster object at 0x7ba19440ae40>

    def test_whole_process_tree_freezes_and_thaws(self, cluster):
        """SIGSTOP must reach the batch shell AND its sleep child (state 'T'),
        and SIGCONT must thaw both. Regression test for the managed-executor
        process-group fix."""
        job_id = None
        secs = 604  # unique tag for this job's sleep
        try:
            job_id = _submit_sleep(cluster, "sr-tree", sleep_secs=secs)
            node = cluster.nodes[0]
>           assert _wait_sleep_states(node, secs), "no sleep process found for this job"
E           AssertionError: no sleep process found for this job
E           assert []
E            +  where [] = _wait_sleep_states(<cluster.SshNode object at 0x7ba19453fb60>, 604)

/tmp/spur-e2e-assets-27538218148/tests/e2e/native_host/test_suspend_resume.py:152: AssertionError
=========================== short test summary info ============================
FAILED ../../../../../../tmp/spur-e2e-assets-27538218148/tests/e2e/native_host/test_suspend_resume.py::TestSuspendProcessSemantics::test_whole_process_tree_freezes_and_thaws - AssertionError: no sleep process found for this job
assert []
 +  where [] = _wait_sleep_states(<cluster.SshNode object at 0x7ba19453fb60>, 604)
=================== 1 failed, 80 passed in 896.47s (0:14:56) ===================

one e2e failure.

Can you try running against a local cluster to verify if it is legit on an infra issue?

test_whole_process_tree_freezes_and_thaws inspected only cluster.nodes[0],
but a -N 1 job can land on any node of a multi-node cluster (CI runs several),
so it intermittently found no sleep process and failed. The sleep-state
helpers now search every node; the unique sleep_secs tag keeps the match to
this job. Same fix applied to the cancel-orphan check (a latent nodes[0]-only
false-pass). The multi-node dispatch test keeps its intentional primary-node
assertion via a single-element node list.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@shiv-tyagi shiv-tyagi merged commit 87977b1 into ROCm:main Jun 15, 2026
12 checks passed
@shiv-tyagi

Copy link
Copy Markdown
Member

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-on-ci-pass Issue has been reviewed and approved, and can be merged as soon as the CI goes green.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

parity(slurm): close suspend/resume gap — add SuspendJob/ResumeJob RPCs, SIGSTOP/SIGCONT, and scontrol suspend|resume

4 participants