Skip to content

feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons#301

Open
yansun1996 wants to merge 11 commits into
ROCm:mainfrom
yansun1996:parity/reason-code-vocab
Open

feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons#301
yansun1996 wants to merge 11 commits into
ROCm:mainfrom
yansun1996:parity/reason-code-vocab

Conversation

@yansun1996

@yansun1996 yansun1996 commented Jun 15, 2026

Copy link
Copy Markdown
Member

Closes #300.

This PR closes the Slurm-parity gap where Spur shipped only ~14 PendingReason values against Slurm's ~50, and where the reasons Spur did compute often never reached the user. It has two parts: vocabulary, and the wiring that makes that vocabulary observable.

Vocabulary

Adds 33 PendingReason variants whose display strings match Slurm 25.11.6's job_reason_string() (table jsra[] in src/common/job_state_reason.c) byte-for-byte — these strings are scraped by CI gates and workflow engines, so exact parity matters. The set covers Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the full Assoc* and QOS* Grp/Max limit families, and BurstBuffer*. Slurm's deliberate casing inconsistency (QOS* uppercase vs Assoc*/Association* mixed) is reproduced exactly.

Emission status (important)

Most of these are vocabulary only at this stage — the variant and its exact Slurm string exist so external tooling recognizes the reason, but Spur does not yet compute it. Concretely, of the 33:

  • 1 is production-active today: Reservation — emitted by reservation_block() in pending_jobs()/tag_blocked_pending_reasons() against real reservations (verified live vs Slurm 25.11.6).
  • 5 are wired into the QoS limit check but dormant: QOSMaxCpuPerJobLimit, QOSMaxWallDurationPerJobLimit, QOSMaxMemoryPerJob, QOSMaxCpuPerUserLimit, QOSMaxSubmitJobPerUserLimit. check_qos_limits() returns them, but the only production caller (qos_block_for) passes a default limitless Qos, so they cannot fire until QoS definitions are loaded from spurdbd (Wire QOS definitions into scheduling (replace default QOS) #282/Enforce per-account resource limits during scheduling #283). Today they are exercised only by qos.rs unit tests.
  • 27 are display-string parity only with no emission path: PartitionConfig, SystemFailure, AccountingPolicy, all Assoc* (11), the remaining QOS* (QOSJobLimit, QOSResourceLimit, QOSTimeLimit, all QOSGrp*, QOSMaxNodePerJobLimit), and both BurstBuffer*. No code path assigns these; they are scaffolding so the strings match Slurm when the corresponding enforcement (association limits, group-TRES limits, burst buffer) is implemented.

This PR's behavioral value is therefore the observable wiring + NodeDown/Resources fix + Reservation/Licenses emission, not the full reason set. The remaining variants are intentionally landed ahead of their enforcement so future PRs only add logic, not strings. The emission gap (limits cache + QoS/association/partition enforcement + burst buffer) is tracked in #307.

QoS emission correctness

The QoS limit checks in qos.rs previously returned generic reasons (Resources, PartitionTimeLimit); they now return the specific Slurm reason for each cap (e.g. max wall → QOSMaxWallDurationPerJobLimit, MaxTRESPerJob cpu/mem → QOSMaxCpuPerJobLimit/QOSMaxMemoryPerJob, MaxTRESPerUser cpu → QOSMaxCpuPerUserLimit, max submit-jobs-per-user → QOSMaxSubmitJobPerUserLimit).

Observable wiring

pending_jobs() drops jobs blocked by reservation, license, or QoS limits before update_pending_reasons() runs, so those jobs' pending_reason never reflected the real cause. This PR adds tag_blocked_pending_reasons() — a write-locked scheduler pass mirroring cancel_unsatisfiable_dependency_jobs() — that sets the Reservation/Licenses/QoS reason, and extracts the eligibility checks into shared helpers (reservation_block, license_block, qos_block_for) so the drop decision and the displayed reason cannot diverge.

NodeDown/Resources fix

update_pending_reasons() treated a fully-allocated (busy-but-up) cluster as NodeDown, because it used NodeState::is_available() (Idle|Mixed only). This PR adds NodeState::is_up() (Idle|Mixed|Allocated) and uses it, so a saturated cluster reports Resources — matching Slurm — while only genuine down/drain/error/unknown/suspended states yield NodeDown.

Coordination with #274 (now merged)

This branch was rebased onto current upstream/main after #274 (exit-code reasons) merged. #274's 11 variants (NonZeroExitCode, RaisedSignal, JobLaunchFailure, JobHeldAdmin, BadConstraints, PartitionInactive, DependencyNeverSatisfied, InvalidAccount, InvalidQOS, BootFail, OutOfMemory) and this PR's 33 are disjoint; the only rebase conflict was the shared append point in the PendingReason enum + Display match, resolved by keeping both sets. FrontEndDown/WAIT_FRONT_END was intentionally omitted — front-end mode was removed in Slurm 25.11 and there is no such string to match.

Tests

Unit (spur-core): reason_vocab_display_matches_slurm_25_11 (Display + format for all 33), reason_vocab_serde_roundtrips (JSON round-trip for all 33); qos.rs test_blocked_by_max_wall/max_tres_per_job updated + max_mem_per_job/max_cpu_per_user/max_submit_jobs_per_user added (drive real check_qos_limits). spurctld: fully_allocated_cluster_reports_resources_not_nodedown, tag_blocked_sets_reservation_reason, tag_blocked_sets_licenses_reason, tag_blocked_preserves_held_reason. spur-tests: t21_acctmgr reasons updated; t55_format reason table extended.

Post-rebase results: spur-core 176 / spurctld 131 / spur-tests 360 pass; full workspace 0 fail; cargo clippy --all-targets 0 warnings; cargo fmt --all --check clean.

Live verification vs Slurm 25.11.6 (testbed .145 Slurm / .147 Spur)

4/4 PASS:

  • Busy-but-up cluster waiter: Slurm Resources == Spur Resources (was NodeDown).
  • Inactive/absent reservation: Slurm Reservation == Spur Reservation.
  • Unavailable license: Slurm Licenses == Spur Licenses.
  • Regression: trivial job ran to COMPLETED ExitCode=0 (scheduler refactor safe).
  • QoS: SKIP — not testable (no accounting source for QoS limits, by design).

Known caveats / divergences

  1. QoS reasons are wired but inert: QoS limits are not yet sourced from the accounting DB (qos_block_for uses a default limitless Qos), so the specific QOS* reasons cannot fire until QoS-loading from spurdbd is implemented — see Wire QOS definitions into scheduling (replace default QOS) #282 and Enforce per-account resource limits during scheduling #283. The path is complete so the correct reason surfaces the moment configs exist.
  2. Submit-time validation divergence (pre-existing, not addressed here): Slurm rejects an absent reservation / unconfigured license at submit time and never holds the job PENDING; Spur admits the job to PENDING and surfaces the correct reason instead. Spur is a superset — a separate submit-validation parity gap.

Follow-ups

yansun1996 and others added 5 commits June 15, 2026 14:41
Expand PendingReason toward Slurm parity with 33 variants whose display
strings match Slurm 25.11.6 job_reason_string() byte-for-byte: Reservation,
PartitionConfig, SystemFailure, AccountingPolicy, the Assoc* and QOS*
Grp/Max limit families, and BurstBuffer*. Strings reproduce Slurm's casing
(Assoc* vs QOS*). Additions are non-breaking (Display-only flow, no
exhaustive match, serde derive carries them in Raft snapshots).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
QoS limit checks reported generic reasons (Resources, PartitionTimeLimit)
or the wrong QoS reason. Map each cap to the Slurm-specific reason so
scraped output matches Slurm: max_wall -> QosMaxWallDurationPerJobLimit,
max_tres_per_job CPU/Mem -> QosMaxCpuPerJobLimit/QosMaxMemoryPerJob,
max_tres_per_user CPU -> QosMaxCpuPerUserLimit, max_submit_jobs_per_user
-> QosMaxSubmitJobPerUserLimit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Down

Jobs blocked by a reservation, license shortfall, or QoS limit were dropped
from scheduling in pending_jobs() before update_pending_reasons() ran, so
their pending_reason never reflected the real cause (users saw a stale or
generic reason). Add tag_blocked_pending_reasons() — a write-locked scheduler
pass mirroring cancel_unsatisfiable_dependency_jobs() — that sets the
Reservation/Licenses/QoS reason, and extract the eligibility checks into
shared helpers (reservation_block/license_block/qos_block_for) so the drop
decision in pending_jobs() and the displayed reason cannot diverge. QoS still
uses a default (limitless) QoS until configs are sourced from accounting; the
path is wired so the specific QOS* reason surfaces once they are.

Also fix a NodeDown/Resources parity bug: update_pending_reasons treated a
fully-allocated (busy-but-up) cluster as NodeDown because it used
NodeState::is_available() (Idle|Mixed only). Add NodeState::is_up()
(Idle|Mixed|Allocated) and use it, so a saturated cluster reports Resources —
matching live Slurm 25.11.6 — while only genuine down/drain/error states
yield NodeDown.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-claim

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 15, 2026 14:45
@codecov-commenter

codecov-commenter commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.90909% with 40 lines in your changes missing coverage. Please review.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
+ Coverage   65.02%   65.82%   +0.80%     
==========================================
  Files         120      120              
  Lines       32215    32556     +341     
==========================================
+ Hits        20947    21428     +481     
+ Misses      11268    11128     -140     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Expands Spur’s PendingReason vocabulary toward Slurm 25.11 parity and fixes scheduler plumbing so reservation/license/QoS blocks and “busy-but-up” clusters surface the correct user-visible pending reason.

Changes:

  • Added 33 new PendingReason variants with Slurm-matching Display strings and serde/Display tests.
  • Updated QoS limit checks to emit specific QOS* pending reasons instead of generic ones.
  • Fixed reason observability by tagging reservation/license/QoS-blocked jobs before pending_jobs() drops them, and corrected busy-but-up clusters to report Resources (via NodeState::is_up()).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
docs/design/2026-06-12-reason-code-vocab.md Design note documenting Slurm 25.11 string evidence and wiring plan for expanded reason vocabulary.
crates/spurctld/src/scheduler_loop.rs Runs tag_blocked_pending_reasons() each leader scheduling cycle so dropped jobs still get accurate reasons.
crates/spurctld/src/cluster.rs Refactors eligibility checks into shared helpers, tags blocked pending reasons, and fixes NodeDown vs Resources via is_up().
crates/spur-core/src/qos.rs Emits specific Slurm-parity QoS pending reasons and extends unit tests for new QoS reason mapping.
crates/spur-core/src/node.rs Adds NodeState::is_up() to distinguish “busy” from “down”.
crates/spur-core/src/job.rs Adds new PendingReason variants + Display mapping and tests for Display/serde parity.
crates/spur-tests/src/t55_format.rs Extends CLI-format parity test cases for additional pending reasons.
crates/spur-tests/src/t21_acctmgr.rs Updates expected QoS pending reasons to match new specific Slurm-parity reasons.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/spurctld/src/cluster.rs
Comment thread docs/design/2026-06-12-reason-code-vocab.md Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread crates/spurctld/src/cluster.rs
Comment thread docs/design/2026-06-12-reason-code-vocab.md Outdated
- tag_blocked_pending_reasons(): match pending_jobs() drop precedence
  (QoS -> Licenses -> Reservation) so the displayed reason cannot diverge
  from the check that actually removed the job from the schedulable set.
- Remove stray </content> artifact from the design note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@yansun1996

Copy link
Copy Markdown
Member Author

Addressed the Copilot review in 876aa0c:

  • cluster.rs:1416 (reason precedence) — Fixed. tag_blocked_pending_reasons() now evaluates QoS → Licenses → Reservation, matching the order pending_jobs() applies its retain() filters. Previously it tagged Reservation → Licenses → QoS, so a job blocked by multiple constraints could display a reason different from the check that actually dropped it. The drop decision and displayed reason now agree.
  • design doc:189 (stray </content>) — Fixed. Removed the accidental artifact left between the original Verification section and the follow-up Update section.

No other comments; nothing rejected. cargo build/test (touched crates)/clippy --all-targets (0 warnings)/fmt --check all clean.

yansun1996 and others added 5 commits June 15, 2026 15:13
…dency

pending_jobs() drops jobs in order Dependency -> QoS -> Licenses ->
Reservation, and cancel_unsatisfiable_dependency_jobs() tags waiting jobs
with PendingReason::Dependency one step earlier in the scheduler loop.
tag_blocked_pending_reasons() only evaluated QoS/Licenses/Reservation, so a
job blocked by both a dependency and one of those would have its Dependency
reason overwritten by a lower-precedence reason -- contradicting the
invariant that the displayed reason matches the drop decision.

Re-evaluate dependency first in the precedence chain (mirroring pending_jobs)
so it can no longer be clobbered. Addresses Copilot review point (1) on ROCm#301.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Condense multi-line comment blocks added for the pending-reason work into
concise form (tag_blocked_pending_reasons doc + inline notes, is_up,
reservation_block/qos_block_for docs, scheduler_loop call site, REASON_VOCAB
doc). No logic change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The variant name plus the byte-exact string in display() are self-describing;
the WAIT_*/FAIL_* mapping lives in the design doc. Remove the per-variant doc
lines and repeated section headers in the enum, Display match, and test arrays.
No logic change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The design note was a working artifact; the rationale lives in commit
messages, code comments, and the PR description. Remove it so the PR
carries only code and tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster-wide license accounting used a mutable pool that was decremented on
job start and re-added on completion. That design had three live failures:

1. Drift: completion added back via entry().or_insert(0)+=, which creates
   keys and never saturates, while start used get_mut()+saturating_sub. Returns
   could exceed subtracts, so the pool grew without bound across a controller's
   life and accumulated phantom licenses.
2. Config ignored after first snapshot: the pool was serialized into the Raft
   snapshot and restored on startup, overwriting the value loaded from config.
   Once a (drifted) snapshot existed, edits to the configured totals had no
   effect, and enforcement keyed off garbage.
3. Within-pass over-subscription: pending_jobs() filtered licenses with a
   stateless retain(), so multiple pending jobs in one scheduler pass each
   checked against the full pool and could collectively exceed it.

Replace the mutable pool with derived availability: the configured totals are
immutable, and current usage is computed on demand from jobs in the active set
(Running/Suspended/Completing). available = total - in_use. This is restart-safe
(usage is recomputed from the restored jobs), cannot drift, and keeps config
authoritative. pending_jobs() now reserves licenses incrementally in priority
order so a single pass cannot over-subscribe. The JobStart subtract, the
completion return, and the snapshot restore of the pool are removed.

Also fix double-counting: sbatch/srun pushed `-L` licenses into GRES while the
controller (proto_to_job_spec) already folds the dedicated `licenses` field into
GRES, so each license was counted twice. The CLI no longer pushes them.

Tests: running_job_license_consumption_blocks_next_job (cross-pass),
pending_jobs_does_not_overallocate_licenses_within_one_pass (within-pass),
terminal_job_frees_its_licenses_without_drifting_total (free on completion,
total never mutated). Verified live on the testbed: dedup, over-request,
cross-tick, and within-tick all enforce correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

parity(slurm): close reason-code vocabulary gap — expand PendingReason 14→~50 and wire reservation/license/QoS reasons

3 participants