feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons by yansun1996 · Pull Request #301 · ROCm/spur

yansun1996 · 2026-06-15T14:45:06Z

Closes #300.

This PR closes the Slurm-parity gap where Spur shipped only ~14 PendingReason values against Slurm's ~50, and where the reasons Spur did compute often never reached the user. It has two parts: vocabulary, and the wiring that makes that vocabulary observable.

Vocabulary

Adds 33 PendingReason variants whose display strings match Slurm 25.11.6's job_reason_string() (table jsra[] in src/common/job_state_reason.c) byte-for-byte — these strings are scraped by CI gates and workflow engines, so exact parity matters. The set covers Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the full Assoc* and QOS* Grp/Max limit families, and BurstBuffer*. Slurm's deliberate casing inconsistency (QOS* uppercase vs Assoc*/Association* mixed) is reproduced exactly.

Emission status (important)

Most of these are vocabulary only at this stage — the variant and its exact Slurm string exist so external tooling recognizes the reason, but Spur does not yet compute it. Concretely, of the 33:

1 is production-active today: Reservation — emitted by reservation_block() in pending_jobs()/tag_blocked_pending_reasons() against real reservations (verified live vs Slurm 25.11.6).
5 are wired into the QoS limit check but dormant: QOSMaxCpuPerJobLimit, QOSMaxWallDurationPerJobLimit, QOSMaxMemoryPerJob, QOSMaxCpuPerUserLimit, QOSMaxSubmitJobPerUserLimit. check_qos_limits() returns them, but the only production caller (qos_block_for) passes a default limitless Qos, so they cannot fire until QoS definitions are loaded from spurdbd (Wire QOS definitions into scheduling (replace default QOS) #282/Enforce per-account resource limits during scheduling #283). Today they are exercised only by qos.rs unit tests.
27 are display-string parity only with no emission path: PartitionConfig, SystemFailure, AccountingPolicy, all Assoc* (11), the remaining QOS* (QOSJobLimit, QOSResourceLimit, QOSTimeLimit, all QOSGrp*, QOSMaxNodePerJobLimit), and both BurstBuffer*. No code path assigns these; they are scaffolding so the strings match Slurm when the corresponding enforcement (association limits, group-TRES limits, burst buffer) is implemented.

This PR's behavioral value is therefore the observable wiring + NodeDown/Resources fix + Reservation/Licenses emission, not the full reason set. The remaining variants are intentionally landed ahead of their enforcement so future PRs only add logic, not strings. The emission gap (limits cache + QoS/association/partition enforcement + burst buffer) is tracked in #307.

QoS emission correctness

The QoS limit checks in qos.rs previously returned generic reasons (Resources, PartitionTimeLimit); they now return the specific Slurm reason for each cap (e.g. max wall → QOSMaxWallDurationPerJobLimit, MaxTRESPerJob cpu/mem → QOSMaxCpuPerJobLimit/QOSMaxMemoryPerJob, MaxTRESPerUser cpu → QOSMaxCpuPerUserLimit, max submit-jobs-per-user → QOSMaxSubmitJobPerUserLimit).

Observable wiring

pending_jobs() drops jobs blocked by reservation, license, or QoS limits before update_pending_reasons() runs, so those jobs' pending_reason never reflected the real cause. This PR adds tag_blocked_pending_reasons() — a write-locked scheduler pass mirroring cancel_unsatisfiable_dependency_jobs() — that sets the Reservation/Licenses/QoS reason, and extracts the eligibility checks into shared helpers (reservation_block, license_block, qos_block_for) so the drop decision and the displayed reason cannot diverge.

NodeDown/Resources fix

update_pending_reasons() treated a fully-allocated (busy-but-up) cluster as NodeDown, because it used NodeState::is_available() (Idle|Mixed only). This PR adds NodeState::is_up() (Idle|Mixed|Allocated) and uses it, so a saturated cluster reports Resources — matching Slurm — while only genuine down/drain/error/unknown/suspended states yield NodeDown.

Coordination with #274 (now merged)

This branch was rebased onto current upstream/main after #274 (exit-code reasons) merged. #274's 11 variants (NonZeroExitCode, RaisedSignal, JobLaunchFailure, JobHeldAdmin, BadConstraints, PartitionInactive, DependencyNeverSatisfied, InvalidAccount, InvalidQOS, BootFail, OutOfMemory) and this PR's 33 are disjoint; the only rebase conflict was the shared append point in the PendingReason enum + Display match, resolved by keeping both sets. FrontEndDown/WAIT_FRONT_END was intentionally omitted — front-end mode was removed in Slurm 25.11 and there is no such string to match.

Tests

Unit (spur-core): reason_vocab_display_matches_slurm_25_11 (Display + format for all 33), reason_vocab_serde_roundtrips (JSON round-trip for all 33); qos.rs test_blocked_by_max_wall/max_tres_per_job updated + max_mem_per_job/max_cpu_per_user/max_submit_jobs_per_user added (drive real check_qos_limits). spurctld: fully_allocated_cluster_reports_resources_not_nodedown, tag_blocked_sets_reservation_reason, tag_blocked_sets_licenses_reason, tag_blocked_preserves_held_reason. spur-tests: t21_acctmgr reasons updated; t55_format reason table extended.

Post-rebase results: spur-core 176 / spurctld 131 / spur-tests 360 pass; full workspace 0 fail; cargo clippy --all-targets 0 warnings; cargo fmt --all --check clean.

Live verification vs Slurm 25.11.6 (testbed .145 Slurm / .147 Spur)

4/4 PASS:

Busy-but-up cluster waiter: Slurm Resources == Spur Resources (was NodeDown).
Inactive/absent reservation: Slurm Reservation == Spur Reservation.
Unavailable license: Slurm Licenses == Spur Licenses.
Regression: trivial job ran to COMPLETED ExitCode=0 (scheduler refactor safe).
QoS: SKIP — not testable (no accounting source for QoS limits, by design).

Known caveats / divergences

QoS reasons are wired but inert: QoS limits are not yet sourced from the accounting DB (qos_block_for uses a default limitless Qos), so the specific QOS* reasons cannot fire until QoS-loading from spurdbd is implemented — see Wire QOS definitions into scheduling (replace default QOS) #282 and Enforce per-account resource limits during scheduling #283. The path is complete so the correct reason surfaces the moment configs exist.
Submit-time validation divergence (pre-existing, not addressed here): Slurm rejects an absent reservation / unconfigured license at submit time and never holds the job PENDING; Spur admits the job to PENDING and surfaces the correct reason instead. Spur is a superset — a separate submit-validation parity gap.

Follow-ups

Emission logic for the vocabulary-only reasons (associations, QoS group limits, partition config, burst buffer) — tracked in parity(slurm): wire emission logic for pending-reason vocabulary landed in #301 #307.
QoS limit loading from spurdbd → makes the wired QoS reasons live end-to-end (Wire QOS definitions into scheduling (replace default QOS) #282, Enforce per-account resource limits during scheduling #283).
Submit-time validation of --reservation / license: GRES to match Slurm's reject-at-submit behavior.
BeginTime reason: pending_jobs() drops future-begin_time jobs with no reason (Slurm shows BeginTime) — easy win via the same tag_blocked_pending_reasons() pass.

Expand PendingReason toward Slurm parity with 33 variants whose display strings match Slurm 25.11.6 job_reason_string() byte-for-byte: Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the Assoc* and QOS* Grp/Max limit families, and BurstBuffer*. Strings reproduce Slurm's casing (Assoc* vs QOS*). Additions are non-breaking (Display-only flow, no exhaustive match, serde derive carries them in Raft snapshots). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

QoS limit checks reported generic reasons (Resources, PartitionTimeLimit) or the wrong QoS reason. Map each cap to the Slurm-specific reason so scraped output matches Slurm: max_wall -> QosMaxWallDurationPerJobLimit, max_tres_per_job CPU/Mem -> QosMaxCpuPerJobLimit/QosMaxMemoryPerJob, max_tres_per_user CPU -> QosMaxCpuPerUserLimit, max_submit_jobs_per_user -> QosMaxSubmitJobPerUserLimit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…Down Jobs blocked by a reservation, license shortfall, or QoS limit were dropped from scheduling in pending_jobs() before update_pending_reasons() ran, so their pending_reason never reflected the real cause (users saw a stale or generic reason). Add tag_blocked_pending_reasons() — a write-locked scheduler pass mirroring cancel_unsatisfiable_dependency_jobs() — that sets the Reservation/Licenses/QoS reason, and extract the eligibility checks into shared helpers (reservation_block/license_block/qos_block_for) so the drop decision in pending_jobs() and the displayed reason cannot diverge. QoS still uses a default (limitless) QoS until configs are sourced from accounting; the path is wired so the specific QOS* reason surfaces once they are. Also fix a NodeDown/Resources parity bug: update_pending_reasons treated a fully-allocated (busy-but-up) cluster as NodeDown because it used NodeState::is_available() (Idle|Mixed only). Add NodeState::is_up() (Idle|Mixed|Allocated) and use it, so a saturated cluster reports Resources — matching live Slurm 25.11.6 — while only genuine down/drain/error states yield NodeDown. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-claim Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov-commenter · 2026-06-15T14:48:38Z

Codecov Report

❌ Patch coverage is 90.90909% with 40 lines in your changes missing coverage. Please review.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
+ Coverage   65.02%   65.82%   +0.80%     
==========================================
  Files         120      120              
  Lines       32215    32556     +341     
==========================================
+ Hits        20947    21428     +481     
+ Misses      11268    11128     -140

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Expands Spur’s PendingReason vocabulary toward Slurm 25.11 parity and fixes scheduler plumbing so reservation/license/QoS blocks and “busy-but-up” clusters surface the correct user-visible pending reason.

Changes:

Added 33 new PendingReason variants with Slurm-matching Display strings and serde/Display tests.
Updated QoS limit checks to emit specific QOS* pending reasons instead of generic ones.
Fixed reason observability by tagging reservation/license/QoS-blocked jobs before pending_jobs() drops them, and corrected busy-but-up clusters to report Resources (via NodeState::is_up()).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
docs/design/2026-06-12-reason-code-vocab.md	Design note documenting Slurm 25.11 string evidence and wiring plan for expanded reason vocabulary.
crates/spurctld/src/scheduler_loop.rs	Runs `tag_blocked_pending_reasons()` each leader scheduling cycle so dropped jobs still get accurate reasons.
crates/spurctld/src/cluster.rs	Refactors eligibility checks into shared helpers, tags blocked pending reasons, and fixes `NodeDown` vs `Resources` via `is_up()`.
crates/spur-core/src/qos.rs	Emits specific Slurm-parity QoS pending reasons and extends unit tests for new QoS reason mapping.
crates/spur-core/src/node.rs	Adds `NodeState::is_up()` to distinguish “busy” from “down”.
crates/spur-core/src/job.rs	Adds new `PendingReason` variants + Display mapping and tests for Display/serde parity.
crates/spur-tests/src/t55_format.rs	Extends CLI-format parity test cases for additional pending reasons.
crates/spur-tests/src/t21_acctmgr.rs	Updates expected QoS pending reasons to match new specific Slurm-parity reasons.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

- tag_blocked_pending_reasons(): match pending_jobs() drop precedence (QoS -> Licenses -> Reservation) so the displayed reason cannot diverge from the check that actually removed the job from the schedulable set. - Remove stray </content> artifact from the design note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

yansun1996 · 2026-06-15T14:54:29Z

Addressed the Copilot review in 876aa0c:

cluster.rs:1416 (reason precedence) — Fixed. tag_blocked_pending_reasons() now evaluates QoS → Licenses → Reservation, matching the order pending_jobs() applies its retain() filters. Previously it tagged Reservation → Licenses → QoS, so a job blocked by multiple constraints could display a reason different from the check that actually dropped it. The drop decision and displayed reason now agree.
design doc:189 (stray </content>) — Fixed. Removed the accidental artifact left between the original Verification section and the follow-up Update section.

No other comments; nothing rejected. cargo build/test (touched crates)/clippy --all-targets (0 warnings)/fmt --check all clean.

…dency pending_jobs() drops jobs in order Dependency -> QoS -> Licenses -> Reservation, and cancel_unsatisfiable_dependency_jobs() tags waiting jobs with PendingReason::Dependency one step earlier in the scheduler loop. tag_blocked_pending_reasons() only evaluated QoS/Licenses/Reservation, so a job blocked by both a dependency and one of those would have its Dependency reason overwritten by a lower-precedence reason -- contradicting the invariant that the displayed reason matches the drop decision. Re-evaluate dependency first in the precedence chain (mirroring pending_jobs) so it can no longer be clobbered. Addresses Copilot review point (1) on ROCm#301. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Condense multi-line comment blocks added for the pending-reason work into concise form (tag_blocked_pending_reasons doc + inline notes, is_up, reservation_block/qos_block_for docs, scheduler_loop call site, REASON_VOCAB doc). No logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The variant name plus the byte-exact string in display() are self-describing; the WAIT_*/FAIL_* mapping lives in the design doc. Remove the per-variant doc lines and repeated section headers in the enum, Display match, and test arrays. No logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The design note was a working artifact; the rationale lives in commit messages, code comments, and the PR description. Remove it so the PR carries only code and tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Cluster-wide license accounting used a mutable pool that was decremented on job start and re-added on completion. That design had three live failures: 1. Drift: completion added back via entry().or_insert(0)+=, which creates keys and never saturates, while start used get_mut()+saturating_sub. Returns could exceed subtracts, so the pool grew without bound across a controller's life and accumulated phantom licenses. 2. Config ignored after first snapshot: the pool was serialized into the Raft snapshot and restored on startup, overwriting the value loaded from config. Once a (drifted) snapshot existed, edits to the configured totals had no effect, and enforcement keyed off garbage. 3. Within-pass over-subscription: pending_jobs() filtered licenses with a stateless retain(), so multiple pending jobs in one scheduler pass each checked against the full pool and could collectively exceed it. Replace the mutable pool with derived availability: the configured totals are immutable, and current usage is computed on demand from jobs in the active set (Running/Suspended/Completing). available = total - in_use. This is restart-safe (usage is recomputed from the restored jobs), cannot drift, and keeps config authoritative. pending_jobs() now reserves licenses incrementally in priority order so a single pass cannot over-subscribe. The JobStart subtract, the completion return, and the snapshot restore of the pool are removed. Also fix double-counting: sbatch/srun pushed `-L` licenses into GRES while the controller (proto_to_job_spec) already folds the dedicated `licenses` field into GRES, so each license was counted twice. The CLI no longer pushes them. Tests: running_job_license_consumption_blocks_next_job (cross-pass), pending_jobs_does_not_overallocate_licenses_within_one_pass (within-pass), terminal_job_frees_its_licenses_without_drifting_total (free on completion, total never mutated). Verified live on the testbed: dedup, over-request, cross-tick, and within-tick all enforce correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

yansun1996 and others added 5 commits June 15, 2026 14:41

docs(spur-core): add reason-code vocabulary parity design note

f5ee1e8

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(spur-core): document reason emission wiring and correct QoS over…

6a883d5

…-claim Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 15, 2026 14:45

yansun1996 requested review from powderluv, sgopinath1 and shiv-tyagi as code owners June 15, 2026 14:45

Copilot started reviewing on behalf of yansun1996 June 15, 2026 14:46 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread crates/spurctld/src/cluster.rs

Comment thread docs/design/2026-06-12-reason-code-vocab.md Outdated

yansun1996 requested a review from Copilot June 15, 2026 14:49

Copilot started reviewing on behalf of yansun1996 June 15, 2026 14:50 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread crates/spurctld/src/cluster.rs

Comment thread docs/design/2026-06-12-reason-code-vocab.md Outdated

yansun1996 and others added 5 commits June 15, 2026 15:13

docs: drop internal reason-code design note from PR

2a24e62

The design note was a working artifact; the rationale lives in commit messages, code comments, and the PR description. Remove it so the PR carries only code and tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

yansun1996 mentioned this pull request Jun 15, 2026

parity(slurm): wire emission logic for pending-reason vocabulary landed in #301 #307

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons#301

feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons#301
yansun1996 wants to merge 11 commits into
ROCm:mainfrom
yansun1996:parity/reason-code-vocab

yansun1996 commented Jun 15, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

yansun1996 commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yansun1996 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Vocabulary

Emission status (important)

QoS emission correctness

Observable wiring

NodeDown/Resources fix

Coordination with #274 (now merged)

Tests

Live verification vs Slurm 25.11.6 (testbed .145 Slurm / .147 Spur)

Known caveats / divergences

Follow-ups

Uh oh!

codecov-commenter commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

yansun1996 commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yansun1996 commented Jun 15, 2026 •

edited

Loading

codecov-commenter commented Jun 15, 2026 •

edited

Loading