feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons#301
Open
yansun1996 wants to merge 11 commits into
Open
feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons#301yansun1996 wants to merge 11 commits into
yansun1996 wants to merge 11 commits into
Conversation
Expand PendingReason toward Slurm parity with 33 variants whose display strings match Slurm 25.11.6 job_reason_string() byte-for-byte: Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the Assoc* and QOS* Grp/Max limit families, and BurstBuffer*. Strings reproduce Slurm's casing (Assoc* vs QOS*). Additions are non-breaking (Display-only flow, no exhaustive match, serde derive carries them in Raft snapshots). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
QoS limit checks reported generic reasons (Resources, PartitionTimeLimit) or the wrong QoS reason. Map each cap to the Slurm-specific reason so scraped output matches Slurm: max_wall -> QosMaxWallDurationPerJobLimit, max_tres_per_job CPU/Mem -> QosMaxCpuPerJobLimit/QosMaxMemoryPerJob, max_tres_per_user CPU -> QosMaxCpuPerUserLimit, max_submit_jobs_per_user -> QosMaxSubmitJobPerUserLimit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Down Jobs blocked by a reservation, license shortfall, or QoS limit were dropped from scheduling in pending_jobs() before update_pending_reasons() ran, so their pending_reason never reflected the real cause (users saw a stale or generic reason). Add tag_blocked_pending_reasons() — a write-locked scheduler pass mirroring cancel_unsatisfiable_dependency_jobs() — that sets the Reservation/Licenses/QoS reason, and extract the eligibility checks into shared helpers (reservation_block/license_block/qos_block_for) so the drop decision in pending_jobs() and the displayed reason cannot diverge. QoS still uses a default (limitless) QoS until configs are sourced from accounting; the path is wired so the specific QOS* reason surfaces once they are. Also fix a NodeDown/Resources parity bug: update_pending_reasons treated a fully-allocated (busy-but-up) cluster as NodeDown because it used NodeState::is_available() (Idle|Mixed only). Add NodeState::is_up() (Idle|Mixed|Allocated) and use it, so a saturated cluster reports Resources — matching live Slurm 25.11.6 — while only genuine down/drain/error states yield NodeDown. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-claim Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #301 +/- ##
==========================================
+ Coverage 65.02% 65.82% +0.80%
==========================================
Files 120 120
Lines 32215 32556 +341
==========================================
+ Hits 20947 21428 +481
+ Misses 11268 11128 -140 🚀 New features to boost your workflow:
|
Contributor
There was a problem hiding this comment.
Pull request overview
Expands Spur’s PendingReason vocabulary toward Slurm 25.11 parity and fixes scheduler plumbing so reservation/license/QoS blocks and “busy-but-up” clusters surface the correct user-visible pending reason.
Changes:
- Added 33 new
PendingReasonvariants with Slurm-matching Display strings and serde/Display tests. - Updated QoS limit checks to emit specific
QOS*pending reasons instead of generic ones. - Fixed reason observability by tagging reservation/license/QoS-blocked jobs before
pending_jobs()drops them, and corrected busy-but-up clusters to reportResources(viaNodeState::is_up()).
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/design/2026-06-12-reason-code-vocab.md | Design note documenting Slurm 25.11 string evidence and wiring plan for expanded reason vocabulary. |
| crates/spurctld/src/scheduler_loop.rs | Runs tag_blocked_pending_reasons() each leader scheduling cycle so dropped jobs still get accurate reasons. |
| crates/spurctld/src/cluster.rs | Refactors eligibility checks into shared helpers, tags blocked pending reasons, and fixes NodeDown vs Resources via is_up(). |
| crates/spur-core/src/qos.rs | Emits specific Slurm-parity QoS pending reasons and extends unit tests for new QoS reason mapping. |
| crates/spur-core/src/node.rs | Adds NodeState::is_up() to distinguish “busy” from “down”. |
| crates/spur-core/src/job.rs | Adds new PendingReason variants + Display mapping and tests for Display/serde parity. |
| crates/spur-tests/src/t55_format.rs | Extends CLI-format parity test cases for additional pending reasons. |
| crates/spur-tests/src/t21_acctmgr.rs | Updates expected QoS pending reasons to match new specific Slurm-parity reasons. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- tag_blocked_pending_reasons(): match pending_jobs() drop precedence (QoS -> Licenses -> Reservation) so the displayed reason cannot diverge from the check that actually removed the job from the schedulable set. - Remove stray </content> artifact from the design note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Member
Author
|
Addressed the Copilot review in 876aa0c:
No other comments; nothing rejected. |
…dency pending_jobs() drops jobs in order Dependency -> QoS -> Licenses -> Reservation, and cancel_unsatisfiable_dependency_jobs() tags waiting jobs with PendingReason::Dependency one step earlier in the scheduler loop. tag_blocked_pending_reasons() only evaluated QoS/Licenses/Reservation, so a job blocked by both a dependency and one of those would have its Dependency reason overwritten by a lower-precedence reason -- contradicting the invariant that the displayed reason matches the drop decision. Re-evaluate dependency first in the precedence chain (mirroring pending_jobs) so it can no longer be clobbered. Addresses Copilot review point (1) on ROCm#301. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Condense multi-line comment blocks added for the pending-reason work into concise form (tag_blocked_pending_reasons doc + inline notes, is_up, reservation_block/qos_block_for docs, scheduler_loop call site, REASON_VOCAB doc). No logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The variant name plus the byte-exact string in display() are self-describing; the WAIT_*/FAIL_* mapping lives in the design doc. Remove the per-variant doc lines and repeated section headers in the enum, Display match, and test arrays. No logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The design note was a working artifact; the rationale lives in commit messages, code comments, and the PR description. Remove it so the PR carries only code and tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster-wide license accounting used a mutable pool that was decremented on job start and re-added on completion. That design had three live failures: 1. Drift: completion added back via entry().or_insert(0)+=, which creates keys and never saturates, while start used get_mut()+saturating_sub. Returns could exceed subtracts, so the pool grew without bound across a controller's life and accumulated phantom licenses. 2. Config ignored after first snapshot: the pool was serialized into the Raft snapshot and restored on startup, overwriting the value loaded from config. Once a (drifted) snapshot existed, edits to the configured totals had no effect, and enforcement keyed off garbage. 3. Within-pass over-subscription: pending_jobs() filtered licenses with a stateless retain(), so multiple pending jobs in one scheduler pass each checked against the full pool and could collectively exceed it. Replace the mutable pool with derived availability: the configured totals are immutable, and current usage is computed on demand from jobs in the active set (Running/Suspended/Completing). available = total - in_use. This is restart-safe (usage is recomputed from the restored jobs), cannot drift, and keeps config authoritative. pending_jobs() now reserves licenses incrementally in priority order so a single pass cannot over-subscribe. The JobStart subtract, the completion return, and the snapshot restore of the pool are removed. Also fix double-counting: sbatch/srun pushed `-L` licenses into GRES while the controller (proto_to_job_spec) already folds the dedicated `licenses` field into GRES, so each license was counted twice. The CLI no longer pushes them. Tests: running_job_license_consumption_blocks_next_job (cross-pass), pending_jobs_does_not_overallocate_licenses_within_one_pass (within-pass), terminal_job_frees_its_licenses_without_drifting_total (free on completion, total never mutated). Verified live on the testbed: dedup, over-request, cross-tick, and within-tick all enforce correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #300.
This PR closes the Slurm-parity gap where Spur shipped only ~14
PendingReasonvalues against Slurm's ~50, and where the reasons Spur did compute often never reached the user. It has two parts: vocabulary, and the wiring that makes that vocabulary observable.Vocabulary
Adds 33
PendingReasonvariants whose display strings match Slurm 25.11.6'sjob_reason_string()(tablejsra[]insrc/common/job_state_reason.c) byte-for-byte — these strings are scraped by CI gates and workflow engines, so exact parity matters. The set covers Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the fullAssoc*andQOS*Grp/Max limit families, andBurstBuffer*. Slurm's deliberate casing inconsistency (QOS*uppercase vsAssoc*/Association*mixed) is reproduced exactly.Emission status (important)
Most of these are vocabulary only at this stage — the variant and its exact Slurm string exist so external tooling recognizes the reason, but Spur does not yet compute it. Concretely, of the 33:
Reservation— emitted byreservation_block()inpending_jobs()/tag_blocked_pending_reasons()against real reservations (verified live vs Slurm 25.11.6).QOSMaxCpuPerJobLimit,QOSMaxWallDurationPerJobLimit,QOSMaxMemoryPerJob,QOSMaxCpuPerUserLimit,QOSMaxSubmitJobPerUserLimit.check_qos_limits()returns them, but the only production caller (qos_block_for) passes a default limitlessQos, so they cannot fire until QoS definitions are loaded from spurdbd (Wire QOS definitions into scheduling (replace default QOS) #282/Enforce per-account resource limits during scheduling #283). Today they are exercised only byqos.rsunit tests.PartitionConfig,SystemFailure,AccountingPolicy, allAssoc*(11), the remainingQOS*(QOSJobLimit,QOSResourceLimit,QOSTimeLimit, allQOSGrp*,QOSMaxNodePerJobLimit), and bothBurstBuffer*. No code path assigns these; they are scaffolding so the strings match Slurm when the corresponding enforcement (association limits, group-TRES limits, burst buffer) is implemented.This PR's behavioral value is therefore the observable wiring + NodeDown/Resources fix +
Reservation/Licensesemission, not the full reason set. The remaining variants are intentionally landed ahead of their enforcement so future PRs only add logic, not strings. The emission gap (limits cache + QoS/association/partition enforcement + burst buffer) is tracked in #307.QoS emission correctness
The QoS limit checks in
qos.rspreviously returned generic reasons (Resources,PartitionTimeLimit); they now return the specific Slurm reason for each cap (e.g. max wall →QOSMaxWallDurationPerJobLimit, MaxTRESPerJob cpu/mem →QOSMaxCpuPerJobLimit/QOSMaxMemoryPerJob, MaxTRESPerUser cpu →QOSMaxCpuPerUserLimit, max submit-jobs-per-user →QOSMaxSubmitJobPerUserLimit).Observable wiring
pending_jobs()drops jobs blocked by reservation, license, or QoS limits beforeupdate_pending_reasons()runs, so those jobs'pending_reasonnever reflected the real cause. This PR addstag_blocked_pending_reasons()— a write-locked scheduler pass mirroringcancel_unsatisfiable_dependency_jobs()— that sets theReservation/Licenses/QoS reason, and extracts the eligibility checks into shared helpers (reservation_block,license_block,qos_block_for) so the drop decision and the displayed reason cannot diverge.NodeDown/Resources fix
update_pending_reasons()treated a fully-allocated (busy-but-up) cluster asNodeDown, because it usedNodeState::is_available()(Idle|Mixed only). This PR addsNodeState::is_up()(Idle|Mixed|Allocated) and uses it, so a saturated cluster reportsResources— matching Slurm — while only genuine down/drain/error/unknown/suspended states yieldNodeDown.Coordination with #274 (now merged)
This branch was rebased onto current
upstream/mainafter #274 (exit-code reasons) merged. #274's 11 variants (NonZeroExitCode, RaisedSignal, JobLaunchFailure, JobHeldAdmin, BadConstraints, PartitionInactive, DependencyNeverSatisfied, InvalidAccount, InvalidQOS, BootFail, OutOfMemory) and this PR's 33 are disjoint; the only rebase conflict was the shared append point in thePendingReasonenum + Display match, resolved by keeping both sets.FrontEndDown/WAIT_FRONT_ENDwas intentionally omitted — front-end mode was removed in Slurm 25.11 and there is no such string to match.Tests
Unit (spur-core):
reason_vocab_display_matches_slurm_25_11(Display + format for all 33),reason_vocab_serde_roundtrips(JSON round-trip for all 33);qos.rstest_blocked_by_max_wall/max_tres_per_jobupdated +max_mem_per_job/max_cpu_per_user/max_submit_jobs_per_useradded (drive realcheck_qos_limits). spurctld:fully_allocated_cluster_reports_resources_not_nodedown,tag_blocked_sets_reservation_reason,tag_blocked_sets_licenses_reason,tag_blocked_preserves_held_reason. spur-tests:t21_acctmgrreasons updated;t55_formatreason table extended.Post-rebase results: spur-core 176 / spurctld 131 / spur-tests 360 pass; full workspace 0 fail;
cargo clippy --all-targets0 warnings;cargo fmt --all --checkclean.Live verification vs Slurm 25.11.6 (testbed .145 Slurm / .147 Spur)
4/4 PASS:
Resources== SpurResources(wasNodeDown).Reservation== SpurReservation.Licenses== SpurLicenses.COMPLETEDExitCode=0 (scheduler refactor safe).Known caveats / divergences
qos_block_foruses a default limitlessQos), so the specificQOS*reasons cannot fire until QoS-loading from spurdbd is implemented — see Wire QOS definitions into scheduling (replace default QOS) #282 and Enforce per-account resource limits during scheduling #283. The path is complete so the correct reason surfaces the moment configs exist.Follow-ups
--reservation/license:GRES to match Slurm's reject-at-submit behavior.BeginTimereason:pending_jobs()drops future-begin_timejobs with no reason (Slurm showsBeginTime) — easy win via the sametag_blocked_pending_reasons()pass.