Summary
Spur ships only ~14 PendingReason values against Slurm's ~50, and several reasons Spur does compute never reach the user. Three problems:
- Vocabulary gap: missing Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the full Assoc*/QOS* Grp/Max limit families, and BurstBuffer*. Workflow engines and CI gates scrape the reason string, so an absent or generic reason breaks them.
- Observable-wiring gap: pending_jobs() drops jobs blocked by reservation, license, or QoS limits before update_pending_reasons() runs, so those jobs' pending_reason never reflects the real cause.
- NodeDown/Resources bug: a fully allocated (busy-but-up) cluster is misreported as NodeDown because reason emission uses NodeState::is_available() (Idle|Mixed only). Slurm reports Resources.
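To illustrate why the exact reason string matters to downstream tooling, here is a minimal sketch of the kind of scraping a CI gate might do. It assumes Slurm-style whitespace-separated `Key=Value` tokens in the `show job` output; `pending_reason`, `GateAction`, and the policy in `gate_action` are hypothetical names and choices made up for this example, not part of Spur or Slurm.

```rust
/// Extract the value of the `Reason=` field from Slurm-style `show job` output.
/// Assumption: fields are whitespace-separated `Key=Value` tokens.
pub fn pending_reason(show_job_output: &str) -> Option<&str> {
    show_job_output
        .split_whitespace()
        .find_map(|token| token.strip_prefix("Reason="))
}

/// Hypothetical gate policy: what a pipeline does with a pending job.
#[derive(Debug, PartialEq, Eq)]
pub enum GateAction {
    KeepWaiting,
    FailFast,
    Unknown,
}

/// Map a scraped reason string to an action. The mapping is illustrative only.
pub fn gate_action(reason: Option<&str>) -> GateAction {
    match reason {
        // The cluster is healthy but busy: waiting is reasonable.
        Some("Resources") | Some("Priority") => GateAction::KeepWaiting,
        // Waiting will not help without operator or user intervention.
        Some("Reservation") | Some("NodeDown") => GateAction::FailFast,
        Some(r) if r.starts_with("QOS") || r.starts_with("Assoc") => GateAction::FailFast,
        // Missing or unrecognized reason: the gate cannot decide.
        _ => GateAction::Unknown,
    }
}
```

Under a policy like this, a saturated cluster that reports NodeDown instead of Resources would make the gate fail a job that only needed to wait, and a dropped reason would land in the undecidable branch.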
Versions probed
| | Version | Source |
|---|---|---|
| Spur | main (pre-fix) | Built from source |
| Slurm | 25.11.6 | job_reason_string() table jsra[] in src/common/job_state_reason.c |
Repro
# Busy-but-up cluster — saturate all nodes, then submit one more
spur submit -c <all-cpus> /tmp/sleep.sh # x2 to fill cluster
jid=$(spur submit -c <all-cpus> /tmp/sleep.sh | grep -oE '[0-9]+' | head -1)
spur show job $jid | grep Reason
# Spur → Reason=NodeDown ← wrong
# Slurm → Reason=Resources
# Absent/inactive reservation
jid=$(spur submit --reservation=nope /tmp/sleep.sh | grep -oE '[0-9]+' | head -1)
spur show job $jid | grep Reason
# Spur → (generic / dropped) ← reason never surfaces
# Slurm → Reason=Reservation
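The first repro comes down to which node states count as "up". The sketch below is a simplified model, not Spur's actual code: the `NodeState` variant names are assumed from the states named in this issue, and `reason_pre_fix` / `reason_post_fix` are hypothetical helpers that reduce reason emission to a single predicate over the node list.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    Idle,
    Mixed,
    Allocated,
    Down,
    Drain,
    Error,
    Unknown,
    Suspended,
}

impl NodeState {
    /// Has free capacity right now (Idle|Mixed only, as described in this issue).
    pub fn is_available(self) -> bool {
        matches!(self, NodeState::Idle | NodeState::Mixed)
    }

    /// Healthy, whether or not it has free capacity (Idle|Mixed|Allocated).
    pub fn is_up(self) -> bool {
        matches!(self, NodeState::Idle | NodeState::Mixed | NodeState::Allocated)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum PendingReason {
    Resources,
    NodeDown,
}

/// Simplified pre-fix behavior: no node with free capacity means NodeDown.
pub fn reason_pre_fix(nodes: &[NodeState]) -> PendingReason {
    if nodes.iter().any(|n| n.is_available()) {
        PendingReason::Resources
    } else {
        PendingReason::NodeDown
    }
}

/// Simplified post-fix behavior: NodeDown only when no node is healthy.
pub fn reason_post_fix(nodes: &[NodeState]) -> PendingReason {
    if nodes.iter().any(|n| n.is_up()) {
        PendingReason::Resources
    } else {
        PendingReason::NodeDown
    }
}
```

With every node in `Allocated`, the first function yields NodeDown and the second yields Resources, which matches the Spur and Slurm outputs in the repro.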
Expected behavior (Slurm parity)
- Add the missing PendingReason variants with display strings byte-exact to Slurm 25.11.6's job_reason_string() (including Slurm's deliberate casing inconsistency: QOS* upper vs Assoc*/Association* mixed).
- QoS limit checks emit the specific reason per cap (max wall → QOSMaxWallDurationPerJobLimit, MaxTRESPerJob cpu/mem → QOSMaxCpuPerJobLimit/QOSMaxMemoryPerJob, etc.) instead of generic Resources/PartitionTimeLimit.
- A scheduler pass tags reservation/license/QoS-blocked jobs with the real reason before they are dropped from scheduling, so the drop decision and the displayed reason cannot diverge.
- Add NodeState::is_up() (Idle|Mixed|Allocated) and use it in reason emission so a saturated cluster reports Resources, while only genuinely down/drain/error/unknown/suspended nodes yield NodeDown.
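A minimal sketch of the first two points, under stated assumptions: the Rust variant names (`QosMaxCpuPerJobLimit` and so on), the `QosLimits` and `JobRequest` structs, and the check order in `qos_limit_reason` are all hypothetical. Only the display strings are taken from this issue. Serde is left out to keep the block dependency-free.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    Resources,
    Reservation,
    QosMaxWallDurationPerJobLimit,
    QosMaxCpuPerJobLimit,
    QosMaxMemoryPerJob,
}

impl fmt::Display for PendingReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Strings follow Slurm's spelling, including the upper-case "QOS"
        // prefix and the missing "Limit" suffix on the memory reason.
        let s = match self {
            PendingReason::Resources => "Resources",
            PendingReason::Reservation => "Reservation",
            PendingReason::QosMaxWallDurationPerJobLimit => "QOSMaxWallDurationPerJobLimit",
            PendingReason::QosMaxCpuPerJobLimit => "QOSMaxCpuPerJobLimit",
            PendingReason::QosMaxMemoryPerJob => "QOSMaxMemoryPerJob",
        };
        f.write_str(s)
    }
}

/// Hypothetical per-job QoS caps; `None` means the cap is not set.
pub struct QosLimits {
    pub max_wall_minutes: Option<u32>,
    pub max_cpus_per_job: Option<u32>,
    pub max_mem_mb_per_job: Option<u64>,
}

/// Hypothetical view of what a job asks for.
pub struct JobRequest {
    pub wall_minutes: u32,
    pub cpus: u32,
    pub mem_mb: u64,
}

/// Return the specific reason for the first cap the job exceeds, if any.
/// The wall, cpu, mem order is an assumption made for this sketch.
pub fn qos_limit_reason(job: &JobRequest, qos: &QosLimits) -> Option<PendingReason> {
    if matches!(qos.max_wall_minutes, Some(cap) if job.wall_minutes > cap) {
        return Some(PendingReason::QosMaxWallDurationPerJobLimit);
    }
    if matches!(qos.max_cpus_per_job, Some(cap) if job.cpus > cap) {
        return Some(PendingReason::QosMaxCpuPerJobLimit);
    }
    if matches!(qos.max_mem_mb_per_job, Some(cap) if job.mem_mb > cap) {
        return Some(PendingReason::QosMaxMemoryPerJob);
    }
    None
}
```

Keeping the strings in a single `Display` impl means a table-driven test can compare every variant against Slurm's `jsra[]` entries in one place.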
Scope / size
Single-PR change. Estimated M (200–600 LOC):
- spur-core/src/job.rs: new PendingReason variants + Display/serde
- spur-core/src/qos.rs: specific QoS limit reasons
- spur-core/src/node.rs: NodeState::is_up()
- spurctld (cluster.rs, scheduler_loop.rs): tag_blocked_pending_reasons() pass + shared eligibility helpers (reservation_block, license_block, qos_block_for); NodeDown→Resources fix
- Tests: Display/serde for all new variants, qos limit reasons, tag-blocked passes, fully_allocated_cluster_reports_resources_not_nodedown
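One way the tag-blocked pass and the shared eligibility helpers could fit together is sketched below. The function names come from the list above, but every signature, the `Job` and `ClusterView` structs, the `block_reason` combinator, and the `Licenses` variant name are assumptions made for illustration; the real spurctld types will differ.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    Reservation,
    Licenses, // assumed name for the license-blocked reason
    QosMaxCpuPerJobLimit,
}

/// Hypothetical, heavily reduced job record.
pub struct Job {
    pub id: u32,
    pub reservation: Option<String>,
    pub licenses_needed: u32,
    pub cpus: u32,
    pub pending_reason: Option<PendingReason>,
}

/// Hypothetical snapshot of the cluster state the helpers need.
pub struct ClusterView {
    pub active_reservations: HashSet<String>,
    pub licenses_free: u32,
    pub qos_max_cpus_per_job: Option<u32>,
}

fn reservation_block(job: &Job, c: &ClusterView) -> Option<PendingReason> {
    match &job.reservation {
        Some(name) if !c.active_reservations.contains(name) => Some(PendingReason::Reservation),
        _ => None,
    }
}

fn license_block(job: &Job, c: &ClusterView) -> Option<PendingReason> {
    if job.licenses_needed > c.licenses_free {
        Some(PendingReason::Licenses)
    } else {
        None
    }
}

fn qos_block_for(job: &Job, c: &ClusterView) -> Option<PendingReason> {
    match c.qos_max_cpus_per_job {
        Some(cap) if job.cpus > cap => Some(PendingReason::QosMaxCpuPerJobLimit),
        _ => None,
    }
}

/// Single source of truth: why (if at all) a job is excluded from scheduling.
pub fn block_reason(job: &Job, c: &ClusterView) -> Option<PendingReason> {
    reservation_block(job, c)
        .or_else(|| license_block(job, c))
        .or_else(|| qos_block_for(job, c))
}

/// Record the real reason on every blocked job before it is filtered out.
pub fn tag_blocked_pending_reasons(jobs: &mut [Job], c: &ClusterView) {
    for job in jobs.iter_mut() {
        if let Some(reason) = block_reason(job, c) {
            job.pending_reason = Some(reason);
        }
    }
}

/// The scheduling filter consults the same helper, so it cannot disagree with the tag.
pub fn pending_jobs<'a>(jobs: &'a [Job], c: &ClusterView) -> Vec<&'a Job> {
    jobs.iter().filter(|j| block_reason(j, c).is_none()).collect()
}
```

Because both `tag_blocked_pending_reasons` and `pending_jobs` go through `block_reason`, a job can only be dropped for a cause that was also written to its `pending_reason`.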
Priority
P1: Slurm-visible behavior. Spur misreports or drops the real reason, which is worse than the reason merely being absent.
Known follow-ups (out of scope here)
- --reservation/license:GRES to match Slurm's reject-at-submit behavior (Spur admits to PENDING and surfaces the reason, a superset).
- BeginTime reason: pending_jobs() drops future-begin_time jobs with no reason; easy win via the same tag-blocked pass.
Coordination
Variant set is disjoint from open PR #274 (NonZeroExitCode, RaisedSignal, JobLaunchFailure, JobHeldAdmin, BadConstraints, PartitionInactive, DependencyNeverSatisfied, InvalidAccount, InvalidQOS, BootFail, OutOfMemory); can merge before or after #274 with no conflict.
Related
Part of broader Category-4 (Job Lifecycle & State Machine) parity work. Siblings: DEADLINE (#258), exit-code (#269), dependency-engine (#259).