parity(slurm): close reason-code vocabulary gap — expand PendingReason 14→~50 and wire reservation/license/QoS reasons #300

@yansun1996

Summary

Spur ships only ~14 PendingReason values against Slurm's ~50, and several reasons Spur does compute never reach the user. Three problems:

  1. Vocabulary gap — missing Reservation, PartitionConfig, SystemFailure, AccountingPolicy, the full Assoc*/QOS* Grp/Max limit families, and BurstBuffer*. Workflow engines and CI gates scrape the reason string, so an absent or generic reason breaks them.
  2. Observable-wiring gap — pending_jobs() drops jobs blocked by reservation, license, or QoS limits before update_pending_reasons() runs, so those jobs' pending_reason never reflects the real cause.
  3. NodeDown/Resources bug — a fully-allocated (busy-but-up) cluster mis-reports as NodeDown because reason emission uses NodeState::is_available() (Idle|Mixed only). Slurm reports Resources.
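
The NodeDown/Resources confusion in item 3 can be illustrated with a minimal sketch. The NodeState variants, the two reason functions, and their logic are simplified assumptions for illustration, not Spur's actual code; only is_available() (Idle|Mixed) and the reason strings come from this issue.

```rust
/// Hypothetical, simplified node states; Spur's real enum may have more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    Idle,
    Mixed,
    Allocated,
    Down,
    Drain,
}

impl NodeState {
    /// Node has free capacity right now (Idle | Mixed), as described in the issue.
    pub fn is_available(self) -> bool {
        matches!(self, NodeState::Idle | NodeState::Mixed)
    }

    /// Node is healthy, whether or not it has free capacity.
    pub fn is_up(self) -> bool {
        matches!(
            self,
            NodeState::Idle | NodeState::Mixed | NodeState::Allocated
        )
    }
}

/// Assumed pre-fix behavior: "no available node" is treated as "nodes are down".
pub fn reason_pre_fix(nodes: &[NodeState]) -> &'static str {
    if nodes.iter().any(|n| n.is_available()) {
        "Resources"
    } else {
        "NodeDown"
    }
}

/// Assumed post-fix behavior: NodeDown only when no node is up at all.
pub fn reason_post_fix(nodes: &[NodeState]) -> &'static str {
    if nodes.iter().any(|n| n.is_up()) {
        "Resources"
    } else {
        "NodeDown"
    }
}
```

With every node in Allocated, reason_pre_fix yields NodeDown while reason_post_fix yields Resources, which matches the Slurm behavior described above.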

Versions probed

Version              | Source
Spur main (pre-fix)  | Built from source
Slurm 25.11.6        | job_reason_string() table jsra[] in src/common/job_state_reason.c

Repro

# Busy-but-up cluster — saturate all nodes, then submit one more
spur submit -c <all-cpus> /tmp/sleep.sh   # x2 to fill cluster
jid=$(spur submit -c <all-cpus> /tmp/sleep.sh | grep -oE '[0-9]+' | head -1)
spur show job $jid | grep Reason
# Spur → Reason=NodeDown     ← wrong
# Slurm → Reason=Resources

# Absent/inactive reservation
jid=$(spur submit --reservation=nope /tmp/sleep.sh | grep -oE '[0-9]+' | head -1)
spur show job $jid | grep Reason
# Spur → (generic / dropped)   ← reason never surfaces
# Slurm → Reason=Reservation

Expected behavior (Slurm parity)

  1. Add the missing PendingReason variants with display strings byte-exact to Slurm 25.11.6's job_reason_string() (including Slurm's deliberate casing inconsistency — QOS* upper vs Assoc*/Association* mixed).
  2. QoS limit checks emit the specific reason per cap (max wall → QOSMaxWallDurationPerJobLimit, MaxTRESPerJob cpu/mem → QOSMaxCpuPerJobLimit/QOSMaxMemoryPerJob, etc.) instead of generic Resources/PartitionTimeLimit.
  3. A scheduler pass tags reservation/license/QoS-blocked jobs with the real reason before they are dropped from scheduling, so the drop decision and the displayed reason cannot diverge.
  4. Add NodeState::is_up() (Idle|Mixed|Allocated) and use it in reason emission so a saturated cluster reports Resources, while only genuine down/drain/error/unknown/suspended yields NodeDown.
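
Items 2 and 3 above could look roughly like the following sketch, in which one shared helper decides both whether a job is dropped and which reason it gets. The helper names are taken from the scope list in this issue, but the Job and QosLimits types, all field names, and every signature are assumptions for illustration. The Licenses variant name is also assumed, and Display strings are omitted here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    Reservation,
    Licenses, // assumed variant name for license-blocked jobs
    QosMaxWallDurationPerJobLimit,
    QosMaxCpuPerJobLimit,
}

/// Hypothetical job record; the boolean fields stand in for real lookups.
#[derive(Debug, Clone)]
pub struct Job {
    pub id: u32,
    pub reservation_ok: bool,
    pub licenses_ok: bool,
    pub wall_minutes: u32,
    pub cpus: u32,
    pub pending_reason: Option<PendingReason>,
}

/// Hypothetical QoS caps; None means "no limit".
pub struct QosLimits {
    pub max_wall_minutes: Option<u32>,
    pub max_cpus_per_job: Option<u32>,
}

pub fn reservation_block(job: &Job) -> Option<PendingReason> {
    (!job.reservation_ok).then_some(PendingReason::Reservation)
}

pub fn license_block(job: &Job) -> Option<PendingReason> {
    (!job.licenses_ok).then_some(PendingReason::Licenses)
}

/// Returns the specific reason for the first violated cap, not a generic one.
pub fn qos_block_for(job: &Job, qos: &QosLimits) -> Option<PendingReason> {
    if qos.max_wall_minutes.map_or(false, |max| job.wall_minutes > max) {
        return Some(PendingReason::QosMaxWallDurationPerJobLimit);
    }
    if qos.max_cpus_per_job.map_or(false, |max| job.cpus > max) {
        return Some(PendingReason::QosMaxCpuPerJobLimit);
    }
    None
}

/// Tags every blocked job with its reason and returns the ids that remain
/// schedulable. Because the same helpers drive both outcomes, the drop
/// decision and the displayed reason cannot diverge.
pub fn tag_blocked_pending_reasons(jobs: &mut [Job], qos: &QosLimits) -> Vec<u32> {
    let mut schedulable = Vec::new();
    for job in jobs.iter_mut() {
        let reason = reservation_block(job)
            .or_else(|| license_block(job))
            .or_else(|| qos_block_for(job, qos));
        match reason {
            Some(r) => job.pending_reason = Some(r),
            None => schedulable.push(job.id),
        }
    }
    schedulable
}
```

The check order (reservation, then license, then QoS) is an arbitrary choice in this sketch; the order Slurm uses is not stated in this issue.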

Scope / size

Single-PR change. Estimated M (200–600 LOC):

  • spur-core/src/job.rs — new PendingReason variants + Display/serde
  • spur-core/src/qos.rs — specific QoS limit reasons
  • spur-core/src/node.rs — NodeState::is_up()
  • spurctld (cluster.rs, scheduler_loop.rs) — tag_blocked_pending_reasons() pass + shared eligibility helpers (reservation_block, license_block, qos_block_for); NodeDown→Resources fix
  • Tests: Display/serde for all new variants, qos limit reasons, tag-blocked passes, fully_allocated_cluster_reports_resources_not_nodedown
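
As a rough idea of what the Display tests in the last bullet might check, here is a std-only sketch that round-trips a handful of variants through their display strings. The variant subset, the ALL constant, and the FromStr implementation are assumptions for illustration (the real code would presumably use serde). The strings themselves are the ones quoted in this issue.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    Resources,
    NodeDown,
    Reservation,
    PartitionConfig,
    QosMaxCpuPerJobLimit,
    QosMaxMemoryPerJob,
}

impl PendingReason {
    /// Hypothetical helper listing every variant, used for exhaustive tests.
    pub const ALL: [PendingReason; 6] = [
        PendingReason::Resources,
        PendingReason::NodeDown,
        PendingReason::Reservation,
        PendingReason::PartitionConfig,
        PendingReason::QosMaxCpuPerJobLimit,
        PendingReason::QosMaxMemoryPerJob,
    ];

    /// Display string, kept identical to the Slurm spelling quoted in the
    /// issue, including the inconsistent "Limit" suffix on the QOS reasons.
    pub fn as_str(self) -> &'static str {
        match self {
            PendingReason::Resources => "Resources",
            PendingReason::NodeDown => "NodeDown",
            PendingReason::Reservation => "Reservation",
            PendingReason::PartitionConfig => "PartitionConfig",
            PendingReason::QosMaxCpuPerJobLimit => "QOSMaxCpuPerJobLimit",
            PendingReason::QosMaxMemoryPerJob => "QOSMaxMemoryPerJob",
        }
    }
}

impl fmt::Display for PendingReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for PendingReason {
    type Err = String;

    /// Case-sensitive on purpose: scrapers match the exact Slurm string.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        PendingReason::ALL
            .iter()
            .copied()
            .find(|r| r.as_str() == s)
            .ok_or_else(|| format!("unknown pending reason: {s}"))
    }
}
```

Keeping the strings in a single as_str() match means Display and parsing share one table, so a typo in one direction shows up as a failed round trip.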

Priority

P1 — Slurm-visible behavior; Spur mis-reports or drops the real reason, which is worse than the reason simply being absent.

Known follow-ups (out of scope here)

Coordination

Variant set is disjoint from open PR #274 (NonZeroExitCode, RaisedSignal, JobLaunchFailure, JobHeldAdmin, BadConstraints, PartitionInactive, DependencyNeverSatisfied, InvalidAccount, InvalidQOS, BootFail, OutOfMemory) — can merge before or after #274 with no conflict.

Related

Part of broader Category-4 (Job Lifecycle & State Machine) parity work. Siblings: DEADLINE (#258), exit-code (#269), dependency-engine (#259).
