Summary
Spur has no OUT_OF_MEMORY job state. When a job is OOM-killed, it currently reports as a generic Failed (or, on the bare-metal path, as a SIGKILL with no disambiguation). Slurm, by contrast, reports JobState=OUT_OF_MEMORY with Reason=OutOfMemory.
This is part of the broader Category-4 (job lifecycle & state machine) parity effort. It is not purely a job-state change — the detection lives in the node agent (spurd) cgroup plumbing and must be threaded back to the controller.
Current state
- spurd already builds a per-job cgroup-v2 and sets memory.max / memory.oom.group (crates/spurd/src/executor.rs::setup_cgroup); the monitor captures the cgroup path before teardown.
- An OOM kill arrives as SIGKILL/9, which is indistinguishable from any other SIGKILL, so the cgroup's memory.events:oom_kill counter is the only reliable disambiguator.
- The k8s backend already detects OOMKilled (crates/spur-k8s/src/job_controller.rs::extract_failure_details) but maps it to generic Failed.
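As a rough illustration of the disambiguation step, the sketch below parses the oom_kill counter out of a cgroup-v2 memory.events file, which holds one "key value" pair per line. The helper names parse_oom_kill and read_oom_kill are hypothetical and do not come from the spurd code base; only the file name and the oom_kill key are taken from the kernel's cgroup-v2 interface.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract the `oom_kill` counter from the text of a cgroup-v2
/// `memory.events` file (one `key value` pair per line).
/// Hypothetical helper, not part of spurd.
fn parse_oom_kill(events: &str) -> Option<u64> {
    events.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            // Exact key match, so `oom_group_kill` is not picked up by mistake.
            (Some("oom_kill"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}

/// Read the counter for a job's cgroup directory (hypothetical helper).
/// This has to run before the cgroup is torn down, which is why the
/// monitor needs the captured path. A missing key is treated as zero.
fn read_oom_kill(cgroup_dir: &Path) -> io::Result<u64> {
    let text = fs::read_to_string(cgroup_dir.join("memory.events"))?;
    Ok(parse_oom_kill(&text).unwrap_or(0))
}
```

A non-zero result after the job's processes have exited would then mark the SIGKILL as an OOM kill rather than an external kill.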
Gap / work
- Node agent (spurd): after a job's processes exit, read memory.events:oom_kill from the captured cgroup path; if non-zero, flag the completion as OOM.
- Add JobState::OutOfMemory (+ display OUT_OF_MEMORY) and the OutOfMemory pending/completion reason (the PendingReason::OutOfMemory string already exists from the set added in #274, "feat(spur): close exit-code reporting gap - exit:signal, DerivedExitCode, reasons").
- Carry the OOM flag back to the controller over the signal channel via a sentinel, so no new proto/WAL fields are required.
- k8s backend: map OOMKilled to the new state instead of generic Failed.
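The work items above could fit together roughly as follows. This is a minimal sketch under stated assumptions, not the Spur implementation: the sentinel value, the function names, the reduced JobState enum, and the COMPLETED / FAILED display strings are all hypothetical. Only JobState::OutOfMemory, its OUT_OF_MEMORY display form, and the OOMKilled reason string come from the issue text.

```rust
use std::fmt;

/// Hypothetical sentinel carried in the existing signal field to mean
/// "killed by the cgroup OOM killer". The value is an assumption, chosen
/// outside the range of real Linux signal numbers.
const OOM_SIGNAL_SENTINEL: i32 = 1000;

/// Reduced stand-in for the real job-state enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Completed,
    Failed,
    OutOfMemory,
}

impl fmt::Display for JobState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            JobState::Completed => "COMPLETED",
            JobState::Failed => "FAILED",
            JobState::OutOfMemory => "OUT_OF_MEMORY",
        })
    }
}

/// Node-agent side: replace the raw signal with the sentinel when the
/// cgroup recorded at least one OOM kill.
fn reported_signal(raw_signal: Option<i32>, oom_kill_count: u64) -> Option<i32> {
    if oom_kill_count > 0 {
        Some(OOM_SIGNAL_SENTINEL)
    } else {
        raw_signal
    }
}

/// Controller side: derive the terminal state from exit code and signal.
fn terminal_state(exit_code: i32, signal: Option<i32>) -> JobState {
    match signal {
        Some(OOM_SIGNAL_SENTINEL) => JobState::OutOfMemory,
        Some(_) => JobState::Failed,
        None if exit_code == 0 => JobState::Completed,
        None => JobState::Failed,
    }
}

/// k8s side: map a container termination reason to a job state.
fn state_from_k8s_reason(reason: &str) -> JobState {
    match reason {
        "OOMKilled" => JobState::OutOfMemory,
        _ => JobState::Failed,
    }
}
```

Because the sentinel travels in a field that already exists, the controller can tell an OOM kill from a plain SIGKILL without any schema change, which is the point of the "no new proto/WAL fields" item.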
Acceptance Criteria
- An OOM-killed job reports JobState=OUT_OF_MEMORY / Reason=OutOfMemory (controller + scontrol show job / squeue)
- On the k8s backend, OOMKilled maps to the new state
Caveat / testing
The lab Slurm has cgroup-memory disabled (TaskPlugin=(null)) and Spur's spurd runs non-root on the testbed, so this is not reproducible on the current lab setup; it needs a privileged spurd / dedicated e2e runner.
Related
- The OutOfMemory reason string landed with the reason vocabulary (#301, "feat(spur): expand pending-reason vocabulary and surface reservation/license/QoS reasons").