parity(slurm): add OUT_OF_MEMORY job state — detect cgroup oom_kill and report #308

@yansun1996

Summary

Spur has no OUT_OF_MEMORY job state. When a job is OOM-killed, it currently reports as a generic Failed (or, on the bare-metal path, as a SIGKILL with no disambiguation). This diverges from Slurm, which reports JobState=OUT_OF_MEMORY with Reason=OutOfMemory.

This is part of the broader Category-4 (job lifecycle & state machine) parity effort. It is not purely a job-state change: the detection lives in the node agent (spurd) cgroup plumbing, and the result must be threaded back to the controller.

Current state

  • spurd already builds a per-job cgroup-v2 and sets memory.max / memory.oom.group (crates/spurd/src/executor.rs::setup_cgroup); the monitor captures the cgroup path before teardown.
  • An OOM kill arrives as SIGKILL (signal 9), which is indistinguishable from any other SIGKILL, so the oom_kill counter in the cgroup's memory.events file is the only reliable disambiguator.
  • The k8s backend already detects OOMKilled (crates/spur-k8s/src/job_controller.rs::extract_failure_details) but maps it to generic Failed.
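
For illustration, reading the counter could look roughly like the sketch below. It assumes the standard cgroup-v2 `memory.events` layout (one `key value` pair per line); `parse_oom_kill` and `read_oom_kill` are hypothetical helper names, not existing spurd functions.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract the `oom_kill` counter from the text of a cgroup-v2 `memory.events` file.
/// Returns `None` if the key is absent or its value is not an unsigned integer.
/// The first token is matched exactly, so `oom_group_kill` is not mistaken for `oom_kill`.
pub fn parse_oom_kill(contents: &str) -> Option<u64> {
    contents.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("oom_kill"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}

/// Hypothetical helper: read the counter from a job's cgroup directory.
/// Must run before the cgroup is torn down; a missing key is treated as 0.
pub fn read_oom_kill(cgroup_dir: &Path) -> io::Result<u64> {
    let contents = fs::read_to_string(cgroup_dir.join("memory.events"))?;
    Ok(parse_oom_kill(&contents).unwrap_or(0))
}
```

Keeping the parser separate from the file read means the parsing logic can be unit-tested without a real cgroup, although that alone would not exercise the real detection path.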

Gap / work

Acceptance Criteria

  • OOM-killed batch job reports JobState=OUT_OF_MEMORY / Reason=OutOfMemory (controller + scontrol show job / squeue)
  • Bare-metal path distinguishes OOM-kill from a plain SIGKILL via the cgroup counter
  • k8s OOMKilled maps to the new state
  • Test that drives the real detection path (cgroup counter → state), not just a Display/serde check
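
A minimal sketch of the mapping these criteria describe, under stated assumptions: the `JobState` variants, the `Exit` type, and both function names are hypothetical and do not reflect Spur's actual types. The sketch only treats a SIGKILL as an OOM kill when the cgroup counter is non-zero, and maps the Kubernetes `OOMKilled` termination reason to the same state.

```rust
use std::fmt;

/// Hypothetical, reduced job-state enum; Spur's real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Completed,
    Failed,
    OutOfMemory,
}

impl fmt::Display for JobState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Slurm-style state names, as they would appear in squeue / scontrol output.
        let name = match self {
            JobState::Completed => "COMPLETED",
            JobState::Failed => "FAILED",
            JobState::OutOfMemory => "OUT_OF_MEMORY",
        };
        f.write_str(name)
    }
}

/// How the job's main process ended (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Exit {
    Code(i32),
    Signal(i32),
}

const SIGKILL: i32 = 9;

/// Bare-metal path: a SIGKILL counts as an OOM kill only when the cgroup's
/// `oom_kill` counter is non-zero; any other abnormal exit stays Failed.
pub fn classify_exit(exit: Exit, oom_kill_count: u64) -> JobState {
    match exit {
        Exit::Code(0) => JobState::Completed,
        Exit::Signal(SIGKILL) if oom_kill_count > 0 => JobState::OutOfMemory,
        _ => JobState::Failed,
    }
}

/// k8s path: map the container termination reason to a job state.
pub fn state_from_k8s_reason(reason: &str) -> JobState {
    if reason == "OOMKilled" {
        JobState::OutOfMemory
    } else {
        JobState::Failed
    }
}
```

Whether a job that exits cleanly while a child process was OOM-killed should also report OUT_OF_MEMORY is left open here; this sketch reports it as Completed.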

Caveat / testing

The lab Slurm has cgroup memory enforcement disabled (TaskPlugin=(null)), and Spur's spurd runs non-root on the testbed, so this is not reproducible on the current lab setup. It needs a privileged spurd / dedicated e2e runner.

Related
