parity(slurm): add OUT_OF_MEMORY job state — detect cgroup oom_kill and report

## Summary

Spur has no `OUT_OF_MEMORY` job state. When a job is OOM-killed, it currently reports as a generic `Failed` (or, on the bare-metal path, as a SIGKILL with no disambiguation). This diverges from Slurm, which reports `JobState=OUT_OF_MEMORY` with `Reason=OutOfMemory`.

This is part of the broader Category-4 (job lifecycle & state machine) parity effort. It is **not purely a job-state change** — the detection lives in the node agent (`spurd`) cgroup plumbing and must be threaded back to the controller.

## Current state

- `spurd` **already** creates a per-job cgroup (v2) and sets `memory.max` / `memory.oom.group` (`crates/spurd/src/executor.rs::setup_cgroup`); the monitor captures the cgroup path before teardown.
- An OOM kill arrives as SIGKILL/9, which is indistinguishable from any other SIGKILL — so the cgroup's `memory.events:oom_kill` counter is the **only** reliable disambiguator.
- The k8s backend already detects `OOMKilled` (`crates/spur-k8s/src/job_controller.rs::extract_failure_details`) but maps it to generic `Failed`.
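
For reference, a minimal sketch of reading the counter described above. The function names `parse_oom_kill` and `read_oom_kill` are hypothetical and not taken from the `spurd` codebase; the only assumption about the kernel interface is the cgroup v2 `memory.events` flat-keyed format (`<key> <value>` per line).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract the `oom_kill` counter from the text of a cgroup v2 `memory.events` file.
/// Returns `None` if the key is absent or its value is not an unsigned integer.
fn parse_oom_kill(events: &str) -> Option<u64> {
    events.lines().find_map(|line| {
        let mut fields = line.split_whitespace();
        match (fields.next(), fields.next()) {
            // Match the key exactly so `oom` and `oom_group_kill` are not mistaken for it.
            (Some("oom_kill"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}

/// Hypothetical helper: read the counter for the job cgroup at `cgroup_path`.
/// A readable file without an `oom_kill` line is treated as zero kills;
/// an unreadable file is surfaced as an error so the caller can decide.
fn read_oom_kill(cgroup_path: &Path) -> io::Result<u64> {
    let events = fs::read_to_string(cgroup_path.join("memory.events"))?;
    Ok(parse_oom_kill(&events).unwrap_or(0))
}
```

The read has to happen while the cgroup directory still exists, which is why the captured path (before teardown) matters.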

## Gap / work

- **Node agent (`spurd`)**: after a job's processes exit, read `memory.events:oom_kill` from the captured cgroup path; if non-zero, flag the completion as OOM.
- **State**: add `JobState::OutOfMemory` (+ display `OUT_OF_MEMORY`) and the `OutOfMemory` pending/completion reason (the `PendingReason::OutOfMemory` string already exists from the reason vocabulary in #301).
- **Reporting**: thread the OOM signal back to the controller. The design reuses #274's per-node `signal` channel via a sentinel, so **no new proto/WAL fields** are required.
- **k8s backend**: map detected `OOMKilled` to the new state instead of generic `Failed`.
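
The state side of the work items above could look roughly like this. It is a sketch under stated assumptions: the enum is an illustrative three-variant subset rather than the real `JobState`, and `classify_completion` with its parameters is a hypothetical name. It does not model the sentinel on the `signal` channel, whose encoding is not specified here.

```rust
use std::fmt;

/// Illustrative subset of job states; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Completed,
    Failed,
    OutOfMemory,
}

impl fmt::Display for JobState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Slurm-style upper-case names, as shown by `squeue` / `scontrol`.
        let name = match self {
            JobState::Completed => "COMPLETED",
            JobState::Failed => "FAILED",
            JobState::OutOfMemory => "OUT_OF_MEMORY",
        };
        f.write_str(name)
    }
}

/// Hypothetical classifier. `oom_kill_count` is the value read from the job
/// cgroup's `memory.events` after the processes exit. A non-zero count wins
/// over the exit status, because the SIGKILL alone cannot identify an OOM kill.
fn classify_completion(exit_code: i32, signal: Option<i32>, oom_kill_count: u64) -> JobState {
    if oom_kill_count > 0 {
        JobState::OutOfMemory
    } else if exit_code == 0 && signal.is_none() {
        JobState::Completed
    } else {
        JobState::Failed
    }
}
```

With this shape, a job killed by signal 9 stays `FAILED` unless the counter is non-zero, which is the disambiguation the bare-metal path currently lacks.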

## Acceptance Criteria

- [ ] OOM-killed batch job reports `JobState=OUT_OF_MEMORY` / `Reason=OutOfMemory` (controller + `scontrol show job` / `squeue`)
- [ ] Bare-metal path distinguishes OOM-kill from a plain SIGKILL via the cgroup counter
- [ ] k8s `OOMKilled` maps to the new state
- [ ] Test that drives the real detection path (cgroup counter → state), not just a Display/serde check
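
One possible way to exercise the counter-to-decision path without a real OOM kill is a fixture directory that mimics a job cgroup. This is only a sketch: `job_was_oom_killed` and `fake_cgroup` are hypothetical names, and a fixture like this covers the file parsing and decision logic, not the kernel actually killing a process.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical detection entry point: true if the cgroup at `cgroup_path`
/// recorded at least one OOM kill in `memory.events`.
fn job_was_oom_killed(cgroup_path: &Path) -> io::Result<bool> {
    let events = fs::read_to_string(cgroup_path.join("memory.events"))?;
    let count = events
        .lines()
        // The trailing space keeps `oom` and `oom_group_kill` from matching.
        .filter_map(|line| line.strip_prefix("oom_kill "))
        .filter_map(|value| value.trim().parse::<u64>().ok())
        .next()
        .unwrap_or(0);
    Ok(count > 0)
}

/// Test fixture: create a throwaway directory containing a `memory.events`
/// file with the given `oom_kill` count, laid out like a cgroup v2 directory.
fn fake_cgroup(name: &str, oom_kill: u64) -> io::Result<PathBuf> {
    let dir = std::env::temp_dir()
        .join(format!("spur-oom-fixture-{}-{}", std::process::id(), name));
    fs::create_dir_all(&dir)?;
    let events = format!(
        "low 0\nhigh 0\nmax 0\noom {0}\noom_kill {0}\noom_group_kill 0\n",
        oom_kill
    );
    fs::write(dir.join("memory.events"), events)?;
    Ok(dir)
}
```

A test would call `fake_cgroup("hit", 1)` and `fake_cgroup("clean", 0)`, assert on `job_was_oom_killed` for each, and remove the directories afterwards.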

## Caveat / testing

The lab Slurm has cgroup-memory disabled (`TaskPlugin=(null)`), and Spur's `spurd` runs non-root on the testbed, so this is **not reproducible on the current lab setup**. It needs a privileged `spurd` / dedicated e2e runner.

## Related

- #274 — exit:signal channel this reuses; the `OutOfMemory` reason string landed with the reason vocabulary (#301).
- #307 — pending-reason emission umbrella.
- Part of Category-4 job-lifecycle parity (siblings: DEADLINE #263, dependency engine #271, exit-code #274, suspend/resume #275).


parity(slurm): add OUT_OF_MEMORY job state — detect cgroup oom_kill and report #308
