fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible #4879
dejanzele wants to merge 2 commits into armadaproject:master
Conversation
Greptile Summary: This PR fixes a scheduler bug where higher-priority jobs could over-pack a node whose resources are fully held by non-preemptible lower-priority incumbents. The root cause was that resource accounting only deducted allocatable at priorities at or below a job's scheduled priority, so from a higher-priority view the capacity held by non-preemptible jobs still looked free.
Confidence Score: 5/5. Safe to merge — the change is narrowly scoped to non-preemptible job resource accounting and preserves all existing behavior for preemptible jobs. The fix is symmetric across bind, unbind, and evict, the updated test expectations have been verified by hand against the new code path, and the new regression test covers the exact over-packing scenario. No other code paths are affected. No files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["bind / unbind / evict job"] --> B{"job.PriorityClass().Preemptible?"}
    B -- "Yes (e.g. PriorityClass0/1)" --> C["cutoff = scheduledPriority\n(existing behaviour)"]
    B -- "No (e.g. PriorityClass2NonPreemptible/3)" --> D["cutoff = math.MaxInt32\n(nonPreemptibleCutoff)"]
    C --> E["markAllocated/markAllocatable\nfor priorities <= scheduledPriority"]
    D --> F["markAllocated/markAllocatable\nfor ALL priority buckets"]
    E --> G["Higher-priority jobs\nsee free capacity (can preempt if needed)"]
    F --> H["Higher-priority jobs\nsee NO free capacity (non-preemptible = immovable)"]
    H --> I["Scheduler refuses to over-pack node"]
```
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it
The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job shows up. The higher-priority job lands on the node anyway, putting it over its declared capacity. The same gap is there for cpu, memory, and pods.
Three pieces are involved, each one defensible on its own:
- `MarkAllocated(p, rs)` in `internaltypes/resource_list_map_util.go:67` only deducts allocatable from priorities `<= p`. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to."
- The eviction pass (`preempting_queue_scheduler.go:118`) skips non-preemptible jobs.
- The OversubscribedEvictor (`eviction.go:164`) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do.

For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays. A minimal sketch of the accounting gap follows below.
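To make the gap concrete, here is a minimal, self-contained sketch — not Armada's actual types or helpers; the bucket layout, names, and numbers are illustrative — of allocatable-by-priority accounting where binding deducts only at priorities at or below the cutoff:

```go
package main

import "fmt"

const nodeCPU = 32.0

// allocatableByPriority[p] = CPU a job scheduled at priority p believes is still free.
var allocatableByPriority = map[int32]float64{0: nodeCPU, 1: nodeCPU, 2: nodeCPU, 3: nodeCPU}

// markAllocated mirrors the pre-fix assumption: deduct only at priorities <= cutoff,
// because anything scheduled above the cutoff "could just preempt" the bound job.
func markAllocated(cutoff int32, cpu float64) {
	for p := range allocatableByPriority {
		if p <= cutoff {
			allocatableByPriority[p] -= cpu
		}
	}
}

func main() {
	// Non-preemptible incumbents at priority 1 take the whole node.
	markAllocated(1, nodeCPU)

	// The priority-3 view is untouched: the node still looks completely free,
	// even though nothing at priority 1 can actually be evicted.
	fmt.Println(allocatableByPriority[1], allocatableByPriority[3]) // 0 32
}
```

With the node fully held at priority 1 by non-preemptible jobs, the priority-3 view still reports a full node, which is exactly the opening the over-pack walks through.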
The fix lives in `nodedb/nodedb.go`. The bind/unbind/evict paths now compute a `priorityCutoffFor(job, scheduledPriority)`: preemptible jobs use their scheduled priority as the cutoff (existing behavior), while non-preemptible jobs use a sentinel `nonPreemptibleCutoff = math.MaxInt32` so the existing `markAllocated`/`markAllocatable` helpers deduct (or release) at every real priority. Once `AllocatableByPriority` reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes.
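Under simplified assumptions (a plain boolean for preemptibility instead of Armada's job type), the cutoff rule described above can be sketched as:

```go
package nodedbsketch

import "math"

// nonPreemptibleCutoff sits above any real scheduling priority, so using it as the
// cutoff makes the allocation helpers touch every priority bucket.
const nonPreemptibleCutoff = int32(math.MaxInt32)

// priorityCutoffFor is a simplified stand-in for the helper described above; the
// real one takes Armada's job type and reads job.PriorityClass().Preemptible.
func priorityCutoffFor(preemptible bool, scheduledPriority int32) int32 {
	if preemptible {
		// Existing behaviour: only buckets at or below the scheduled priority are
		// deducted, since higher priorities could reclaim this capacity by preempting.
		return scheduledPriority
	}
	// Non-preemptible: deduct at every priority so no one sees the capacity as free.
	return nonPreemptibleCutoff
}
```

Because the sentinel is above every real priority, the existing `markAllocated`/`markAllocatable` loops cover both cases without a separate code path; the flowchart in the review summary above traces the same decision.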
The PR has two commits so the bug and the fix are easy to see separately:
- "Reproducer": adds `TestPreemptingQueueScheduler_NonPreemptibleOverPack`. It uses cpu rather than pods so the assertion is on the priority model itself. Run against this commit alone, the test fails.
- "Deduct non-preemptible..." is the fix. The reproducer now passes and nothing else in the suite needed touching, except `TestEviction`, which had hardcoded expected values reflecting the pre-fix accounting at high priorities. Updated.

Which issue(s) this PR fixes
Fixes #
Special notes for your reviewer
A few things to flag:
- I found this while validating #4841 (the `respectNodePodLimits` flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it just surfaces this older scheduler-wide issue, which is what's being addressed here at its real root.
- Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of only those `<= p`. Invisible at scale.
- Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern.
I ran the full test suite locally across `internal/scheduler/...`, `internal/executor/...`, `internal/server/...`, and `internal/scheduleringester`. Everything passes.