fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible #4879
dejanzele wants to merge 2 commits into armadaproject:master
Conversation
Greptile Summary: This PR fixes a scheduler bug where higher-priority jobs could over-pack a node whose resources are fully held by non-preemptible lower-priority incumbents. The root cause was that resource accounting only deducted allocatable at priorities at or below a job's scheduled priority, so from a higher-priority view the capacity held by non-preemptible jobs still looked free.
Confidence Score: 5/5. Safe to merge — the change is narrowly scoped to non-preemptible job resource accounting and preserves all existing behavior for preemptible jobs. The fix is symmetric across bind, unbind, and evict, the updated test expectations have been verified by hand against the new code path, and the new regression test covers the exact over-packing scenario. No other code paths are affected. No files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["bind / unbind / evict job"] --> B{"job.PriorityClass().Preemptible?"}
    B -- "Yes (e.g. PriorityClass0/1)" --> C["cutoff = scheduledPriority\n(existing behaviour)"]
    B -- "No (e.g. PriorityClass2NonPreemptible/3)" --> D["cutoff = math.MaxInt32\n(nonPreemptibleCutoff)"]
    C --> E["markAllocated/markAllocatable\nfor priorities <= scheduledPriority"]
    D --> F["markAllocated/markAllocatable\nfor ALL priority buckets"]
    E --> G["Higher-priority jobs\nsee free capacity (can preempt if needed)"]
    F --> H["Higher-priority jobs\nsee NO free capacity (non-preemptible = immovable)"]
    H --> I["Scheduler refuses to over-pack node"]
```
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it
The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job shows up. The higher-priority job lands on the node anyway, putting it over its declared capacity. The same gap is there for cpu, memory, and pods.
Three pieces are involved, each one defensible on its own:
- `MarkAllocated(p, rs)` in `internaltypes/resource_list_map_util.go:67` only deducts allocatable from priorities `<= p`. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to."
- The eviction pass (`preempting_queue_scheduler.go:118`) skips non-preemptible jobs.
- The OversubscribedEvictor (`eviction.go:164`) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do.

For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays. A minimal sketch of the accounting gap follows below.
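To make the gap concrete, here is a minimal, self-contained sketch — not Armada's actual types or helpers; the bucket layout, names, and numbers are illustrative — of allocatable-by-priority accounting where binding deducts only at priorities at or below the cutoff:

```go
package main

import "fmt"

const nodeCPU = 32.0

// allocatableByPriority[p] = CPU a job scheduled at priority p believes is still free.
var allocatableByPriority = map[int32]float64{0: nodeCPU, 1: nodeCPU, 2: nodeCPU, 3: nodeCPU}

// markAllocated mirrors the pre-fix assumption: deduct only at priorities <= cutoff,
// because anything scheduled above the cutoff "could just preempt" the bound job.
func markAllocated(cutoff int32, cpu float64) {
	for p := range allocatableByPriority {
		if p <= cutoff {
			allocatableByPriority[p] -= cpu
		}
	}
}

func main() {
	// Non-preemptible incumbents at priority 1 take the whole node.
	markAllocated(1, nodeCPU)

	// The priority-3 view is untouched: the node still looks completely free,
	// even though nothing at priority 1 can actually be evicted.
	fmt.Println(allocatableByPriority[1], allocatableByPriority[3]) // 0 32
}
```

With the node fully held at priority 1 by non-preemptible jobs, the priority-3 view still reports a full node, which is exactly the opening the over-pack walks through.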
The fix lives in `nodedb/nodedb.go`. The bind/unbind/evict paths now compute a `priorityCutoffFor(job, scheduledPriority)`: preemptible jobs use their scheduled priority as the cutoff (existing behavior), while non-preemptible jobs use a sentinel `nonPreemptibleCutoff = math.MaxInt32` so the existing `markAllocated`/`markAllocatable` helpers deduct (or release) at every real priority. Once `AllocatableByPriority` reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes.
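Under simplified assumptions (a plain boolean for preemptibility instead of Armada's job type), the cutoff rule described above can be sketched as:

```go
package nodedbsketch

import "math"

// nonPreemptibleCutoff sits above any real scheduling priority, so using it as the
// cutoff makes the allocation helpers touch every priority bucket.
const nonPreemptibleCutoff = int32(math.MaxInt32)

// priorityCutoffFor is a simplified stand-in for the helper described above; the
// real one takes Armada's job type and reads job.PriorityClass().Preemptible.
func priorityCutoffFor(preemptible bool, scheduledPriority int32) int32 {
	if preemptible {
		// Existing behaviour: only buckets at or below the scheduled priority are
		// deducted, since higher priorities could reclaim this capacity by preempting.
		return scheduledPriority
	}
	// Non-preemptible: deduct at every priority so no one sees the capacity as free.
	return nonPreemptibleCutoff
}
```

Because the sentinel is above every real priority, the existing `markAllocated`/`markAllocatable` loops cover both cases without a separate code path; the flowchart in the review summary above traces the same decision.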
The PR has two commits so the bug and the fix are easy to see separately:
- "Reproducer": adds `TestPreemptingQueueScheduler_NonPreemptibleOverPack`. It uses cpu rather than pods so the assertion is on the priority model itself. Run against this commit alone, the test fails.
- "Deduct non-preemptible..." is the fix. The reproducer now passes and nothing else in the suite needed touching, except `TestEviction`, which had hardcoded expected values reflecting the pre-fix accounting at high priorities. Updated.

Which issue(s) this PR fixes
Fixes #
Special notes for your reviewer
A few things to flag:
- I found this while validating #4841 (the `respectNodePodLimits` flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it just surfaces this older scheduler-wide issue, which is what's being addressed here at its real root.
- Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of only those `<= p`. Invisible at scale.
- Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern.
I ran the full test suite locally across `internal/scheduler/...`, `internal/executor/...`, `internal/server/...`, and `internal/scheduleringester`. Everything passes.