
fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible #4879

Open
dejanzele wants to merge 2 commits into armadaproject:master from dejanzele:repro-non-preemptible-overpack

Conversation

dejanzele (Member) commented on Apr 27, 2026

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it

The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job arrives. The higher-priority job lands on the node anyway, putting it over its declared capacity. The gap applies equally to cpu, memory, and pods.

Three pieces are involved, each one defensible on its own:

  1. MarkAllocated(p, rs) in internaltypes/resource_list_map_util.go:67 only deducts allocatable from priorities <= p. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to."
  2. The rebalance eviction phase (preempting_queue_scheduler.go:118) skips non-preemptible jobs.
  3. The OversubscribedEvictor (eviction.go:164) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do.

For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays.
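
To make the first piece concrete, here is a minimal sketch of the pre-fix deduction rule, using a plain map of per-priority allocatable CPU in place of the real internaltypes structures (all names and types here are illustrative):

```go
// Pre-fix rule: binding a job at scheduled priority p deducts only from
// buckets at or below p. Buckets above p keep seeing the job's resources
// as free, on the assumption "I could preempt it if I needed to".
func markAllocated(allocatable map[int32]float64, p int32, cpus float64) {
	for prio := range allocatable {
		if prio <= p {
			allocatable[prio] -= cpus
		}
	}
}
```

With non-preemptible incumbents at priority 0 holding all 5 CPUs of a node, allocatable[2] still reads 5 under this rule, even though nothing can actually reclaim those CPUs.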

The fix lives in nodedb/nodedb.go. The bind/unbind/evict paths now compute priorityCutoffFor(job, scheduledPriority): preemptible jobs use their scheduled priority as the cutoff (existing behavior), while non-preemptible jobs use a sentinel nonPreemptibleCutoff = math.MaxInt32, so the existing markAllocated/markAllocatable helpers deduct (or release) at every real priority. Once AllocatableByPriority reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes.
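
A sketch of the cutoff selection, with the job type stubbed as an interface; only the PriorityClass().Preemptible check (see the flowchart below) is taken from the actual code, the rest is illustrative:

```go
import "math"

// Sentinel cutoff: no priority level can reclaim a non-preemptible job's
// resources, so deduct (or release) them at every real priority.
const nonPreemptibleCutoff = int32(math.MaxInt32)

// job stands in for the scheduler's job type, which exposes
// PriorityClass().Preemptible.
type job interface {
	Preemptible() bool
}

func priorityCutoffFor(j job, scheduledPriority int32) int32 {
	if j.Preemptible() {
		return scheduledPriority // existing behavior: higher priorities may preempt
	}
	return nonPreemptibleCutoff
}
```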

The PR has two commits so the bug and the fix are easy to see separately:

  • Reproducer: adds TestPreemptingQueueScheduler_NonPreemptibleOverPack. It uses cpu rather than pods so the assertion exercises the priority model itself. Run against this commit alone, the test fails.
  • Deduct non-preemptible... is the fix. The reproducer now passes, and nothing else in the suite needed touching except TestEviction, whose hardcoded expected values reflected the pre-fix accounting at high priorities; those were updated. (A worked before/after example of the accounting follows this list.)
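
To see the two accounting modes side by side, here is a self-contained toy version of the reproducer's 5-CPU scenario (the map model is illustrative, not the real nodedb types):

```go
package main

import (
	"fmt"
	"math"
)

// Toy per-priority allocatable model, CPU only.
func markAllocated(allocatable map[int32]float64, cutoff int32, cpus float64) {
	for p := range allocatable {
		if p <= cutoff {
			allocatable[p] -= cpus
		}
	}
}

func main() {
	// A 5-CPU node with three priority levels, nothing bound yet.
	preFix := map[int32]float64{0: 5, 1: 5, 2: 5}
	postFix := map[int32]float64{0: 5, 1: 5, 2: 5}

	// Non-preemptible incumbents at priority 0 take the whole node.
	markAllocated(preFix, 0, 5)              // pre-fix: cutoff = scheduled priority
	markAllocated(postFix, math.MaxInt32, 5) // post-fix: nonPreemptibleCutoff

	// Pre-fix, a priority-2 challenger sees 5 CPUs free and over-packs the
	// node; post-fix it sees 0 and is refused, which is what the test asserts.
	fmt.Println(preFix[2], postFix[2]) // 5 0
}
```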

Which issue(s) this PR fixes

Fixes #

Special notes for your reviewer

A few things to flag:

I found this while validating #4841 (the respectNodePodLimits flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it merely surfaces this older scheduler-wide issue, which this PR addresses at its root.

Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of only those <= p. The cost is negligible at scale.

Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern.

I ran the full test suite locally across internal/scheduler/..., internal/executor/..., internal/server/..., and internal/scheduleringester. Everything passes.

greptile-apps bot (Contributor) commented on Apr 27, 2026

Greptile Summary

This PR fixes a scheduler bug where higher-priority jobs could over-pack a node whose resources are fully held by non-preemptible lower-priority incumbents. The root cause was that markAllocated/markAllocatable only accounted for resources at priorities <= scheduledPriority, making non-preemptible jobs invisible to higher-priority accounting — and since neither the rebalance nor the OversubscribedEvictor will evict non-preemptible jobs, the over-allocation was never corrected.

  • Core fix in nodedb.go: introduces priorityCutoffFor(job, scheduledPriority) which returns the job's scheduled priority for preemptible jobs (preserving existing behavior) and math.MaxInt32 for non-preemptible ones, so markAllocated/markAllocatable deducts from every priority bucket rather than only those at or below the scheduled priority.
  • Test update in nodedb_test.go: corrects TestEviction expected values for high-priority buckets (28000–30000) to reflect the non-preemptible job being deducted there, and adds clarifying comments.
  • Regression test in preempting_queue_scheduler_test.go: TestPreemptingQueueScheduler_NonPreemptibleOverPack creates a 5-CPU node fully saturated by non-preemptible incumbents and asserts that a higher-priority challenger is not scheduled (and no preemptions occur).

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to non-preemptible job resource accounting and preserves all existing behavior for preemptible jobs.

The fix is symmetric across bind, unbind, and evict, the updated test expectations have been verified by hand against the new code path, and the new regression test covers the exact over-packing scenario. No other code paths are affected.
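
The bind/unbind symmetry described here can be sketched as the release-side mirror of the deduction (same illustrative map model as the sketches above):

```go
// markAllocatable mirrors markAllocated: unbinding or evicting a job must
// return resources to exactly the buckets the bind deducted from, which is
// why the cutoff is computed once and applied identically on both paths.
func markAllocatable(allocatable map[int32]float64, cutoff int32, cpus float64) {
	for p := range allocatable {
		if p <= cutoff {
			allocatable[p] += cpus
		}
	}
}
```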

No files require special attention.

Important Files Changed

  • internal/scheduler/nodedb/nodedb.go: Introduces priorityCutoffFor and the nonPreemptibleCutoff = math.MaxInt32 sentinel; applies it consistently in the bind, unbind, and evict paths. Logic is symmetric and correct.
  • internal/scheduler/nodedb/nodedb_test.go: Updates TestEviction expected allocatable values at high priority buckets to reflect the non-preemptible deduction; the arithmetic was verified by hand against the new code path.
  • internal/scheduler/scheduling/preempting_queue_scheduler_test.go: Adds TestPreemptingQueueScheduler_NonPreemptibleOverPack as a regression guard; follows existing test construction patterns (an uncommitted WriteTxn passed as jobRepo, which is idiomatic for memdb tests here).

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["bind / unbind / evict job"] --> B{"job.PriorityClass().Preemptible?"}
    B -- "Yes (e.g. PriorityClass0/1)" --> C["cutoff = scheduledPriority\n(existing behaviour)"]
    B -- "No (e.g. PriorityClass2NonPreemptible/3)" --> D["cutoff = math.MaxInt32\n(nonPreemptibleCutoff)"]
    C --> E["markAllocated/markAllocatable\nfor priorities <= scheduledPriority"]
    D --> F["markAllocated/markAllocatable\nfor ALL priority buckets"]
    E --> G["Higher-priority jobs\nsee free capacity (can preempt if needed)"]
    F --> H["Higher-priority jobs\nsee NO free capacity (non-preemptible = immovable)"]
    H --> I["Scheduler refuses to over-pack node"]
```


dejanzele force-pushed the repro-non-preemptible-overpack branch from 5eaf6a7 to 13f4b64 on April 27, 2026 at 11:22
dejanzele (Member, Author) commented:

@greptileai

dejanzele force-pushed the repro-non-preemptible-overpack branch 3 times, most recently from c752973 to 994c9c6 on April 27, 2026 at 11:30

dejanzele changed the title from "Reproducer: scheduler over-packs nodes when lower-priority incumbents are non-preemptible" to "fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible" on Apr 27, 2026

dejanzele force-pushed the repro-non-preemptible-overpack branch 4 times, most recently from 4c52a1b to 475ead0 on April 27, 2026 at 13:51
dejanzele added 2 commits on May 8, 2026:

  • … are non-preemptible (Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>)
  • …r-pack (Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>)
dejanzele force-pushed the repro-non-preemptible-overpack branch from 475ead0 to f9ad2fd on May 8, 2026 at 13:36
