fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible #4879
dejanzele wants to merge 2 commits into armadaproject:master
Conversation
Greptile Summary: This PR fixes a scheduler over-packing bug where non-preemptible jobs at lower priorities left higher-priority buckets showing artificially free resources, allowing a new job to land on an already-saturated node. The fix introduces a priority cutoff so non-preemptible jobs are deducted from every priority bucket. Confidence Score: 5/5. Safe to merge — the fix is logically correct, well-tested, and symmetric across bind/evict/unbind. No P0 or P1 issues found. No files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["bindJobToNodeInPlace / unbindJobFromNodeInPlace / evictJobFromNodeInPlace"] --> B{"isEvicted?"}
    B -- "Yes (unbind only)" --> C["markAllocatable(EvictedPriority, rs)\n(same for preemptible & non-preemptible)"]
    B -- "No" --> D{"priorityCutoffFor(job, priority)"}
    D -- "job.Preemptible == true" --> E["cutoff = scheduledPriority\n→ deduct from buckets ≤ priority"]
    D -- "job.Preemptible == false" --> F["cutoff = math.MaxInt32\n→ deduct from ALL buckets"]
    E --> G["markAllocated / markAllocatable applied"]
    F --> G
    G --> H{"isEvicted? (bind only)"}
    H -- "Yes" --> I["markAllocatable(EvictedPriority, rs)\n(release evicted slot)"]
    H -- "No" --> J["Done"]
```
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it
The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job shows up. The higher-priority job lands on the node anyway, putting it over its declared capacity. The same gap exists for cpu, memory, and pods.
Three pieces are involved, each one defensible on its own:

- `MarkAllocated(p, rs)` in `internaltypes/resource_list_map_util.go:67` only deducts allocatable from priorities `<= p`. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to."
- The eviction pass (`preempting_queue_scheduler.go:118`) skips non-preemptible jobs.
- The OversubscribedEvictor (`eviction.go:164`) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do.

For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays.
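To make the first piece concrete, here is a minimal sketch of the `<= p` accounting rule. The `allocatableByPriority` map and `markAllocated` function below are hypothetical simplifications (a single int64 of CPU millicores instead of Armada's real resource lists), intended only to show why higher-priority buckets keep reporting resources as free.

```go
package main

import "fmt"

// allocatableByPriority is a hypothetical, simplified stand-in for the
// scheduler's allocatable-by-priority accounting: priority -> free CPU (m).
type allocatableByPriority map[int32]int64

// markAllocated mirrors the documented behavior of MarkAllocated(p, rs):
// deduct rs only from priority buckets <= p. Buckets above p still report
// the resources as free, on the assumption "I could preempt if needed".
func markAllocated(a allocatableByPriority, p int32, rs int64) {
	for prio := range a {
		if prio <= p {
			a[prio] -= rs
		}
	}
}

func main() {
	a := allocatableByPriority{0: 4000, 10: 4000, 20: 4000}
	// A job at priority 10 takes the whole node (4000m).
	markAllocated(a, 10, 4000)
	// Buckets 0 and 10 are now empty, but priority 20 still looks free:
	// if the priority-10 job is non-preemptible, that view is wrong.
	fmt.Println(a[0], a[10], a[20]) // 0 0 4000
}
```

If the priority-10 incumbent is non-preemptible, the 4000m shown free at priority 20 can never actually be reclaimed, which is exactly the over-pack described above.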
The fix lives in `nodedb/nodedb.go`. The bind/unbind/evict paths now compute a `priorityCutoffFor(job, scheduledPriority)`: preemptible jobs use their scheduled priority as the cutoff (existing behavior); non-preemptible jobs use a sentinel, `nonPreemptibleCutoff = math.MaxInt32`, so the existing `markAllocated`/`markAllocatable` helpers deduct (or release) at every real priority. Once `AllocatableByPriority` reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes.

The PR has two commits so the bug and the fix are easy to see separately:
- Reproducer: adds `TestPreemptingQueueScheduler_NonPreemptibleOverPack`. It uses cpu rather than pods so the assertion is on the priority model itself. Run against this commit alone, the test fails.
- `Deduct non-preemptible...` is the fix. The reproducer now passes and nothing else in the suite needed touching, except `TestEviction`, which had hardcoded expected values reflecting the pre-fix accounting at high priorities. Updated.

Which issue(s) this PR fixes
Fixes #
Special notes for your reviewer
A few things to flag:
- I found this while validating #4841 (the `respectNodePodLimits` flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it just surfaces this older scheduler-wide issue, which is what's being addressed here at its real root.
- Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of `<= p`. Invisible at scale.
- Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern.
I ran the full test suite locally across `internal/scheduler/...`, `internal/executor/...`, `internal/server/...`, and `internal/scheduleringester`. Everything passes.