fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible #4879

Open
dejanzele wants to merge 2 commits into armadaproject:master from dejanzele:repro-non-preemptible-overpack

Conversation

@dejanzele
Member

@dejanzele dejanzele commented Apr 27, 2026

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it

The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job shows up. The higher-priority job lands on the node anyway, putting it over its declared capacity. The same gap applies to cpu, memory, and pods.

Three pieces are involved, each one defensible on its own:

  1. MarkAllocated(p, rs) in internaltypes/resource_list_map_util.go:67 only deducts allocatable from priorities <= p. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to."
  2. The rebalance eviction phase (preempting_queue_scheduler.go:118) skips non-preemptible jobs.
  3. The OversubscribedEvictor (eviction.go:164) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do.

For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays.

The fix lives in nodedb/nodedb.go. The bind/unbind/evict paths now compute a priorityCutoffFor(job, scheduledPriority): preemptible jobs use their scheduled priority as the cutoff (existing behavior), non-preemptible jobs use a sentinel nonPreemptibleCutoff = math.MaxInt32 so the existing markAllocated/markAllocatable helpers deduct (or release) at every real priority. Once AllocatableByPriority reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes.

The PR has two commits so the bug and the fix are easy to see separately:

  • Reproducer: adds TestPreemptingQueueScheduler_NonPreemptibleOverPack. It uses cpu rather than pods so the assertion is on the priority model itself. Run against this commit alone, the test fails.
  • Deduct non-preemptible... is the fix. The reproducer now passes, and nothing else in the suite needed touching except TestEviction, which had hardcoded expected values reflecting the pre-fix accounting at high priorities; those values were updated.

Which issue(s) this PR fixes

Fixes #

Special notes for your reviewer

A few things to flag:

I found this while validating #4841 (the respectNodePodLimits flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it just surfaces this older scheduler-wide issue, which this PR addresses at its root.

Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of only those <= p. The overhead is negligible at scale.

Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern.

I ran the full test suite locally across internal/scheduler/..., internal/executor/..., internal/server/..., and internal/scheduleringester. Everything passes.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR fixes a scheduler over-packing bug where non-preemptible jobs at lower priorities left higher-priority buckets showing artificially free resources, allowing a new job to land on an already-saturated node. The fix introduces priorityCutoffFor in nodedb.go, routing non-preemptible jobs through nonPreemptibleCutoff = math.MaxInt32 so markAllocated/markAllocatable deducts from every priority bucket rather than only those ≤ scheduledPriority. Bind, evict, and unbind operations are all updated symmetrically, and the evicted-state path correctly continues to use EvictedPriority regardless of preemptibility.

Confidence Score: 5/5

Safe to merge — fix is logically correct, well-tested, and symmetric across bind/evict/unbind.

No P0 or P1 issues found. The sentinel value math.MaxInt32 correctly covers all real priority buckets (Kubernetes priorities are in [0, 1,000,000,000]). The math import is already present. Bind/evict/unbind accounting is symmetric. The regression test clearly exercises the fixed scenario. The TestEviction expected-value updates are arithmetically correct.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/scheduler/nodedb/nodedb.go | Core fix: adds priorityCutoffFor and the nonPreemptibleCutoff=math.MaxInt32 sentinel so non-preemptible jobs deduct from every priority bucket; bind/evict/unbind operations are symmetric and correctly handle the evicted-state branching. |
| internal/scheduler/nodedb/nodedb_test.go | Updates TestEviction expected values to reflect that a non-preemptible priority-3 job now deducts from all priority buckets (including 28000/29000/30000), not just those ≤ 3. |
| internal/scheduler/scheduling/preempting_queue_scheduler_test.go | Adds TestPreemptingQueueScheduler_NonPreemptibleOverPack regression test: a 5-CPU node saturated by non-preemptible priority-2 incumbents must reject a higher-priority challenger after the fix. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["bindJobToNodeInPlace / unbindJobFromNodeInPlace / evictJobFromNodeInPlace"] --> B{"isEvicted?"}
    B -- "Yes (unbind only)" --> C["markAllocatable(EvictedPriority, rs)\n(same for preemptible & non-preemptible)"]
    B -- "No" --> D{"priorityCutoffFor(job, priority)"}
    D -- "job.Preemptible == true" --> E["cutoff = scheduledPriority\n→ deduct from buckets ≤ priority"]
    D -- "job.Preemptible == false" --> F["cutoff = math.MaxInt32\n→ deduct from ALL buckets"]
    E --> G["markAllocated / markAllocatable applied"]
    F --> G
    G --> H{"isEvicted? (bind only)"}
    H -- "Yes" --> I["markAllocatable(EvictedPriority, rs)\n(release evicted slot)"]
    H -- "No" --> J["Done"]
    I --> J
```


@dejanzele dejanzele force-pushed the repro-non-preemptible-overpack branch from 5eaf6a7 to 13f4b64 Compare April 27, 2026 11:22
@dejanzele
Member Author

@greptileai

@dejanzele dejanzele force-pushed the repro-non-preemptible-overpack branch 3 times, most recently from c752973 to 994c9c6 Compare April 27, 2026 11:30
@dejanzele dejanzele changed the title Reproducer: scheduler over-packs nodes when lower-priority incumbents are non-preemptible fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible Apr 27, 2026
@dejanzele dejanzele force-pushed the repro-non-preemptible-overpack branch 2 times, most recently from 279fcdc to 0912df5 Compare April 27, 2026 12:36
… are non-preemptible

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the repro-non-preemptible-overpack branch from 0912df5 to 4c52a1b Compare April 27, 2026 13:40
…r-pack

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the repro-non-preemptible-overpack branch from 4c52a1b to 475ead0 Compare April 27, 2026 13:51
