
fix(mocker): gate vLLM waiting-request admission on full ISL fit (#9718) #9721

Open
nealvaidya wants to merge 4 commits into main from nealv/fix-mocker-vllm-reserve-full-isl-9718

@nealvaidya nealvaidya commented May 19, 2026

Summary

  • Adds scheduler_reserve_full_isl: bool to MockEngineArgs (default true), mirroring vLLM's upstream default.
  • In the mocker's vLLM scheduler (lib/mocker/src/scheduler/vllm/core.rs), gates the from_waiting=true admission path on whether the full ISL (minus prefix-cache hits) fits in currently free KV capacity. Returns ScheduleOutcome::Blocked before the shared preemption branch, so running requests are no longer evicted to make room for an admission real vLLM would have refused.
  • Wires the kwarg through the pyo3 binding (lib/bindings/python/rust/llm/replay.rs) and the _core.pyi stub.

Fixes #9718.

Why

Real vLLM passes full_sequence_must_fit=self.scheduler_reserve_full_isl into kv_cache_manager.allocate_slots() on the waiting-request path (vllm/v1/core/sched/scheduler.py:721-740); on None it breaks the waiting loop without preempting. The mocker had no equivalent gate — schedule_request() used a single path for running and waiting requests with a chunk-sized allocation target, so a partial allocation for a fresh admission fell through to state.preempt(), producing preemption thrash on long-context multi-session replays (>26 min vs ~32s on a Mooncake replay).

SGLang's path is unchanged: real SGLang's admission gate is unconditional (see python/sglang/srt/managers/schedule_policy.py:add_one_req, if total_tokens >= self.rem_total_tokens: return AddReqResult.NO_TOKEN), and the mocker's lib/mocker/src/scheduler/sglang/prefill.rs:75-79 already mirrors that unconditional check.

What does vLLM actually do with a request that can't currently fit?

It defers admission; it does not reject. vLLM layers two distinct mechanisms, and this PR mirrors only one of them:

  1. Hard reject at request ingest (out of scope for this PR). vLLM rejects a request with len(prompt_tokens) > max_model_len at the API/engine layer with a 400. This is independent of the scheduler. The mocker is a simulator, not an engine, and doesn't model this path — requests are accepted regardless of size.

  2. Scheduler admission deferral (what this PR mirrors). Inside the waiting-request loop:

    new_blocks = self.kv_cache_manager.allocate_slots(
        request, num_new_tokens, ...,
        full_sequence_must_fit=self.scheduler_reserve_full_isl,
    )
    if new_blocks is None:
        # The request cannot be scheduled.
        ...
        break

    (vllm/v1/core/sched/scheduler.py:721-740)

    The request was peek_request'd (not popped) before this, so on break it stays at the head of self.waiting. Next scheduler pass tries again. There is no time bound and no derived "this will never fit" rejection — vLLM trusts that num_gpu_blocks is sized to admit any prompt that passed the ingest check (~max_model_len tokens), so any well-formed request is admissible eventually.

The all-or-nothing precheck inside allocate_slots is here: vllm/v1/core/kv_cache_manager.py:346-360. And the default flag value is here: vllm/config/scheduler.py:140.

The mocker behavior with this PR matches mechanism #2: a request whose full ISL doesn't fit stays in waiting, the gate returns ScheduleOutcome::Blocked for that pass, no preempt fires, and the next pass retries.
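
The deferral semantics described above can be sketched as a small decision function. This is a simplified illustration, not the mocker's actual code: `Request`, its fields, and the `schedule_request` signature are hypothetical, and only the `Blocked` outcome name and the `scheduler_reserve_full_isl` flag come from the PR.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ScheduleOutcome(Enum):
    SCHEDULED = auto()
    BLOCKED = auto()    # request stays in waiting and is retried next pass
    PREEMPTED = auto()  # legacy path: a running request gets evicted


@dataclass
class Request:
    # Hypothetical fields, for illustration only.
    uncached_blocks: int  # blocks the full ISL needs after prefix-cache hits
    chunk_blocks: int     # blocks needed for just this pass's prefill chunk


def schedule_request(req, free_blocks, from_waiting, reserve_full_isl=True):
    # Gate: a fresh admission must fit its whole ISL, not just one chunk.
    if from_waiting and reserve_full_isl and req.uncached_blocks > free_blocks:
        return ScheduleOutcome.BLOCKED
    # Shared path: chunk-sized target; a shortfall falls through to preemption.
    if req.chunk_blocks > free_blocks:
        return ScheduleOutcome.PREEMPTED
    return ScheduleOutcome.SCHEDULED


r3 = Request(uncached_blocks=3, chunk_blocks=3)
gated = schedule_request(r3, free_blocks=2, from_waiting=True)
legacy = schedule_request(r3, free_blocks=2, from_waiting=True, reserve_full_isl=False)
```

With the gate on, the over-capacity admission is deferred before the preemption branch is ever reached; with the gate off, the same request falls through to preemption, which is the legacy behavior the opt-out preserves.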

Test plan

  • cargo test -p dynamo-mocker --lib — 252 passed (2 new tests below + all existing).
  • cargo clippy -p dynamo-mocker --all-targets --no-deps -- -D warnings — clean.
  • cargo fmt --all -- --check — clean.
  • New test test_full_isl_gate_blocks_admission_without_preempt: regression for #9718. r3 (12-token prompt, needs 3 blocks) arrives with 2 free blocks left after r1+r2 are running. With the default gate enabled, r3 stays in waiting and r2's num_preemptions == 0.
  • New test test_full_isl_gate_disabled_falls_back_to_preempt: same setup with scheduler_reserve_full_isl(false) — r2 is preempted, verifying the opt-out preserves legacy behavior.
  • Two existing preemption tests (test_preemption_requeues_newest_running_request, test_preemption_recompute_events_apply_cleanly) explicitly opt out via .scheduler_reserve_full_isl(false) to keep exercising LIFO/KV-removal event paths.
  • Round-trip JSON test asserts scheduler_reserve_full_isl survives serde.
  • Python bindings: macOS build of lib/bindings/python blocked by pre-existing Linux-only deps (O_DIRECT, NUMA, fallocate) — relying on Linux CI for full verification.
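
The block arithmetic behind the first regression test can be checked by hand. The sketch below assumes a block size of 4 tokens, which is consistent with "12-token prompt, needs 3 blocks" but is not stated in the PR; `blocks_needed` is a hypothetical helper.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: a partially filled block still occupies a whole block."""
    return -(-num_tokens // block_size)


BLOCK_SIZE = 4   # assumption, inferred from 12 tokens -> 3 blocks
FREE_BLOCKS = 2  # free blocks left after r1 and r2 are running (from the test plan)

r3_demand = blocks_needed(12, BLOCK_SIZE)
r3_fits = r3_demand <= FREE_BLOCKS
```

Since r3's demand exceeds the free pool, the gate should leave r3 in waiting instead of evicting r2.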

🤖 Generated with Claude Code

@nealvaidya nealvaidya requested a review from a team as a code owner May 19, 2026 00:27
copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented May 19, 2026

Walkthrough

This PR implements a KV-capacity admission gate for waiting requests in the vLLM scheduler mocker. The new scheduler_reserve_full_isl flag prevents the scheduler from admitting requests from the waiting queue when their full input sequence cannot fit available KV cache, matching real vLLM behavior and eliminating preemption thrash under memory pressure.

Changes

Full-ISL Admission Gate

| Layer | File(s) | Summary |
|---|---|---|
| MockEngineArgs config field and JSON serialization | lib/mocker/src/common/protocols.rs | scheduler_reserve_full_isl: bool field added to MockEngineArgs with builder default true. JSON serialization and deserialization updated to handle the field; unit test updated to round-trip the new field. |
| Python bindings and type stubs | lib/bindings/python/rust/llm/replay.rs, lib/bindings/python/src/dynamo/_core.pyi | PyO3 constructor defaults updated to include scheduler_reserve_full_isl and wire it into the Rust builder. dump_json() extended to emit the field. Type-stub signature declares the new parameter with default True. |
| vLLM scheduler waiting-request admission gate | lib/mocker/src/scheduler/vllm/core.rs | When admitting a waiting request and scheduler_reserve_full_isl is enabled, check if full input sequence (accounting for prefix-cache hits via prefill_cost.new_blocks) fits available KV capacity; return Blocked without preempting if it cannot fit. Refactor to compute prefill_cost once and derive cached_prefix_tokens from prefill_cost.cached_tokens. |
| Scheduler regression tests and test adjustments | lib/mocker/src/scheduler/vllm/tests.rs | Add two regression tests covering gate semantics: enabled state rejects over-capacity waiting requests without preemption; disabled state uses legacy chunk-fits-then-preempt behavior. Update existing preemption tests to explicitly disable the gate so LIFO ordering and KV-removal assertions remain valid. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a gate for vLLM waiting-request admission based on full ISL fit, and fixes the referenced issue #9718. |
| Linked Issues check | ✅ Passed | The PR addresses all coding requirements from #9718: adds scheduler_reserve_full_isl flag, gates admission for waiting requests based on full ISL fit, prevents preemption when ISL cannot fit, and implements all-or-none admission semantics. |
| Out of Scope Changes check | ✅ Passed | All changes directly address #9718 requirements: MockEngineArgs additions, vLLM scheduler gate implementation, Python bindings wiring, and regression tests. No unrelated changes detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description comprehensively covers the overview, detailed changes, rationale, implementation details, and test plan with specific file references and linked issue. |





@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
lib/mocker/src/common/protocols.rs (1)

1014-1044: ⚡ Quick win

Assert scheduler_reserve_full_isl in the JSON round-trip test.

The payload now carries this field, but the test does not verify it survived deserialization, so regressions on this key can pass unnoticed.

Suggested test assertion
         let restored = MockEngineArgs::from_json_str(&payload.to_string()).unwrap();

         assert_eq!(restored.worker_type, WorkerType::Decode);
         assert_eq!(restored.max_num_seqs, None);
         assert_eq!(restored.max_num_batched_tokens, None);
+        assert_eq!(
+            restored.scheduler_reserve_full_isl,
+            args.scheduler_reserve_full_isl
+        );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/mocker/src/common/protocols.rs` around lines 1014 - 1044, The JSON
round-trip test constructs payload including "scheduler_reserve_full_isl" but
never asserts it after deserialization; update the test that builds payload,
calls MockEngineArgs::from_json_str, and compares fields (the variable restored
of type MockEngineArgs) to include an assertion that
restored.scheduler_reserve_full_isl equals args.scheduler_reserve_full_isl (or
the expected value used in the test) so the key is verified through
serialization/deserialization.
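
The point of this nitpick can be illustrated with a minimal round-trip check. This is a sketch under assumptions: `EngineArgs` is a hypothetical Python stand-in for `MockEngineArgs`, and the standard `json` module stands in for the Rust serde path. Setting the flag to its non-default value matters, because a serializer that silently dropped the key would otherwise restore the default and still pass.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EngineArgs:
    # Hypothetical stand-in for MockEngineArgs; only the flag name is from the PR.
    max_num_seqs: Optional[int] = None
    scheduler_reserve_full_isl: bool = True


def round_trip(args: EngineArgs) -> EngineArgs:
    """Serialize to a JSON string and rebuild the args from it."""
    payload = json.dumps(asdict(args))
    return EngineArgs(**json.loads(payload))


# Use the non-default value so a dropped key cannot hide behind the default.
args = EngineArgs(scheduler_reserve_full_isl=False)
restored = round_trip(args)
assert restored.scheduler_reserve_full_isl == args.scheduler_reserve_full_isl
```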

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ec54f5c5-f169-40cb-9b99-790794cde860

📥 Commits

Reviewing files that changed from the base of the PR and between 828abba and 3d6fb9b.

📒 Files selected for processing (5)
  • lib/bindings/python/rust/llm/replay.rs
  • lib/bindings/python/src/dynamo/_core.pyi
  • lib/mocker/src/common/protocols.rs
  • lib/mocker/src/scheduler/vllm/core.rs
  • lib/mocker/src/scheduler/vllm/tests.rs

@nealvaidya nealvaidya force-pushed the nealv/fix-mocker-vllm-reserve-full-isl-9718 branch 2 times, most recently from 09d35f9 to 9828ca8 Compare May 19, 2026 00:38
@nealvaidya

/ok to test 9828ca8

@dynamo-ops

lib/bindings/python/rust/llm/replay.rs:136 — Adding scheduler_reserve_full_isl before speedup_ratio changes the public Python constructor's positional argument order, so existing positional callers bind speedup_ratio and later arguments to the wrong parameters or fail. Fix: append the new optional parameter after the existing parameters or make it keyword-only while preserving the old positional order.
lib/mocker/src/scheduler/vllm/core.rs:735 — The unconditional get_prefill_cost call makes every running decode pass scan the sequence blocks even though the value is only used for zero-computed or waiting-gate cases, causing an O(context blocks) hot-path regression. Fix: compute the prefill cost lazily only when request.num_computed_tokens == 0 or the waiting-request gate is active.
lib/mocker/src/scheduler/vllm/core.rs:770 — Comparing only prefill_cost.new_blocks to free_blocks undercounts inactive prefix-cache hits because free_blocks includes inactive blocks that become active when reused, so a cached-prefix request with an uncached suffix can still pass the gate and then preempt. Fix: count all blocks that are not already active, or split prefix hits into active versus inactive before the fit check.
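
The first finding above, about positional argument order, can be reproduced with plain Python functions. This is only an illustration: the three functions and the `num_gpu_blocks` parameter are hypothetical stand-ins for the real constructor, and only `speedup_ratio` and `scheduler_reserve_full_isl` are names from the review. The last variant both appends the new parameter and makes it keyword-only, combining the two fixes the finding suggests.

```python
def args_before(num_gpu_blocks=16384, speedup_ratio=1.0):
    """Signature before the change (hypothetical, trimmed to two parameters)."""
    return {"num_gpu_blocks": num_gpu_blocks, "speedup_ratio": speedup_ratio}


def args_inserted(num_gpu_blocks=16384, scheduler_reserve_full_isl=True,
                  speedup_ratio=1.0):
    """New flag inserted before speedup_ratio: shifts positional callers."""
    return {"num_gpu_blocks": num_gpu_blocks,
            "scheduler_reserve_full_isl": scheduler_reserve_full_isl,
            "speedup_ratio": speedup_ratio}


def args_appended(num_gpu_blocks=16384, speedup_ratio=1.0, *,
                  scheduler_reserve_full_isl=True):
    """New flag appended and keyword-only: old positional calls still bind."""
    return {"num_gpu_blocks": num_gpu_blocks,
            "scheduler_reserve_full_isl": scheduler_reserve_full_isl,
            "speedup_ratio": speedup_ratio}


legacy_call = (1024, 2.0)  # an existing caller passing speedup_ratio positionally
shifted = args_inserted(*legacy_call)
safe = args_appended(*legacy_call)
```

In `shifted`, the value 2.0 lands in `scheduler_reserve_full_isl` and `speedup_ratio` silently keeps its default, which is the mis-binding the finding warns about.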

@nealvaidya

/ok to test a0d481a


@dreamtalen dreamtalen left a comment

LGTM overall, thanks for the fix!

Comment thread lib/mocker/src/scheduler/vllm/core.rs Outdated
    // running requests are not evicted to make room for an admission
    // that real vLLM would never have accepted.
    if from_waiting && self.args.scheduler_reserve_full_isl && remaining_known_tokens > 0 {
        let cost = self.kv_manager.get_prefill_cost(&request.sequence);

We have a get_prefill_cost call for this request above at L747. Could you refactor this a bit to avoid calling it twice?
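
One way to satisfy this request is to compute the cost at most once and share it between both consumers. The sketch below is a Python analogue under assumptions: `CountingKvManager`, `plan`, and the dict-shaped cost are hypothetical, and the real change uses a Rust `Option<PrefillCost>` per the later commit message.

```python
class CountingKvManager:
    """Hypothetical stand-in that counts how often the cost scan runs."""

    def __init__(self):
        self.calls = 0

    def get_prefill_cost(self, sequence):
        self.calls += 1
        return {"cached_tokens": 0, "new_blocks": len(sequence)}


def plan(kv, sequence, num_computed_tokens, from_waiting, reserve_full_isl=True):
    gate_active = from_waiting and reserve_full_isl
    # Only scan when a consumer actually needs the result.
    needs_cost = num_computed_tokens == 0 or gate_active
    cost = kv.get_prefill_cost(sequence) if needs_cost else None

    # Consumer 1: cached-prefix accounting for a request with nothing computed.
    cached_prefix_tokens = 0
    if cost is not None and num_computed_tokens == 0:
        cached_prefix_tokens = cost["cached_tokens"]
    # Consumer 2: demand used by the waiting-admission gate.
    gate_demand = cost["new_blocks"] if cost is not None and gate_active else None
    return cached_prefix_tokens, gate_demand


kv = CountingKvManager()
fresh = plan(kv, ["b0", "b1", "b2"], num_computed_tokens=0, from_waiting=True)
calls_after_fresh = kv.calls
decode = plan(kv, ["b0", "b1", "b2"], num_computed_tokens=8, from_waiting=False)
calls_after_decode = kv.calls
```

A fresh admission from waiting triggers one scan that serves both consumers, while a running decode pass triggers none.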

    };
    if self.registered_blocks.contains_key(plh) {
        overlap += 1;
        inactive_overlap += 1;

Can we add a regression case that exercises this inactive-overlap branch? The new scheduler tests use enable_prefix_caching(false), so inactive_overlap_blocks stays 0 and the cost.new_blocks + cost.inactive_overlap_blocks admission check is only covered for fully cold prompts. A small prefix-caching test with an inactive cached prefix and tight capacity would lock down the vLLM parity behavior where touching an evictable cached block consumes free capacity.
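
The fit check this comment refers to can be sketched as follows. The numbers (2 new blocks, 3 inactive-overlap blocks, 4 free blocks) are taken from the follow-up commit message in this PR; the Python `PrefillCost` dataclass and `fits_full_isl` helper are hypothetical stand-ins for the Rust code.

```python
from dataclasses import dataclass


@dataclass
class PrefillCost:
    new_blocks: int               # blocks with no cache hit at all
    inactive_overlap_blocks: int  # cache hits sitting in the inactive (evictable) pool


def fits_full_isl(cost: PrefillCost, free_blocks: int) -> bool:
    # Reusing an inactive cached block promotes it to active, so it consumes
    # free-pool capacity just like a brand-new block does.
    demand = cost.new_blocks + cost.inactive_overlap_blocks
    return demand <= free_blocks


cost = PrefillCost(new_blocks=2, inactive_overlap_blocks=3)
free_blocks = 4

naive_fits = cost.new_blocks <= free_blocks    # undercounts demand
gated_fits = fits_full_isl(cost, free_blocks)  # counts inactive overlap too
```

Counting only `new_blocks` would admit the request, whereas the combined demand of 5 exceeds the 4 free blocks and the request is blocked without preempting.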


@PeaBrane PeaBrane left a comment

Approving on condition that @dreamtalen and my comments are addressed

nealvaidya and others added 4 commits May 19, 2026 12:06
Mirror vLLM's `scheduler_reserve_full_isl=True` default. Before admitting a
waiting request, refuse if the full ISL (minus prefix-cache hits) cannot
fit in currently free KV capacity. Returning `Blocked` here bypasses the
shared preemption branch, so running requests are no longer evicted to
make room for an admission that real vLLM would never accept.

Previously the mocker's `schedule_request()` used a single path for
running and waiting requests with a chunk-sized allocation target; a
partial allocation for a fresh admission fell through to `state.preempt()`,
producing preemption thrash on long-context multi-session replays
(observed: 26+ min vs 32s on a Mooncake replay with max_num_seqs=32).

The SGLang mocker path is unchanged — real SGLang's admission gate is
unconditional, and the mocker's sglang prefill admit already mirrors it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
- Move `scheduler_reserve_full_isl` to the end of the Python
  `MockEngineArgs(...)` signature so positional callers of the
  existing kwargs aren't shifted.
- Restore lazy `get_prefill_cost` computation: only call it when
  `num_computed_tokens == 0` (for `cached_prefix_tokens`) or when
  the waiting-admission gate actually fires. Removes the
  O(seq_blocks) scan that the unconditional call introduced on
  every running-decode pass.
- Tighten the gate: include `inactive_overlap_blocks` in the
  demand. Reusing a cached block from the inactive pool promotes
  it from inactive to active and consumes free-pool capacity,
  so omitting it from the demand could let a request slip past
  the gate and still hit preemption on allocation.
- Extend `PrefillCost` with `inactive_overlap_blocks` and split the
  overlap loop in `KvManager::get_prefill_cost` into active vs
  inactive.
- Assert `scheduler_reserve_full_isl` in the JSON round-trip test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
- Compute `get_prefill_cost` exactly once per call. Hoist into an
  `Option<PrefillCost>` shared by the `cached_prefix_tokens` calc
  and the waiting-admission gate so a fresh-from-waiting admit no
  longer scans the sequence blocks twice. (per @dreamtalen)
- Add `test_full_isl_gate_counts_inactive_overlap_blocks`: drives
  r1 to completion with prefix caching enabled so its blocks land
  in the inactive pool, then submits r2 sharing 12 prefix tokens
  with r1 plus 8 new tokens against a 2-block-pinned worker.
  Demand = 2 new + 3 inactive_overlap = 5, free = 4 → blocked
  without preempting. Locks down the inactive-overlap branch that
  the existing `enable_prefix_caching=false` tests didn't reach.
  (per @PeaBrane)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
@nealvaidya nealvaidya force-pushed the nealv/fix-mocker-vllm-reserve-full-isl-9718 branch from edfea90 to 54733f4 Compare May 19, 2026 19:40
@nealvaidya

/ok to test 54733f4

Development

Successfully merging this pull request may close these issues.

bug(mocker): vLLM scheduler over-admits chunked-prefill requests under KV pressure (missing scheduler_reserve_full_isl semantics)

4 participants