fix(mocker): gate vLLM waiting-request admission on full ISL fit (#9718) by nealvaidya · Pull Request #9721 · ai-dynamo/dynamo

nealvaidya · 2026-05-19T00:27:49Z

Summary

- Adds scheduler_reserve_full_isl: bool to MockEngineArgs (default true), mirroring vLLM's upstream default.
- In the mocker's vLLM scheduler (lib/mocker/src/scheduler/vllm/core.rs), gates the from_waiting=true admission path on whether the full ISL (minus prefix-cache hits) fits in currently free KV capacity. Returns ScheduleOutcome::Blocked before the shared preemption branch, so running requests are no longer evicted to make room for an admission real vLLM would have refused.
- Wires the kwarg through the pyo3 binding (lib/bindings/python/rust/llm/replay.rs) and the _core.pyi stub.

Why

Real vLLM passes full_sequence_must_fit=self.scheduler_reserve_full_isl into kv_cache_manager.allocate_slots() on the waiting-request path (vllm/v1/core/sched/scheduler.py:721-740); when allocate_slots() returns None, the scheduler breaks out of the waiting loop without preempting. The mocker had no equivalent gate: schedule_request() used a single path for running and waiting requests with a chunk-sized allocation target, so a partial allocation for a fresh admission fell through to state.preempt(). The result was preemption thrash on long-context multi-session replays (more than 26 min without the gate vs ~32s with it, on a Mooncake replay).
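The difference between the two admission targets can be illustrated with a toy model. This is a minimal sketch, not the mocker's code: `BLOCK_SIZE`, `blocks_for`, and `admit_waiting` are hypothetical names, the block size of 4 is an assumption, and preemption is reduced to a returned label.

```python
from math import ceil

BLOCK_SIZE = 4  # tokens per KV block (assumed for illustration)


def blocks_for(tokens: int) -> int:
    """Number of KV blocks needed to hold `tokens` tokens."""
    return ceil(tokens / BLOCK_SIZE)


def admit_waiting(isl_tokens: int, chunk_tokens: int, free_blocks: int,
                  reserve_full_isl: bool) -> str:
    """Decide what happens to the head of the waiting queue in one pass.

    Returns "admitted", "blocked", or "preempt" (the legacy fall-through).
    """
    full_need = blocks_for(isl_tokens)
    chunk_need = blocks_for(min(chunk_tokens, isl_tokens))
    if reserve_full_isl:
        # Gate: the whole prompt must fit, otherwise running requests are left alone.
        return "admitted" if full_need <= free_blocks else "blocked"
    # Legacy: only the next chunk is checked; a shortfall triggers preemption.
    return "admitted" if chunk_need <= free_blocks else "preempt"


gated = admit_waiting(12, 12, free_blocks=2, reserve_full_isl=True)
legacy = admit_waiting(12, 12, free_blocks=2, reserve_full_isl=False)
over_admitted = admit_waiting(12, 4, free_blocks=2, reserve_full_isl=False)
```

With a 12-token prompt and 2 free blocks, the gated path reports `blocked` while the legacy path reports `preempt`. With a 4-token chunk the legacy path admits a request whose remaining chunks cannot fit, which is the over-admission that later shows up as thrash.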

SGLang's path is unchanged: real SGLang's admission gate is unconditional (see python/sglang/srt/managers/schedule_policy.py:add_one_req, if total_tokens >= self.rem_total_tokens: return AddReqResult.NO_TOKEN), and the mocker's lib/mocker/src/scheduler/sglang/prefill.rs:75-79 already mirrors that unconditional check.

What does vLLM actually do with a request that can't currently fit?

It defers admission; it does not reject. vLLM layers two distinct mechanisms, and this PR mirrors only the second:

1. Hard reject at request ingest (out of scope for this PR). vLLM rejects a request with len(prompt_tokens) > max_model_len at the API/engine layer with a 400. This is independent of the scheduler. The mocker is a simulator, not an engine, and doesn't model this path; requests are accepted regardless of size.
2. Scheduler admission deferral (what this PR mirrors). Inside the waiting-request loop:
```
new_blocks = self.kv_cache_manager.allocate_slots(
    request, num_new_tokens, ...,
    full_sequence_must_fit=self.scheduler_reserve_full_isl,
)
if new_blocks is None:
    # The request cannot be scheduled.
    ...
    break
```
(vllm/v1/core/sched/scheduler.py:721-740)

The request was peek_request'd (not popped) before this, so on break it stays at the head of self.waiting. Next scheduler pass tries again. There is no time bound and no derived "this will never fit" rejection — vLLM trusts that num_gpu_blocks is sized to admit any prompt that passed the ingest check (~max_model_len tokens), so any well-formed request is admissible eventually.

The all-or-nothing precheck inside allocate_slots is here: vllm/v1/core/kv_cache_manager.py:346-360. And the default flag value is here: vllm/config/scheduler.py:140.

The mocker behavior with this PR matches mechanism #2: a request whose full ISL doesn't fit stays in waiting, the gate returns ScheduleOutcome::Blocked for that pass, no preempt fires, and the next pass retries.
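The peek, break, and retry behavior described above can be sketched as a single scheduling pass over a queue. This is an illustrative model under stated assumptions: `schedule_pass` and the `need` mapping are hypothetical, and KV accounting is reduced to one integer of free blocks.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def schedule_pass(waiting: Deque[str], free_blocks: int,
                  need: Dict[str, int]) -> Tuple[List[str], int]:
    """Run one pass over the waiting queue.

    `need` maps request id -> blocks required for its full ISL.
    Returns the ids admitted this pass and the remaining free blocks.
    """
    admitted = []
    while waiting:
        req = waiting[0]             # peek, do not pop
        if need[req] > free_blocks:  # full ISL does not fit right now
            break                    # deferred: req stays at the head
        waiting.popleft()
        free_blocks -= need[req]
        admitted.append(req)
    return admitted, free_blocks


waiting = deque(["r3", "r4"])
need = {"r3": 3, "r4": 1}

# Pass 1: only 2 blocks free, so r3 is deferred and the loop stops.
first, free_after_first = schedule_pass(waiting, 2, need)
still_waiting = list(waiting)

# Pass 2: a running request has finished and 4 blocks are free.
second, free_after_second = schedule_pass(waiting, 4, need)
```

Nothing is rejected and nothing is preempted in the first pass; the deferred request is simply retried once capacity appears.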

Test plan

- cargo test -p dynamo-mocker --lib: 252 passed (2 new tests below + all existing).
- cargo clippy -p dynamo-mocker --all-targets --no-deps -- -D warnings: clean.
- cargo fmt --all -- --check: clean.
- New test test_full_isl_gate_blocks_admission_without_preempt: regression for #9718 (bug(mocker): vLLM scheduler over-admits chunked-prefill requests under KV pressure (missing scheduler_reserve_full_isl semantics)). r3 (12-token prompt, needs 3 blocks) arrives with 2 free blocks left after r1+r2 are running. With the default gate enabled, r3 stays in waiting and r2's num_preemptions == 0.
- New test test_full_isl_gate_disabled_falls_back_to_preempt: same setup with scheduler_reserve_full_isl(false). r2 is preempted, verifying the opt-out preserves legacy behavior.
- Two existing preemption tests (test_preemption_requeues_newest_running_request, test_preemption_recompute_events_apply_cleanly) explicitly opt out via .scheduler_reserve_full_isl(false) to keep exercising LIFO/KV-removal event paths.
- Round-trip JSON test asserts scheduler_reserve_full_isl survives serde.
- Python bindings: the macOS build of lib/bindings/python is blocked by pre-existing Linux-only deps (O_DIRECT, NUMA, fallocate), so full verification relies on Linux CI.

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-19T00:27:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-19T00:33:09Z

Walkthrough

This PR implements a KV-capacity admission gate for waiting requests in the vLLM scheduler mocker. The new scheduler_reserve_full_isl flag prevents the scheduler from admitting requests from the waiting queue when their full input sequence cannot fit available KV cache, matching real vLLM behavior and eliminating preemption thrash under memory pressure.

Changes

Full-ISL Admission Gate

| Layer / File(s) | Summary |
|---|---|
| MockEngineArgs config field and JSON serialization `lib/mocker/src/common/protocols.rs` | `scheduler_reserve_full_isl: bool` field added to `MockEngineArgs` with builder default `true`. JSON serialization and deserialization updated to handle the field; unit test updated to round-trip the new field. |
| Python bindings and type stubs `lib/bindings/python/rust/llm/replay.rs`, `lib/bindings/python/src/dynamo/_core.pyi` | PyO3 constructor defaults updated to include `scheduler_reserve_full_isl` and wire it into the Rust builder. `dump_json()` extended to emit the field. Type-stub signature declares the new parameter with default `True`. |
| vLLM scheduler waiting-request admission gate `lib/mocker/src/scheduler/vllm/core.rs` | When admitting a waiting request and `scheduler_reserve_full_isl` is enabled, check if full input sequence (accounting for prefix-cache hits via `prefill_cost.new_blocks`) fits available KV capacity; return `Blocked` without preempting if it cannot fit. Refactor to compute `prefill_cost` once and derive `cached_prefix_tokens` from `prefill_cost.cached_tokens`. |
| Scheduler regression tests and test adjustments `lib/mocker/src/scheduler/vllm/tests.rs` | Add two regression tests covering gate semantics: enabled state rejects over-capacity waiting requests without preemption; disabled state uses legacy chunk-fits-then-preempt behavior. Update existing preemption tests to explicitly disable the gate so LIFO ordering and KV-removal assertions remain valid. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a gate for vLLM waiting-request admission based on full ISL fit, and fixes the referenced issue `#9718`. |
| Linked Issues check | ✅ Passed | The PR addresses all coding requirements from `#9718`: adds scheduler_reserve_full_isl flag, gates admission for waiting requests based on full ISL fit, prevents preemption when ISL cannot fit, and implements all-or-none admission semantics. |
| Out of Scope Changes check | ✅ Passed | All changes directly address `#9718` requirements: MockEngineArgs additions, vLLM scheduler gate implementation, Python bindings wiring, and regression tests. No unrelated changes detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description comprehensively covers the overview, detailed changes, rationale, implementation details, and test plan with specific file references and linked issue. |

coderabbitai

🧹 Nitpick comments (1)

lib/mocker/src/common/protocols.rs (1)

1014-1044: ⚡ Quick win

Assert scheduler_reserve_full_isl in the JSON round-trip test.

The payload now carries this field, but the test does not verify it survived deserialization, so regressions on this key can pass unnoticed.

Suggested test assertion

         let restored = MockEngineArgs::from_json_str(&payload.to_string()).unwrap();

         assert_eq!(restored.worker_type, WorkerType::Decode);
         assert_eq!(restored.max_num_seqs, None);
         assert_eq!(restored.max_num_batched_tokens, None);
+        assert_eq!(
+            restored.scheduler_reserve_full_isl,
+            args.scheduler_reserve_full_isl
+        );
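The point of the suggested assertion can be shown with a small round-trip model. This is a sketch under stated assumptions: `ToyEngineArgs` is a hypothetical Python stand-in with two fields, not the Rust `MockEngineArgs`, and it uses the standard `json` module in place of serde.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyEngineArgs:
    """Hypothetical stand-in for MockEngineArgs with only two fields."""
    max_num_seqs: Optional[int] = None
    scheduler_reserve_full_isl: bool = True

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "ToyEngineArgs":
        return cls(**json.loads(payload))


# Use the non-default value so a dropped key cannot hide behind the default.
args = ToyEngineArgs(scheduler_reserve_full_isl=False)
restored = ToyEngineArgs.from_json(args.to_json())
```

Round-tripping the non-default value is a deliberate choice here: if the key were silently dropped during serialization, the restored object would fall back to `True` and the comparison would fail.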

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/mocker/src/common/protocols.rs` around lines 1014 - 1044, The JSON
round-trip test constructs payload including "scheduler_reserve_full_isl" but
never asserts it after deserialization; update the test that builds payload,
calls MockEngineArgs::from_json_str, and compares fields (the variable restored
of type MockEngineArgs) to include an assertion that
restored.scheduler_reserve_full_isl equals args.scheduler_reserve_full_isl (or
the expected value used in the test) so the key is verified through
serialization/deserialization.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ec54f5c5-f169-40cb-9b99-790794cde860

📥 Commits

Reviewing files that changed from the base of the PR and between 828abba and 3d6fb9b.

📒 Files selected for processing (5)

lib/bindings/python/rust/llm/replay.rs
lib/bindings/python/src/dynamo/_core.pyi
lib/mocker/src/common/protocols.rs
lib/mocker/src/scheduler/vllm/core.rs
lib/mocker/src/scheduler/vllm/tests.rs

nealvaidya · 2026-05-19T00:39:15Z

/ok to test 9828ca8

dynamo-ops · 2026-05-19T00:39:30Z

lib/bindings/python/rust/llm/replay.rs:136 — Adding scheduler_reserve_full_isl before speedup_ratio changes the public Python constructor's positional argument order, so existing positional callers bind speedup_ratio and later arguments to the wrong parameters or fail. Fix: append the new optional parameter after the existing parameters or make it keyword-only while preserving the old positional order.
lib/mocker/src/scheduler/vllm/core.rs:735 — The unconditional get_prefill_cost call makes every running decode pass scan the sequence blocks even though the value is only used for zero-computed or waiting-gate cases, causing an O(context blocks) hot-path regression. Fix: compute the prefill cost lazily only when request.num_computed_tokens == 0 or the waiting-request gate is active.
lib/mocker/src/scheduler/vllm/core.rs:770 — Comparing only prefill_cost.new_blocks to free_blocks undercounts inactive prefix-cache hits because free_blocks includes inactive blocks that become active when reused, so a cached-prefix request with an uncached suffix can still pass the gate and then preempt. Fix: count all blocks that are not already active, or split prefix hits into active versus inactive before the fit check.
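The first finding, about positional argument order, can be reproduced in plain Python. The three functions below are hypothetical stand-ins for the constructor: only `scheduler_reserve_full_isl` and `speedup_ratio` are names taken from this PR, while `num_gpu_blocks` and all default values are assumptions for illustration.

```python
def old_ctor(num_gpu_blocks=16384, speedup_ratio=1.0):
    """Signature before the change."""
    return {"num_gpu_blocks": num_gpu_blocks, "speedup_ratio": speedup_ratio}


def shifted_ctor(num_gpu_blocks=16384, scheduler_reserve_full_isl=True,
                 speedup_ratio=1.0):
    """New parameter inserted in the middle: positional callers are shifted."""
    return {
        "num_gpu_blocks": num_gpu_blocks,
        "scheduler_reserve_full_isl": scheduler_reserve_full_isl,
        "speedup_ratio": speedup_ratio,
    }


def safe_ctor(num_gpu_blocks=16384, speedup_ratio=1.0, *,
              scheduler_reserve_full_isl=True):
    """New parameter appended as keyword-only: old positional calls still work."""
    return {
        "num_gpu_blocks": num_gpu_blocks,
        "scheduler_reserve_full_isl": scheduler_reserve_full_isl,
        "speedup_ratio": speedup_ratio,
    }


# An existing caller that passes speedup_ratio positionally.
broken = shifted_ctor(1024, 2.0)
preserved = safe_ctor(1024, 2.0)
```

In `broken`, the value `2.0` lands in `scheduler_reserve_full_isl` and `speedup_ratio` silently keeps its default; in `preserved`, the call binds exactly as it did against `old_ctor`.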

nealvaidya · 2026-05-19T01:01:43Z

/ok to test a0d481a

dreamtalen

LGTM overall, thanks for the fix!

dreamtalen · 2026-05-19T06:38:45Z

+        // running requests are not evicted to make room for an admission
+        // that real vLLM would never have accepted.
+        if from_waiting && self.args.scheduler_reserve_full_isl && remaining_known_tokens > 0 {
+            let cost = self.kv_manager.get_prefill_cost(&request.sequence);


We have a get_prefill_cost call for this request above at L747. Could you refactor this a bit to avoid calling it twice?
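One way to satisfy both this request and the earlier hot-path concern is to compute the cost lazily and cache it for the duration of the call. The sketch below is a simplified Python model, not the Rust implementation: `get_prefill_cost` is a counting stand-in, the dictionary keys borrow the field names mentioned in this PR, and the `remaining_known_tokens > 0` condition from the diff is omitted.

```python
from typing import Optional

scan_calls = 0  # counts how often the expensive scan runs


def get_prefill_cost(sequence: list) -> dict:
    """Hypothetical stand-in for an O(len(sequence)) block scan."""
    global scan_calls
    scan_calls += 1
    return {"new_blocks": len(sequence), "cached_tokens": 0}


def schedule_request(sequence, num_computed_tokens, from_waiting,
                     reserve_full_isl, free_blocks):
    cost: Optional[dict] = None  # filled in on first use, then reused

    def prefill_cost() -> dict:
        nonlocal cost
        if cost is None:
            cost = get_prefill_cost(sequence)
        return cost

    # Needed only for requests that have not computed anything yet.
    cached_prefix_tokens = (
        prefill_cost()["cached_tokens"] if num_computed_tokens == 0 else 0
    )
    # Needed only when the waiting-admission gate is active.
    if from_waiting and reserve_full_isl:
        if prefill_cost()["new_blocks"] > free_blocks:
            return "blocked", cached_prefix_tokens
    return "scheduled", cached_prefix_tokens


# Fresh admission from waiting: both consumers share one scan.
admit_result = schedule_request(["b0", "b1", "b2"], 0, True, True, free_blocks=2)
scans_after_admit = scan_calls
# Running decode pass: neither consumer fires, so no scan happens.
decode_result = schedule_request(["b0", "b1", "b2"], 12, False, True, free_blocks=2)
scans_after_decode = scan_calls
```

The counter shows the two properties the reviewers asked for: a fresh admission scans once even though two code paths read the cost, and a running decode pass does not scan at all.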

PeaBrane · 2026-05-19T14:35:01Z

                        };
                        if self.registered_blocks.contains_key(plh) {
                            overlap += 1;
+                            inactive_overlap += 1;


Can we add a regression case that exercises this inactive-overlap branch? The new scheduler tests use enable_prefix_caching(false), so inactive_overlap_blocks stays 0 and the cost.new_blocks + cost.inactive_overlap_blocks admission check is only covered for fully cold prompts. A small prefix-caching test with an inactive cached prefix and tight capacity would lock down the vLLM parity behavior where touching an evictable cached block consumes free capacity.

PeaBrane

Approving on condition that @dreamtalen's and my comments are addressed

Mirror vLLM's `scheduler_reserve_full_isl=True` default. Before admitting a waiting request, refuse if the full ISL (minus prefix-cache hits) cannot fit in currently free KV capacity. Returning `Blocked` here bypasses the shared preemption branch, so running requests are no longer evicted to make room for an admission that real vLLM would never accept.

Previously the mocker's `schedule_request()` used a single path for running and waiting requests with a chunk-sized allocation target; a partial allocation for a fresh admission fell through to `state.preempt()`, producing preemption thrash on long-context multi-session replays (observed: 26+ min vs 32s on a Mooncake replay with max_num_seqs=32).

The SGLang mocker path is unchanged: real SGLang's admission gate is unconditional, and the mocker's sglang prefill admit already mirrors it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>

- Move `scheduler_reserve_full_isl` to the end of the Python `MockEngineArgs(...)` signature so positional callers of the existing kwargs aren't shifted.
- Restore lazy `get_prefill_cost` computation: only call it when `num_computed_tokens == 0` (for `cached_prefix_tokens`) or when the waiting-admission gate actually fires. Removes the O(seq_blocks) scan that the unconditional call introduced on every running-decode pass.
- Tighten the gate: include `inactive_overlap_blocks` in the demand. Reusing a cached block from the inactive pool promotes it from inactive to active and consumes free-pool capacity, so omitting it from the demand could let a request slip past the gate and still hit preemption on allocation.
- Extend `PrefillCost` with `inactive_overlap_blocks` and split the overlap loop in `KvManager::get_prefill_cost` into active vs inactive.
- Assert `scheduler_reserve_full_isl` in the JSON round-trip test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>

@dreamtalen

- Compute `get_prefill_cost` exactly once per call. Hoist into an `Option<PrefillCost>` shared by the `cached_prefix_tokens` calc and the waiting-admission gate so a fresh-from-waiting admit no longer scans the sequence blocks twice. (per @dreamtalen)
- Add `test_full_isl_gate_counts_inactive_overlap_blocks`: drives r1 to completion with prefix caching enabled so its blocks land in the inactive pool, then submits r2 sharing 12 prefix tokens with r1 plus 8 new tokens against a 2-block-pinned worker. Demand = 2 new + 3 inactive_overlap = 5, free = 4 → blocked without preempting. Locks down the inactive-overlap branch that the existing `enable_prefix_caching=false` tests didn't reach. (per @PeaBrane)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
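The demand arithmetic from the new inactive-overlap test can be written out as a short check. This is a Python sketch of the comparison only: the `PrefillCost` dataclass stands in for the Rust struct and carries just the two fields named in this PR, and `fits` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class PrefillCost:
    """Stand-in for the Rust struct; only the fields used by the gate."""
    new_blocks: int               # blocks with no cached copy at all
    inactive_overlap_blocks: int  # cached blocks currently in the inactive pool


def fits(cost: PrefillCost, free_blocks: int) -> bool:
    """Full-ISL gate: reused inactive blocks also consume free capacity."""
    demand = cost.new_blocks + cost.inactive_overlap_blocks
    return demand <= free_blocks


# Numbers from the test description: 2 new + 3 inactive overlap vs 4 free.
cost = PrefillCost(new_blocks=2, inactive_overlap_blocks=3)
gate_passes = fits(cost, free_blocks=4)
naive_passes = cost.new_blocks <= 4  # the earlier check that ignored overlap
```

The naive comparison would admit the request (2 ≤ 4), while the tightened gate blocks it (5 > 4), which is the gap the reviewers asked the regression test to cover.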

nealvaidya · 2026-05-20T02:33:51Z

/ok to test 54733f4

nealvaidya requested a review from a team as a code owner May 19, 2026 00:27

pull-request-size Bot added the size/L label May 19, 2026

github-actions Bot added the fix label May 19, 2026

coderabbitai Bot reviewed May 19, 2026

nealvaidya force-pushed the nealv/fix-mocker-vllm-reserve-full-isl-9718 branch 2 times, most recently from 09d35f9 to 9828ca8 Compare May 19, 2026 00:38

copy-pr-bot Bot temporarily deployed to GITLAB May 19, 2026 01:01 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 19, 2026 04:12 Inactive

dreamtalen reviewed May 19, 2026

PeaBrane reviewed May 19, 2026

PeaBrane approved these changes May 19, 2026

nealvaidya and others added 4 commits May 19, 2026 12:06

style(mocker): cargo fmt

ea9d8fb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>

nealvaidya force-pushed the nealv/fix-mocker-vllm-reserve-full-isl-9718 branch from edfea90 to 54733f4 Compare May 19, 2026 19:40

copy-pr-bot Bot temporarily deployed to GITLAB May 20, 2026 02:33 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 20, 2026 03:41 Inactive

nealvaidya wants to merge 4 commits into main from nealv/fix-mocker-vllm-reserve-full-isl-9718
