fix(mocker): gate vLLM waiting-request admission on full ISL fit (#9718)#9721
fix(mocker): gate vLLM waiting-request admission on full ISL fit (#9718)#9721nealvaidya wants to merge 4 commits into
Conversation
WalkthroughThis PR implements a KV-capacity admission gate for waiting requests in the vLLM scheduler mocker. The new ChangesFull-ISL Admission Gate
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
lib/mocker/src/common/protocols.rs (1)
1014-1044: ⚡ Quick winAssert
scheduler_reserve_full_islin the JSON round-trip test.The payload now carries this field, but the test does not verify it survived deserialization, so regressions on this key can pass unnoticed.
Suggested test assertion
let restored = MockEngineArgs::from_json_str(&payload.to_string()).unwrap(); assert_eq!(restored.worker_type, WorkerType::Decode); assert_eq!(restored.max_num_seqs, None); assert_eq!(restored.max_num_batched_tokens, None); + assert_eq!( + restored.scheduler_reserve_full_isl, + args.scheduler_reserve_full_isl + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/mocker/src/common/protocols.rs` around lines 1014 - 1044, The JSON round-trip test constructs payload including "scheduler_reserve_full_isl" but never asserts it after deserialization; update the test that builds payload, calls MockEngineArgs::from_json_str, and compares fields (the variable restored of type MockEngineArgs) to include an assertion that restored.scheduler_reserve_full_isl equals args.scheduler_reserve_full_isl (or the expected value used in the test) so the key is verified through serialization/deserialization.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/mocker/src/common/protocols.rs`:
- Around line 1014-1044: The JSON round-trip test constructs payload including
"scheduler_reserve_full_isl" but never asserts it after deserialization; update
the test that builds payload, calls MockEngineArgs::from_json_str, and compares
fields (the variable restored of type MockEngineArgs) to include an assertion
that restored.scheduler_reserve_full_isl equals args.scheduler_reserve_full_isl
(or the expected value used in the test) so the key is verified through
serialization/deserialization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ec54f5c5-f169-40cb-9b99-790794cde860
📒 Files selected for processing (5)
lib/bindings/python/rust/llm/replay.rslib/bindings/python/src/dynamo/_core.pyilib/mocker/src/common/protocols.rslib/mocker/src/scheduler/vllm/core.rslib/mocker/src/scheduler/vllm/tests.rs
09d35f9 to
9828ca8
Compare
|
/ok to test 9828ca8 |
|
lib/bindings/python/rust/llm/replay.rs:136 — Adding |
|
/ok to test a0d481a |
dreamtalen
left a comment
There was a problem hiding this comment.
LGTM overall, thanks for the fix!
| // running requests are not evicted to make room for an admission | ||
| // that real vLLM would never have accepted. | ||
| if from_waiting && self.args.scheduler_reserve_full_isl && remaining_known_tokens > 0 { | ||
| let cost = self.kv_manager.get_prefill_cost(&request.sequence); |
There was a problem hiding this comment.
We have a get_prefill_cost call for this request above at L747. Could you refactor this a bit to avoid calling it twice?
| }; | ||
| if self.registered_blocks.contains_key(plh) { | ||
| overlap += 1; | ||
| inactive_overlap += 1; |
There was a problem hiding this comment.
Can we add a regression case that exercises this inactive-overlap branch? The new scheduler tests use enable_prefix_caching(false), so inactive_overlap_blocks stays 0 and the cost.new_blocks + cost.inactive_overlap_blocks admission check is only covered for fully cold prompts. A small prefix-caching test with an inactive cached prefix and tight capacity would lock down the vLLM parity behavior where touching an evictable cached block consumes free capacity.
There was a problem hiding this comment.
Approving on condition that @dreamtalen and my comments are addressed
Mirror vLLM's `scheduler_reserve_full_isl=True` default. Before admitting a waiting request, refuse if the full ISL (minus prefix-cache hits) cannot fit in currently free KV capacity. Returning `Blocked` here bypasses the shared preemption branch, so running requests are no longer evicted to make room for an admission that real vLLM would never accept. Previously the mocker's `schedule_request()` used a single path for running and waiting requests with a chunk-sized allocation target; a partial allocation for a fresh admission fell through to `state.preempt()`, producing preemption thrash on long-context multi-session replays (observed: 26+ min vs 32s on a Mooncake replay with max_num_seqs=32). The SGLang mocker path is unchanged — real SGLang's admission gate is unconditional, and the mocker's sglang prefill admit already mirrors it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Neal Vaidya <nealv@nvidia.com>
- Move `scheduler_reserve_full_isl` to the end of the Python `MockEngineArgs(...)` signature so positional callers of the existing kwargs aren't shifted. - Restore lazy `get_prefill_cost` computation: only call it when `num_computed_tokens == 0` (for `cached_prefix_tokens`) or when the waiting-admission gate actually fires. Removes the O(seq_blocks) scan that the unconditional call introduced on every running-decode pass. - Tighten the gate: include `inactive_overlap_blocks` in the demand. Reusing a cached block from the inactive pool promotes it from inactive to active and consumes free-pool capacity, so omitting it from the demand could let a request slip past the gate and still hit preemption on allocation. - Extend `PrefillCost` with `inactive_overlap_blocks` and split the overlap loop in `KvManager::get_prefill_cost` into active vs inactive. - Assert `scheduler_reserve_full_isl` in the JSON round-trip test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Neal Vaidya <nealv@nvidia.com>
- Compute `get_prefill_cost` exactly once per call. Hoist into an `Option<PrefillCost>` shared by the `cached_prefix_tokens` calc and the waiting-admission gate so a fresh-from-waiting admit no longer scans the sequence blocks twice. (per @dreamtalen) - Add `test_full_isl_gate_counts_inactive_overlap_blocks`: drives r1 to completion with prefix caching enabled so its blocks land in the inactive pool, then submits r2 sharing 12 prefix tokens with r1 plus 8 new tokens against a 2-block-pinned worker. Demand = 2 new + 3 inactive_overlap = 5, free = 4 → blocked without preempting. Locks down the inactive-overlap branch that the existing `enable_prefix_caching=false` tests didn't reach. (per @PeaBrane) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Neal Vaidya <nealv@nvidia.com>
edfea90 to
54733f4
Compare
|
/ok to test 54733f4 |
Summary
scheduler_reserve_full_isl: booltoMockEngineArgs(defaulttrue), mirroring vLLM's upstream default.lib/mocker/src/scheduler/vllm/core.rs), gates thefrom_waiting=trueadmission path on whether the full ISL (minus prefix-cache hits) fits in currently free KV capacity. ReturnsScheduleOutcome::Blockedbefore the shared preemption branch, so running requests are no longer evicted to make room for an admission real vLLM would have refused.lib/bindings/python/rust/llm/replay.rs) and the_core.pyistub.Fixes #9718.
Why
Real vLLM passes
full_sequence_must_fit=self.scheduler_reserve_full_islintokv_cache_manager.allocate_slots()on the waiting-request path (vllm/v1/core/sched/scheduler.py:721-740); onNoneitbreaks the waiting loop without preempting. The mocker had no equivalent gate —schedule_request()used a single path for running and waiting requests with a chunk-sized allocation target, so a partial allocation for a fresh admission fell through tostate.preempt(), producing preemption thrash on long-context multi-session replays (>26 min vs ~32s on a Mooncake replay).SGLang's path is unchanged: real SGLang's admission gate is unconditional (see
python/sglang/srt/managers/schedule_policy.py:add_one_req,if total_tokens >= self.rem_total_tokens: return AddReqResult.NO_TOKEN), and the mocker'slib/mocker/src/scheduler/sglang/prefill.rs:75-79already mirrors that unconditional check.What does vLLM actually do with a request that can't currently fit?
It defers admission, it does not reject. Two distinct mechanisms layered in vLLM, only one of which this PR mirrors:
Hard reject at request ingest (out of scope for this PR). vLLM rejects a request with
len(prompt_tokens) > max_model_lenat the API/engine layer with a 400. This is independent of the scheduler. The mocker is a simulator, not an engine, and doesn't model this path — requests are accepted regardless of size.Scheduler admission deferral (what this PR mirrors). Inside the waiting-request loop:
(
vllm/v1/core/sched/scheduler.py:721-740)The request was
peek_request'd (not popped) before this, so onbreakit stays at the head ofself.waiting. Next scheduler pass tries again. There is no time bound and no derived "this will never fit" rejection — vLLM trusts thatnum_gpu_blocksis sized to admit any prompt that passed the ingest check (~max_model_lentokens), so any well-formed request is admissible eventually.The all-or-nothing precheck inside
allocate_slotsis here:vllm/v1/core/kv_cache_manager.py:346-360. And the default flag value is here:vllm/config/scheduler.py:140.The mocker behavior with this PR matches mechanism #2: a request whose full ISL doesn't fit stays in
waiting, the gate returnsScheduleOutcome::Blockedfor that pass, no preempt fires, and the next pass retries.Test plan
cargo test -p dynamo-mocker --lib— 252 passed (2 new tests below + all existing).cargo clippy -p dynamo-mocker --all-targets --no-deps -- -D warnings— clean.cargo fmt --all -- --check— clean.test_full_isl_gate_blocks_admission_without_preempt: regression for bug(mocker): vLLM scheduler over-admits chunked-prefill requests under KV pressure (missing scheduler_reserve_full_isl semantics) #9718 — r3 (12-token prompt, needs 3 blocks) with 2 free blocks left after r1+r2 are running. With the default gate enabled: r3 stays inwaiting, r2'snum_preemptions == 0.test_full_isl_gate_disabled_falls_back_to_preempt: same setup withscheduler_reserve_full_isl(false)— r2 is preempted, verifying the opt-out preserves legacy behavior.test_preemption_requeues_newest_running_request,test_preemption_recompute_events_apply_cleanly) explicitly opt out via.scheduler_reserve_full_isl(false)to keep exercising LIFO/KV-removal event paths.scheduler_reserve_full_islsurvives serde.lib/bindings/pythonblocked by pre-existing Linux-only deps (O_DIRECT, NUMA,fallocate) — relying on Linux CI for full verification.🤖 Generated with Claude Code