feat(mocker): KVBM G2 offload for on/offline replay#8184
Conversation
88aabb6 to
b7c678b
Compare
WalkthroughThis PR adds KVBM (KV Block Manager) G1↔G2 offload functionality to simulate hierarchical KV cache memory with configurable parameters, transfer delays, and both live async and offline replay modes. Three new configuration arguments are introduced, accompanied by corresponding Rust and Python binding updates, a new KVBM orchestration module, and integration into the scheduler and KV manager systems. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
lib/kvbm-logical/src/manager/mod.rs (1)
264-268: Consider a batched inactive-pool existence API to reduce per-hash overhead.
has_blockscurrently performs oneinactive_pool.has_blockcall per hash. If this path is hot, a single batched lookup inInactivePoolcan reduce lock churn and improve offline replay throughput.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/kvbm-logical/src/manager/mod.rs` around lines 264 - 268, has_blocks currently calls InactivePool::has_block in a loop, causing per-hash lock overhead; add a batched existence API on InactivePool (e.g., InactivePool::has_blocks or has_many that takes &[SequenceHash] and returns Vec<bool> or a HashSet of present hashes), implement the internal lookup under a single lock/scan to reduce churn, then modify Manager::has_blocks to call the new batched method (keeping the public signature of Manager::has_blocks) so callers get the same Vec<bool> while benefiting from the single-shot lookup; ensure tests covering both single and multiple hashes are updated accordingly.lib/mocker/src/kv_manager/vllm_backend.rs (1)
152-182: LGTM!The
complete_ready_offloadsmethod correctly iterates pending offloads and completes those whose deadline has arrived. The Arc clone is lightweight (just refcount increment).Optional: Minor simplification opportunity
The drain + collect pattern could be simplified using
retain:-let mut still_pending = Vec::new(); -for offload in self.pending_offloads.drain(..) { - if now_ms >= offload.complete_at_ms { - engine.complete_offload(offload.block_id, offload.seq_hash); - completed += 1; - } else { - still_pending.push(offload); - } -} -self.pending_offloads = still_pending; +self.pending_offloads.retain(|offload| { + if now_ms >= offload.complete_at_ms { + engine.complete_offload(offload.block_id, offload.seq_hash); + completed += 1; + false + } else { + true + } +});However, this requires
completedto be accessible in the closure (via a Cell or moving the counter). The current approach is clear and works correctly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/mocker/src/kv_manager/vllm_backend.rs` around lines 152 - 182, complete_ready_offloads currently drains pending_offloads and rebuilds a vector; you can simplify by using Vec::retain to keep items whose complete_at_ms is in the future and call engine.complete_offload for items being completed, while tracking the count via a Cell/AtomicUsize captured in the closure; specifically, in complete_ready_offloads use &self.offload_engine (clone Arc as needed), call retain on self.pending_offloads and inside the closure check now_ms >= offload.complete_at_ms to call engine.complete_offload(offload.block_id, offload.seq_hash) and increment the counter, then use the counter for the tracing::debug call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/mocker/src/kv_manager/kvbm_offload.rs`:
- Around line 38-40: Offline replay is ignoring
KvbmOffloadConfig.offload_batch_size, causing virtual evictions to always use
transfer_delay_ms(1); propagate offload_batch_size into the sync engine setup
(the code path that builds the SyncEngine/virtual eviction) and use it to
compute batched transfer latency when marking N evicted blocks ready: compute
number_of_batches = ceil(evicted_count / offload_batch_size) and apply
transfer_delay_ms = per_batch_transfer_ms * number_of_batches (or equivalent
batching formula used by the live KVBM pipeline) instead of using a fixed 1 ms;
update the SyncEngine construction/site that currently hardcodes
transfer_delay_ms(1) to accept and use offload_batch_size from
KvbmOffloadConfig.
In `@lib/mocker/src/scheduler/mod.rs`:
- Around line 153-182: The init_kvbm_offline function currently ignores
num_g2_blocks > 0 for non-Vllm engines (Sglang), making invalid KVBM configs
silently accepted; change init_kvbm_offline to fail fast instead of no-op:
update init_kvbm_offline signature to return Result<(), E> (or propagate an
existing error type), check early if args.num_g2_blocks > 0 and match self — if
Self::Vllm proceed as before, but if Self::Sglang return Err (or panic if you
prefer) with a clear message ("KVBM config requires Vllm engine; found Sglang")
so callers/config normalization will catch the invalid config. Ensure references
to init_kvbm_offline, Self::Vllm, Self::Sglang, and args.num_g2_blocks are
updated where this function is called.
---
Nitpick comments:
In `@lib/kvbm-logical/src/manager/mod.rs`:
- Around line 264-268: has_blocks currently calls InactivePool::has_block in a
loop, causing per-hash lock overhead; add a batched existence API on
InactivePool (e.g., InactivePool::has_blocks or has_many that takes
&[SequenceHash] and returns Vec<bool> or a HashSet of present hashes), implement
the internal lookup under a single lock/scan to reduce churn, then modify
Manager::has_blocks to call the new batched method (keeping the public signature
of Manager::has_blocks) so callers get the same Vec<bool> while benefiting from
the single-shot lookup; ensure tests covering both single and multiple hashes
are updated accordingly.
In `@lib/mocker/src/kv_manager/vllm_backend.rs`:
- Around line 152-182: complete_ready_offloads currently drains pending_offloads
and rebuilds a vector; you can simplify by using Vec::retain to keep items whose
complete_at_ms is in the future and call engine.complete_offload for items being
completed, while tracking the count via a Cell/AtomicUsize captured in the
closure; specifically, in complete_ready_offloads use &self.offload_engine
(clone Arc as needed), call retain on self.pending_offloads and inside the
closure check now_ms >= offload.complete_at_ms to call
engine.complete_offload(offload.block_id, offload.seq_hash) and increment the
counter, then use the counter for the tracing::debug call.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c4ea5b97-1fa7-4769-8544-7119c3e31de6
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.locklib/bindings/python/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (18)
components/src/dynamo/mocker/args.pycomponents/src/dynamo/mocker/config.pycomponents/src/dynamo/mocker/tests/unit/test_config.pylib/bindings/python/Cargo.tomllib/bindings/python/rust/llm/replay.rslib/bindings/python/src/dynamo/_core.pyilib/kvbm-logical/src/manager/mod.rslib/mocker/Cargo.tomllib/mocker/src/common/protocols.rslib/mocker/src/kv_manager/kvbm_offload.rslib/mocker/src/kv_manager/mod.rslib/mocker/src/kv_manager/vllm_backend.rslib/mocker/src/replay/offline/core.rslib/mocker/src/replay/offline/state.rslib/mocker/src/scheduler/mod.rslib/mocker/src/scheduler/vllm/core.rslib/mocker/src/scheduler/vllm/live.rslib/mocker/src/scheduler/vllm/tests.rs
b7c678b to
7568efc
Compare
|
would want @jthomson04 to have a look as well to see if there's any interaction with his remote indexing work |
There was a problem hiding this comment.
High-level review pass — the diff is a bit hard to read in its current shape, and I'd like the refactors below landed first before I do a more comprehensive review of the featural / logical bits (correctness, causality, test coverage). The architecture itself looks sound; most of the friction is readability and a few parallel-method pairs that double the surface area without adding expressiveness.
-
Extract
KvManager::try_batch_swap_infromlib/mocker/src/kv_manager/vllm_backend.rs:295–431. Return an enum like{ NoHits, Scheduled { allocated, defer } }soprocess()becomes linear again. This is the biggest readability win on the branch. -
Extract
contiguous_g2_prefix_hits(remaining, batch_results)as a free pure function fromvllm_backend.rs:341–354. Thebatch_idx/FullBlock/ first-miss-break logic deserves its own unit test. -
KvbmOffloadConfig::from_args(&MockEngineArgs) -> Option<Self>. Dedupslive.rs:125–133andscheduler/mod.rs:168–174. Both sites reconstruct the same config with the sameblock_size * bptboilerplate. -
Collapse the async/virtual method pairs on
MockOffloadEngine(lib/mocker/src/kv_manager/kvbm_offload.rs):enqueue_g1_eviction(bid, sh, now_ms)— one method; branch onself.offload_engine.is_some()internally.start_swap_in(num_blocks, now_ms)— same.- Merge
MockWorker::transfer_delayandMockOffloadEngine::transfer_delay_msinto one helper (one returnsDuration, the otherf64ms — gratuitous). SwapInHandle::is_complete(now_ms)— single method, live ignoresnow_ms. Kills the two panic paths.
Cuts the public surface roughly in half and removes the "which mode am I in?" cognitive load at every call site.
-
Hoist virtual-time bookkeeping onto
MockOffloadEngine. CurrentlyKvManagerownspending_offloads+drain_pending_offloads+pending_offload_deadlines+complete_ready_offloads+virtual_time. These are all engine concerns, not cache concerns.Shape:
engine.record_eviction(bid, sh, now_ms)— does the virtual-time branch internally.engine.tick(now_ms)— called at pass start; replacescomplete_ready_offloads.engine.earliest_pending_deadline() -> Option<f64>— feeds the stall-advance incore.rs:471–480.
Payoff:
KvManagerstops knowing about virtual time entirely for offloads. Thevirtual_time: boolflag either moves onto the engine or disappears (inferred fromoffload_enginebeing sync vs async). Removes ~50 lines of#[cfg]fields and methods fromKvManager. -
(Optional extension of #5) Move
pending_swap_insoffVllmCoreonto the engine too, withengine.tick(now_ms) -> Vec<PromotionReady { uuid, reused_input_tokens }>. Completes the story — all KVBM state lives in one place.VllmCorejust iterates the returned promotions, doesprepend_waiting, and reports admits. Worth it only if #5 alone still leaves too much KVBM logic inVllmCore.
|
@PeaBrane thanks for the feedback! Makes sense, will ping once refactor is completed |
There was a problem hiding this comment.
Note for posterity — per-worker virtual time only holds because G2 is per-worker
Writing this down explicitly so a future reader (human or AI) who extends this to shared storage doesn't trip on it. (We should also probably comment / doc this out briefly somewhere in the code if not already)
Current situation — fine as-is. The offline virtual-time machinery is entirely contained inside one worker: pending_offloads, pending_swap_ins, and BlockManager<G2> all live on that worker's KvManager. No PendingOffload deadline ever needs to be visible to another worker. This works because G2 is modeled as per-worker host memory (each KvManager owns its own BlockManager<G2> sized by num_g2_blocks), so no worker ever needs to query another worker's G2 state. The only externally observable effects — token completion timestamps in the trace and the synthetic Stored/Removed events going to the router — are already routed through existing per-worker pumps (TraceCollector, EnginePassResult.kv_events) and don't require cross-worker coordination. The router tracks per-worker radix trees, so each worker announces its own tier state independently.
This assumption breaks once shared storage is introduced. If someone later adds a G3 tier that's a genuinely shared pool (CXL fabric, RDMA host-memory pool, shared NVMe, NDS-style global cache), worker B at virtual time t will need to be able to observe "block X landed in G3 at t' ≤ t because worker A offloaded it." At that point the pending-completion queue cannot remain a private Vec<PendingOffload> on one KvManager — it has to move up to a shared structure indexed by virtual time. The natural shape is:
- A single
BlockManager<G3>(or equivalent shared map) owned by the offline harness, not per-worker. - A global virtual-time event queue of G3 operations keyed by
complete_at_ms. Workers append on evict; workers drain up tonow_msat pass-start. - G3
find_in_tiersbecomes a query against the shared state. "Is this block ready yet?" is answered by whether the global queue has advanced past the block'scomplete_at_ms.
Architecturally this would look much more like the KV event pump looks today (a shared, virtual-time-ordered stream of tier mutations) than like the current per-worker G2 plumbing. A clean way to hook it in: extend EnginePassResult with something like tier_events: Vec<TierEvent { tier, op, block, complete_at_ms }>, have the offline coordinator merge those into a global priority queue, and re-dispatch completions at the right virtual time. Same pattern as kv_events, just with a different consumer (a shared tier manager instead of the router).
Bottom line: nothing to do here. The per-worker containment is correct for the current model and clean. But we cannot assume it still works once G3 shared storage is added — at that point the virtual-time offload bookkeeping needs to be refactored onto whatever cross-worker time-ordered machinery the harness has, which today is effectively only the router event pump but would need to be generalized.
cc @ryanolson if you have any idea on parallelization (pdes) over shared block usage / transfer
There was a problem hiding this comment.
A couple concerns here in terms of correctness.
Offline-mode logic mirrors kvbm internals by hand
build_sync() bypasses OffloadEngine, InstanceLeader, and PipelineBuilder, reimplementing three contracts that will silently drift if kvbm changes:
- complete_offload mirrors TransferExecutor's post-transfer sequence (allocate_blocks → stage → register_block → drop).
- scan_matches vs match_blocks — relies on a specific semantic difference between two closely-related kvbm APIs.
- offload_batch_size is inert offline — delays are computed per single block.
- No CI test runs the same trace through live + offline and asserts equivalence, so drift would pass unnoticed.
No bandwidth contention
transfer_delay = bytes / bandwidth_gbps is computed per transfer with no shared-resource state. Concurrent offloads and swap-ins all get full peak bandwidth; bursty evictions finish as fast as a single one. Under-estimates TTFT under offload pressure.
GPU slot freed before the offload completes
release_block_id returns the slot to block_id_pool immediately on eviction — long before the simulated host transfer might be finished. The new allocator pulls the same slot right back while the "transfer" is still in flight. Effective G1 capacity is inflated, and the scheduler can admit work that a real system wouldn't. Impacts scheduling decisions, not just timing.
Some final thoughts
On benchmarks with high kv pressure or long context, results from offline replay will likely be radically different than reality. This current approach will also make it very difficult to integrate offline replay with G3.
|
@jthomson04 thanks for the reviews. I'm refactoring with KVBM-logical as the G1 manager, which should unify some offline paths. Will ping you when ready. |
ed993d8 to
46fd899
Compare
|
Hi @PeaBrane @jthomson04, I pushed a large refactor based on your feedback. The two biggest changes are:
This should address most of the previous concerns, but the change is now chunky. Would you mind doing a high-level architecture/readability pass first? I’m also happy to do a quick walkthrough if that’s easier. |
46fd899 to
44b0516
Compare
|
@dreamtalen can you put this PR back to review and trigger the CIs if needed |
|
/ok to test 44b0516 |
PeaBrane
left a comment
There was a problem hiding this comment.
Approving this iteration. A few follow-ups I would like tracked:
-
Please hook G2 KV events into the router/storage-tier event protocol. When blocks land in G2, emit HostPinned-tier Stored events; when they leave G2, emit the matching lower-tier Removed events. This can be a separate PR.
-
In the new transfer hot path, consider using FxHashMap for the TransferId-keyed maps instead of std::HashMap.
-
cc @rolson for visibility on the network/bandwidth modeling bits; worth coordinating after the velo network math pieces are refactored.
419c4e3 to
79a4777
Compare
|
/ok to test 79a4777 |
Signed-off-by: Yongming Ding <yongmingd@nvidia.com>
Signed-off-by: Yongming Ding <yongmingd@nvidia.com>
79a4777 to
23b4fed
Compare
|
/ok to test 23b4fed |
Signed-off-by: Yongming Ding <yongmingd@nvidia.com>
Overview:
Absorbed #8033
This PR adds optional KVBM-backed G1↔G2 offload simulation for the vLLM mocker, for both online/offline replay.
The current shape intentionally uses the same in-process
kvbm-enginestack in both modes:OffloadEngine+InstanceLeader+PipelineBuilder+ a mockWorker.Live mode drives the offload engine with wall-clock time. Offline replay drives the same hot path with replay virtual time.
Details:
This PR introduces a
kvbm-offloadfeature ondynamo-mockerand exposes it to Python asmocker-kvbm-offload.Main pieces:
lib/mocker/src/kvbm_offload/engine.rskvbm-engine::OffloadEngineandInstanceLeader.lib/mocker/src/kvbm_offload/worker.rskvbm-engineworker traits without moving real memory.lib/mocker/src/kvbm_offload/bandwidth_sharing_model.rslib/mocker/src/kv_manager/kvbm_backend.rslib/mocker/src/scheduler/vllm/core.rslib/kvbm-engine/src/offload/*Use example:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Relates to #8190, #6383
Summary by CodeRabbit
Release Notes
New Features
Tests