feat(mocker): KVBM G2 offload for on/offline replay by dreamtalen · Pull Request #8184 · ai-dynamo/dynamo

dreamtalen · 2026-04-14T19:42:00Z

Overview:

Absorbs #8033.
This PR adds optional KVBM-backed G1↔G2 offload simulation to the vLLM mocker, for both online and offline replay.

The current shape intentionally uses the same in-process kvbm-engine stack in both modes:
OffloadEngine + InstanceLeader + PipelineBuilder + a mock Worker.

Live mode drives the offload engine with wall-clock time. Offline replay drives the same hot path with replay virtual time.

Details:

This PR introduces a kvbm-offload feature on dynamo-mocker and exposes it to Python as mocker-kvbm-offload.

Main pieces:

lib/mocker/src/kvbm_offload/engine.rs
- Builds an in-process kvbm-engine::OffloadEngine and InstanceLeader.
lib/mocker/src/kvbm_offload/worker.rs
- Implements kvbm-engine worker traits without moving real memory.
lib/mocker/src/kvbm_offload/bandwidth_sharing_model.rs
- Deterministic processor-sharing bandwidth model.
- Concurrent transfers on the same link share throughput.
lib/mocker/src/kv_manager/kvbm_backend.rs
- G2→G1 swap-in now reserves destination G1 slots before starting transfer bandwidth reservation.
lib/mocker/src/scheduler/vllm/core.rs
- Ticks the offload engine at pass start.
- Parks requests waiting on G2→G1 swap-in and promotes them once the handle completes.
lib/kvbm-engine/src/offload/*
- Adds small support hooks needed by the mock worker path: queue notification instead of fixed polling.
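
The processor-sharing idea listed under bandwidth_sharing_model.rs can be illustrated with a minimal sketch. This is not the code from the PR: `SharedLink`, `Transfer`, and `advance_to` are hypothetical names, and the model assumes every active transfer on a link gets an equal share of the link bandwidth, recomputed whenever a transfer finishes.

```rust
/// One in-flight transfer on a shared link (hypothetical type, for illustration only).
#[derive(Debug, Clone)]
struct Transfer {
    id: u64,
    remaining_bytes: f64,
}

/// A link whose bandwidth is split equally among all active transfers.
struct SharedLink {
    bytes_per_ms: f64,
    now_ms: f64,
    active: Vec<Transfer>,
}

impl SharedLink {
    fn new(bytes_per_ms: f64) -> Self {
        Self { bytes_per_ms, now_ms: 0.0, active: Vec::new() }
    }

    /// Register a new transfer starting at the link's current time.
    fn start(&mut self, id: u64, bytes: f64) {
        self.active.push(Transfer { id, remaining_bytes: bytes });
    }

    /// Advance the link to `target_ms` and return `(id, completion_ms)` pairs
    /// in completion order. The same inputs always give the same outputs.
    fn advance_to(&mut self, target_ms: f64) -> Vec<(u64, f64)> {
        const EPS_BYTES: f64 = 1e-6;
        let mut done = Vec::new();
        while !self.active.is_empty() && self.now_ms < target_ms {
            // Equal share for every active transfer.
            let rate = self.bytes_per_ms / self.active.len() as f64;
            // The smallest remaining transfer decides when the share changes next.
            let min_remaining = self
                .active
                .iter()
                .map(|t| t.remaining_bytes)
                .fold(f64::INFINITY, f64::min);
            let step = (min_remaining / rate).min(target_ms - self.now_ms);
            for t in &mut self.active {
                t.remaining_bytes -= rate * step;
            }
            self.now_ms += step;
            let now = self.now_ms;
            self.active.retain(|t| {
                if t.remaining_bytes <= EPS_BYTES {
                    done.push((t.id, now));
                    false
                } else {
                    true
                }
            });
        }
        self.now_ms = self.now_ms.max(target_ms);
        done
    }
}
```

With a 100 bytes/ms link, a 1000-byte and a 2000-byte transfer started together share the link until the first finishes, after which the second gets the full rate. A model with no contention would finish the first transfer in half the time.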

Usage example:

cd lib/bindings/python
maturin develop --features mocker-kvbm-offload --uv --release

python3 -m dynamo.replay mooncake_trace_1000.jsonl \
  --replay-mode offline \
  --num-workers 1 \
  --trace-block-size 512 \
  --extra-engine-args '{"num_g2_blocks":10000,"num_gpu_blocks":8192,"kv_bytes_per_token":131072}'

Where should the reviewer start?

lib/mocker/src/kvbm_offload/*
lib/mocker/src/kv_manager/kvbm_backend.rs
lib/mocker/src/scheduler/vllm/core.rs
lib/kvbm-engine/src/offload/*

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to #8190, #6383

Summary by CodeRabbit

Release Notes

New Features
- Added KV cache offload support for multi-tier memory with configurable parameters: number of G2 blocks, offload batch size, and bandwidth limits.
- Introduced virtual-time replay mode for offline KV cache offload simulation.
Tests
- Updated unit tests to validate new KV cache offload configuration parameters.

copy-pr-bot · 2026-04-14T19:42:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-14T21:07:11Z

Walkthrough

This PR adds KVBM (KV Block Manager) G1↔G2 offload functionality to simulate hierarchical KV cache memory with configurable parameters, transfer delays, and both live async and offline replay modes. Three new configuration arguments are introduced, accompanied by corresponding Rust and Python binding updates, a new KVBM orchestration module, and integration into the scheduler and KV manager systems.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CLI & Configuration Setup: `components/src/dynamo/mocker/args.py`, `components/src/dynamo/mocker/config.py`, `components/src/dynamo/mocker/tests/unit/test_config.py` | Added three new CLI arguments (`--num-g2-blocks`, `--kvbm-offload-batch-size`, `--kvbm-bandwidth-g1-g2`) with defaults and corresponding config builder updates; unit test extended to validate JSON payload includes new fields. |
| Cargo Features & Core Types: `lib/bindings/python/Cargo.toml`, `lib/mocker/Cargo.toml`, `lib/mocker/src/common/protocols.rs` | Added `mocker-kvbm` and `kvbm` Cargo features, declared optional KVBM dependencies (kvbm-engine, kvbm-logical, kvbm-physical, velo, futures), and extended `MockEngineArgs` struct with three new fields and JSON parsing logic. |
| Python Bindings: `lib/bindings/python/rust/llm/replay.rs`, `lib/bindings/python/src/dynamo/_core.pyi` | Updated `MockEngineArgs` Python constructor signature and type stubs to accept three new parameters; updated `dump_json()` to include serialized KVBM configuration fields. |
| KVBM Block Manager: `lib/kvbm-logical/src/manager/mod.rs` | Added public `has_blocks()` method for non-destructive hash-to-existence-check queries against the inactive pool. |
| KV Manager KVBM Integration: `lib/mocker/src/kv_manager/mod.rs`, `lib/mocker/src/kv_manager/vllm_backend.rs` | Added conditional `kvbm_offload` module export; extended `KvManager` with offload engine state, batch slot tracking, virtual-time support, and updated `process()` signature to accept `now_ms` parameter for time-aware event handling and offload completion scheduling. |
| KVBM Offload Engine: `lib/mocker/src/kv_manager/kvbm_offload.rs` | New 776-line module implementing `KvbmOffloadConfig`, `MockWorker`, and `MockOffloadEngine` with dual build paths (async/live vs. sync/offline replay), transfer delay simulation based on bandwidth, batch scheduling, and G2 presence queries via `InstanceLeader` or direct `BlockManager` access. |
| Scheduler Integration: `lib/mocker/src/scheduler/mod.rs`, `lib/mocker/src/scheduler/vllm/core.rs`, `lib/mocker/src/scheduler/vllm/live.rs` | Added `init_kvbm_offload()` initialization method on `EngineCore`; extended `VllmCore` with pending swap-in tracking, offload engine forwarding, and updated `execute_pass_internal` to poll swap-in completion, advance virtual time, and propagate `now_ms` through all KV event processing; live scheduler now asynchronously initializes offload engine on startup. |
| Replay System Updates: `lib/mocker/src/replay/offline/core.rs`, `lib/mocker/src/replay/offline/state.rs`, `lib/mocker/src/scheduler/vllm/tests.rs` | Updated `ReplayWorkerCore` and `OfflineWorkerState` to clone `args` and conditionally call `init_kvbm_offline()`; updated test calls to `KvManager.process()` to include new `now_ms` parameter. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.37% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The PR description follows the template structure with all required sections (Overview, Details, Where should the reviewer start, Related Issues) completed and substantive content provided. |
| Title check | ✅ Passed | The PR title 'feat(mocker): KVBM G2 offload for on/offline replay' accurately summarizes the main change - adding KVBM G2 offload functionality for both online and offline replay modes in the mocker component. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

lib/kvbm-logical/src/manager/mod.rs (1)

264-268: Consider a batched inactive-pool existence API to reduce per-hash overhead.

has_blocks currently performs one inactive_pool.has_block call per hash. If this path is hot, a single batched lookup in InactivePool can reduce lock churn and improve offline replay throughput.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/kvbm-logical/src/manager/mod.rs` around lines 264 - 268, has_blocks
currently calls InactivePool::has_block in a loop, causing per-hash lock
overhead; add a batched existence API on InactivePool (e.g.,
InactivePool::has_blocks or has_many that takes &[SequenceHash] and returns
Vec<bool> or a HashSet of present hashes), implement the internal lookup under a
single lock/scan to reduce churn, then modify Manager::has_blocks to call the
new batched method (keeping the public signature of Manager::has_blocks) so
callers get the same Vec<bool> while benefiting from the single-shot lookup;
ensure tests covering both single and multiple hashes are updated accordingly.
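
A minimal sketch of the batched lookup suggested above, assuming the inactive pool is a hash set behind a single mutex. The real `InactivePool` layout is not shown in this thread, so the field, the `SequenceHash` alias, and the constructor are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Assumption: sequence hashes are plain integers in this sketch.
type SequenceHash = u64;

/// Stand-in for the inactive pool (hypothetical layout).
struct InactivePool {
    blocks: Mutex<HashSet<SequenceHash>>,
}

impl InactivePool {
    fn new(hashes: impl IntoIterator<Item = SequenceHash>) -> Self {
        Self { blocks: Mutex::new(hashes.into_iter().collect()) }
    }

    /// Per-hash lookup: takes the lock once per call.
    fn has_block(&self, hash: SequenceHash) -> bool {
        self.blocks.lock().unwrap().contains(&hash)
    }

    /// Batched lookup: takes the lock once for the whole slice and keeps
    /// the same Vec<bool> shape, one entry per input hash.
    fn has_blocks(&self, hashes: &[SequenceHash]) -> Vec<bool> {
        let guard = self.blocks.lock().unwrap();
        hashes.iter().map(|h| guard.contains(h)).collect()
    }
}
```

Both methods return the same answers; the batched form only changes how often the lock is acquired.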

lib/mocker/src/kv_manager/vllm_backend.rs (1)

152-182: LGTM!

The complete_ready_offloads method correctly iterates pending offloads and completes those whose deadline has arrived. The Arc clone is lightweight (just refcount increment).

Optional: Minor simplification opportunity

The drain + collect pattern could be simplified using retain:

-let mut still_pending = Vec::new();
-for offload in self.pending_offloads.drain(..) {
-    if now_ms >= offload.complete_at_ms {
-        engine.complete_offload(offload.block_id, offload.seq_hash);
-        completed += 1;
-    } else {
-        still_pending.push(offload);
-    }
-}
-self.pending_offloads = still_pending;
+self.pending_offloads.retain(|offload| {
+    if now_ms >= offload.complete_at_ms {
+        engine.complete_offload(offload.block_id, offload.seq_hash);
+        completed += 1;
+        false
+    } else {
+        true
+    }
+});

However, this requires completed to be accessible in the closure (via a Cell or moving the counter). The current approach is clear and works correctly.
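
As a standalone illustration of the retain variant: `Vec::retain` accepts an `FnMut` closure, so when the closure captures only local variables a plain mutable counter compiles. Whether a `Cell` is needed in the actual method depends on what else the closure has to borrow there. The `Engine` and `PendingOffload` types below are simplified stand-ins, not the ones from the PR.

```rust
/// Simplified stand-in for a queued offload.
struct PendingOffload {
    block_id: u32,
    seq_hash: u64,
    complete_at_ms: f64,
}

/// Simplified stand-in for the offload engine; it only records completions.
#[derive(Default)]
struct Engine {
    completed: Vec<(u32, u64)>,
}

impl Engine {
    fn complete_offload(&mut self, block_id: u32, seq_hash: u64) {
        self.completed.push((block_id, seq_hash));
    }
}

/// Completes every offload whose deadline has passed and returns how many were completed.
fn complete_ready_offloads(
    pending: &mut Vec<PendingOffload>,
    engine: &mut Engine,
    now_ms: f64,
) -> usize {
    let mut completed = 0;
    // retain keeps the elements for which the closure returns true,
    // preserving the order of the remaining elements.
    pending.retain(|offload| {
        if now_ms >= offload.complete_at_ms {
            engine.complete_offload(offload.block_id, offload.seq_hash);
            completed += 1;
            false
        } else {
            true
        }
    });
    completed
}
```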

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/mocker/src/kv_manager/vllm_backend.rs` around lines 152 - 182,
complete_ready_offloads currently drains pending_offloads and rebuilds a vector;
you can simplify by using Vec::retain to keep items whose complete_at_ms is in
the future and call engine.complete_offload for items being completed, while
tracking the count via a Cell/AtomicUsize captured in the closure; specifically,
in complete_ready_offloads use &self.offload_engine (clone Arc as needed), call
retain on self.pending_offloads and inside the closure check now_ms >=
offload.complete_at_ms to call engine.complete_offload(offload.block_id,
offload.seq_hash) and increment the counter, then use the counter for the
tracing::debug call.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/mocker/src/kv_manager/kvbm_offload.rs`:
- Around line 38-40: Offline replay is ignoring
KvbmOffloadConfig.offload_batch_size, causing virtual evictions to always use
transfer_delay_ms(1); propagate offload_batch_size into the sync engine setup
(the code path that builds the SyncEngine/virtual eviction) and use it to
compute batched transfer latency when marking N evicted blocks ready: compute
number_of_batches = ceil(evicted_count / offload_batch_size) and apply
transfer_delay_ms = per_batch_transfer_ms * number_of_batches (or equivalent
batching formula used by the live KVBM pipeline) instead of using a fixed 1 ms;
update the SyncEngine construction/site that currently hardcodes
transfer_delay_ms(1) to accept and use offload_batch_size from
KvbmOffloadConfig.
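
The batching formula in the comment above, written out as a small helper. This is a sketch: `batched_transfer_delay_ms` and `per_batch_transfer_ms` are hypothetical names, and treating a batch size of zero as one is an assumption made here to avoid a division by zero.

```rust
/// Delay for evicting `evicted_count` blocks when transfers are issued in
/// batches of `offload_batch_size`, each batch costing `per_batch_transfer_ms`.
fn batched_transfer_delay_ms(
    evicted_count: usize,
    offload_batch_size: usize,
    per_batch_transfer_ms: f64,
) -> f64 {
    if evicted_count == 0 {
        return 0.0;
    }
    // Assumption: a batch size of 0 is treated as 1.
    let batch = offload_batch_size.max(1);
    // Integer ceiling division: ceil(evicted_count / batch).
    let number_of_batches = (evicted_count + batch - 1) / batch;
    number_of_batches as f64 * per_batch_transfer_ms
}
```

For example, 10 evicted blocks with a batch size of 4 need 3 batches, so the delay is three times the per-batch cost rather than a fixed value.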

In `@lib/mocker/src/scheduler/mod.rs`:
- Around line 153-182: The init_kvbm_offline function currently ignores
num_g2_blocks > 0 for non-Vllm engines (Sglang), making invalid KVBM configs
silently accepted; change init_kvbm_offline to fail fast instead of no-op:
update init_kvbm_offline signature to return Result<(), E> (or propagate an
existing error type), check early if args.num_g2_blocks > 0 and match self — if
Self::Vllm proceed as before, but if Self::Sglang return Err (or panic if you
prefer) with a clear message ("KVBM config requires Vllm engine; found Sglang")
so callers/config normalization will catch the invalid config. Ensure references
to init_kvbm_offline, Self::Vllm, Self::Sglang, and args.num_g2_blocks are
updated where this function is called.
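
A minimal sketch of the fail-fast shape described above. The enum variants and `num_g2_blocks` come from the comment; the error type, the `offload_enabled` flag, and the struct layout are hypothetical simplifications.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum KvbmInitError {
    UnsupportedEngine(&'static str),
}

/// Only the field relevant to this check.
struct MockEngineArgs {
    num_g2_blocks: usize,
}

/// Simplified stand-in for the scheduler's engine enum.
enum EngineCore {
    Vllm { offload_enabled: bool },
    Sglang,
}

impl EngineCore {
    fn init_kvbm_offline(&mut self, args: &MockEngineArgs) -> Result<(), KvbmInitError> {
        // No G2 tier requested: nothing to initialize for any engine.
        if args.num_g2_blocks == 0 {
            return Ok(());
        }
        match self {
            Self::Vllm { offload_enabled } => {
                *offload_enabled = true;
                Ok(())
            }
            // Reject the config instead of silently ignoring it.
            Self::Sglang => Err(KvbmInitError::UnsupportedEngine(
                "KVBM config requires Vllm engine; found Sglang",
            )),
        }
    }
}
```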

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c4ea5b97-1fa7-4769-8544-7119c3e31de6

📥 Commits

Reviewing files that changed from the base of the PR and between 07c7cc8 and b7c678b.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
lib/bindings/python/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (18)

components/src/dynamo/mocker/args.py
components/src/dynamo/mocker/config.py
components/src/dynamo/mocker/tests/unit/test_config.py
lib/bindings/python/Cargo.toml
lib/bindings/python/rust/llm/replay.rs
lib/bindings/python/src/dynamo/_core.pyi
lib/kvbm-logical/src/manager/mod.rs
lib/mocker/Cargo.toml
lib/mocker/src/common/protocols.rs
lib/mocker/src/kv_manager/kvbm_offload.rs
lib/mocker/src/kv_manager/mod.rs
lib/mocker/src/kv_manager/vllm_backend.rs
lib/mocker/src/replay/offline/core.rs
lib/mocker/src/replay/offline/state.rs
lib/mocker/src/scheduler/mod.rs
lib/mocker/src/scheduler/vllm/core.rs
lib/mocker/src/scheduler/vllm/live.rs
lib/mocker/src/scheduler/vllm/tests.rs

PeaBrane · 2026-04-20T17:09:48Z

would want @jthomson04 to have a look as well to see if there's any interaction with his remote indexing work

PeaBrane

High-level review pass — the diff is a bit hard to read in its current shape, and I'd like the refactors below landed first before I do a more comprehensive review of the featural / logical bits (correctness, causality, test coverage). The architecture itself looks sound; most of the friction is readability and a few parallel-method pairs that double the surface area without adding expressiveness.

1. Extract KvManager::try_batch_swap_in from lib/mocker/src/kv_manager/vllm_backend.rs:295–431. Return an enum like { NoHits, Scheduled { allocated, defer } } so process() becomes linear again. This is the biggest readability win on the branch.
2. Extract contiguous_g2_prefix_hits(remaining, batch_results) as a free pure function from vllm_backend.rs:341–354. The batch_idx / FullBlock / first-miss-break logic deserves its own unit test.
3. KvbmOffloadConfig::from_args(&MockEngineArgs) -> Option<Self>. Dedups live.rs:125–133 and scheduler/mod.rs:168–174. Both sites reconstruct the same config with the same block_size * bpt boilerplate.
4. Collapse the async/virtual method pairs on MockOffloadEngine (lib/mocker/src/kv_manager/kvbm_offload.rs):
   - enqueue_g1_eviction(bid, sh, now_ms): one method; branch on self.offload_engine.is_some() internally.
   - start_swap_in(num_blocks, now_ms): same.
   - Merge MockWorker::transfer_delay and MockOffloadEngine::transfer_delay_ms into one helper (one returns Duration, the other f64 ms; gratuitous).
   - SwapInHandle::is_complete(now_ms): single method, live ignores now_ms. Kills the two panic paths.

   Cuts the public surface roughly in half and removes the "which mode am I in?" cognitive load at every call site.
5. Hoist virtual-time bookkeeping onto MockOffloadEngine. Currently KvManager owns pending_offloads + drain_pending_offloads + pending_offload_deadlines + complete_ready_offloads + virtual_time. These are all engine concerns, not cache concerns.

   Shape:
   - engine.record_eviction(bid, sh, now_ms): does the virtual-time branch internally.
   - engine.tick(now_ms): called at pass start; replaces complete_ready_offloads.
   - engine.earliest_pending_deadline() -> Option<f64>: feeds the stall-advance in core.rs:471–480.

   Payoff: KvManager stops knowing about virtual time entirely for offloads. The virtual_time: bool flag either moves onto the engine or disappears (inferred from offload_engine being sync vs async). Removes ~50 lines of #[cfg] fields and methods from KvManager.
6. (Optional extension of #5) Move pending_swap_ins off VllmCore onto the engine too, with engine.tick(now_ms) -> Vec<PromotionReady { uuid, reused_input_tokens }>. Completes the story: all KVBM state lives in one place. VllmCore just iterates the returned promotions, does prepend_waiting, and reports admits. Worth it only if #5 alone still leaves too much KVBM logic in VllmCore.
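
The engine-owned bookkeeping proposed above could look roughly like this. The method names follow the review comment; the fields, the fixed per-block `transfer_delay_ms`, and the `g2_resident` list are assumptions made for the sketch, not the refactored code.

```rust
/// A queued G1 -> G2 offload with its virtual-time deadline.
#[derive(Debug, Clone, PartialEq)]
struct PendingOffload {
    block_id: u32,
    seq_hash: u64,
    complete_at_ms: f64,
}

/// Stand-in engine that owns all virtual-time offload state.
struct MockOffloadEngine {
    transfer_delay_ms: f64, // assumption: one fixed delay per block
    pending: Vec<PendingOffload>,
    g2_resident: Vec<u64>, // sequence hashes that have landed in G2
}

impl MockOffloadEngine {
    fn new(transfer_delay_ms: f64) -> Self {
        Self { transfer_delay_ms, pending: Vec::new(), g2_resident: Vec::new() }
    }

    /// Called by the KV manager on eviction; the deadline is computed here.
    fn record_eviction(&mut self, block_id: u32, seq_hash: u64, now_ms: f64) {
        let complete_at_ms = now_ms + self.transfer_delay_ms;
        self.pending.push(PendingOffload { block_id, seq_hash, complete_at_ms });
    }

    /// Called at pass start; completes every offload whose deadline has passed.
    fn tick(&mut self, now_ms: f64) -> usize {
        let mut completed = 0;
        let resident = &mut self.g2_resident;
        self.pending.retain(|p| {
            if now_ms >= p.complete_at_ms {
                resident.push(p.seq_hash);
                completed += 1;
                false
            } else {
                true
            }
        });
        completed
    }

    /// Earliest outstanding deadline, used to advance virtual time on a stall.
    fn earliest_pending_deadline(&self) -> Option<f64> {
        self.pending
            .iter()
            .map(|p| p.complete_at_ms)
            .min_by(|a, b| a.total_cmp(b))
    }
}
```

With this shape the caller only passes `now_ms` in and never inspects the pending list itself.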

dreamtalen · 2026-04-20T17:25:32Z

@PeaBrane thanks for the feedback! Makes sense, will ping once refactor is completed

PeaBrane

Note for posterity — per-worker virtual time only holds because G2 is per-worker

Writing this down explicitly so a future reader (human or AI) who extends this to shared storage doesn't trip on it. (We should probably also document this briefly in a code comment, if it isn't already.)

Current situation — fine as-is. The offline virtual-time machinery is entirely contained inside one worker: pending_offloads, pending_swap_ins, and BlockManager<G2> all live on that worker's KvManager. No PendingOffload deadline ever needs to be visible to another worker. This works because G2 is modeled as per-worker host memory (each KvManager owns its own BlockManager<G2> sized by num_g2_blocks), so no worker ever needs to query another worker's G2 state. The only externally observable effects — token completion timestamps in the trace and the synthetic Stored/Removed events going to the router — are already routed through existing per-worker pumps (TraceCollector, EnginePassResult.kv_events) and don't require cross-worker coordination. The router tracks per-worker radix trees, so each worker announces its own tier state independently.

This assumption breaks once shared storage is introduced. If someone later adds a G3 tier that's a genuinely shared pool (CXL fabric, RDMA host-memory pool, shared NVMe, NDS-style global cache), worker B at virtual time t will need to be able to observe "block X landed in G3 at t' ≤ t because worker A offloaded it." At that point the pending-completion queue cannot remain a private Vec<PendingOffload> on one KvManager — it has to move up to a shared structure indexed by virtual time. The natural shape is:

- A single BlockManager<G3> (or equivalent shared map) owned by the offline harness, not per-worker.
- A global virtual-time event queue of G3 operations keyed by complete_at_ms. Workers append on evict; workers drain up to now_ms at pass-start.
- G3 find_in_tiers becomes a query against the shared state. "Is this block ready yet?" is answered by whether the global queue has advanced past the block's complete_at_ms.

Architecturally this would look much more like the KV event pump looks today (a shared, virtual-time-ordered stream of tier mutations) than like the current per-worker G2 plumbing. A clean way to hook it in: extend EnginePassResult with something like tier_events: Vec<TierEvent { tier, op, block, complete_at_ms }>, have the offline coordinator merge those into a global priority queue, and re-dispatch completions at the right virtual time. Same pattern as kv_events, just with a different consumer (a shared tier manager instead of the router).
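
One way to picture the merged, virtual-time-ordered queue described above. Everything here is hypothetical: `TierEvent` mirrors the field list in the comment, while `TierOp`, `Scheduled`, and `GlobalTierQueue` are names invented for the sketch, and deadlines are assumed to be finite.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TierOp {
    Stored,
    Removed,
}

/// A tier mutation reported by one worker in its pass result.
#[derive(Debug, Clone, PartialEq)]
struct TierEvent {
    tier: u8,
    op: TierOp,
    block: u64,
    complete_at_ms: f64,
}

/// Heap entry ordered so that the earliest complete_at_ms pops first.
struct Scheduled(TierEvent);

impl Ord for Scheduled {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed comparison: BinaryHeap is a max-heap.
        other.0.complete_at_ms
            .total_cmp(&self.0.complete_at_ms)
            .then_with(|| other.0.block.cmp(&self.0.block))
    }
}
impl PartialOrd for Scheduled {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scheduled {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scheduled {}

/// Queue owned by the offline coordinator, shared across workers.
#[derive(Default)]
struct GlobalTierQueue {
    heap: BinaryHeap<Scheduled>,
}

impl GlobalTierQueue {
    /// Merge the tier events from one worker's pass result.
    fn merge(&mut self, events: impl IntoIterator<Item = TierEvent>) {
        for event in events {
            self.heap.push(Scheduled(event));
        }
    }

    /// Pop every event that has completed by `now_ms`, earliest first.
    fn drain_ready(&mut self, now_ms: f64) -> Vec<TierEvent> {
        let mut ready = Vec::new();
        while self.heap.peek().map_or(false, |s| s.0.complete_at_ms <= now_ms) {
            if let Some(s) = self.heap.pop() {
                ready.push(s.0);
            }
        }
        ready
    }
}
```

A worker asking whether a block is visible at time t would then see only the events already drained up to t, regardless of which worker produced them.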

Bottom line: nothing to do here. The per-worker containment is correct for the current model and clean. But we cannot assume it still works once G3 shared storage is added — at that point the virtual-time offload bookkeeping needs to be refactored onto whatever cross-worker time-ordered machinery the harness has, which today is effectively only the router event pump but would need to be generalized.

cc @ryanolson if you have any idea on parallelization (pdes) over shared block usage / transfer

jthomson04

A couple concerns here in terms of correctness.

Offline-mode logic mirrors kvbm internals by hand

build_sync() bypasses OffloadEngine, InstanceLeader, and PipelineBuilder, reimplementing three contracts that will silently drift if kvbm changes:

- complete_offload mirrors TransferExecutor's post-transfer sequence (allocate_blocks → stage → register_block → drop).
- scan_matches vs match_blocks: relies on a specific semantic difference between two closely-related kvbm APIs.
- offload_batch_size is inert offline: delays are computed per single block.

No CI test runs the same trace through live + offline and asserts equivalence, so drift would pass unnoticed.

No bandwidth contention

transfer_delay = bytes / bandwidth_gbps is computed per transfer with no shared-resource state. Concurrent offloads and swap-ins all get full peak bandwidth; bursty evictions finish as fast as a single one. Under-estimates TTFT under offload pressure.

GPU slot freed before the offload completes

release_block_id returns the slot to block_id_pool immediately on eviction — long before the simulated host transfer might be finished. The new allocator pulls the same slot right back while the "transfer" is still in flight. Effective G1 capacity is inflated, and the scheduler can admit work that a real system wouldn't. Impacts scheduling decisions, not just timing.
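
To make the capacity concern concrete, here is a sketch of a pool that keeps an evicted slot out of circulation until its simulated transfer finishes. This is one possible remedy, not what the PR implements; `BlockIdPool`, `evict_with_offload`, and `release_completed` are hypothetical names.

```rust
use std::collections::VecDeque;

/// G1 slot pool that distinguishes free slots from slots still being offloaded.
struct BlockIdPool {
    free: VecDeque<u32>,
    in_flight: Vec<(u32, f64)>, // (block_id, complete_at_ms)
}

impl BlockIdPool {
    fn new(num_blocks: u32) -> Self {
        Self { free: (0..num_blocks).collect(), in_flight: Vec::new() }
    }

    /// Hand out a free slot, if any. In-flight slots are never returned.
    fn allocate(&mut self) -> Option<u32> {
        self.free.pop_front()
    }

    /// Evict a block whose contents are being copied to the host tier.
    /// The slot stays unavailable until the transfer deadline passes.
    fn evict_with_offload(&mut self, block_id: u32, complete_at_ms: f64) {
        self.in_flight.push((block_id, complete_at_ms));
    }

    /// Return slots whose transfer has finished by `now_ms` to the free list.
    fn release_completed(&mut self, now_ms: f64) -> usize {
        let mut released = 0;
        let free = &mut self.free;
        self.in_flight.retain(|&(block_id, complete_at_ms)| {
            if now_ms >= complete_at_ms {
                free.push_back(block_id);
                released += 1;
                false
            } else {
                true
            }
        });
        released
    }
}
```

Under this scheme an allocation that arrives mid-transfer fails instead of reusing the slot, so the scheduler sees the reduced G1 capacity.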

Some final thoughts

On benchmarks with high KV pressure or long context, results from offline replay will likely differ radically from reality. The current approach will also make it very difficult to integrate offline replay with G3.

dreamtalen · 2026-04-20T21:02:08Z

@jthomson04 thanks for the reviews. I'm refactoring with KVBM-logical as the G1 manager, which should unify some offline paths. Will ping you when ready.

dreamtalen · 2026-04-28T21:10:54Z

Hi @PeaBrane @jthomson04, I pushed a large refactor based on your feedback. The two biggest changes are:

- Added a simple processor-sharing bandwidth model, so concurrent transfers on the same link share bandwidth.
- Reworked live/offline replay to use the same kvbm-engine offload path (OffloadEngine + InstanceLeader + PipelineBuilder + mock Worker). Offline now drives the same hot path with virtual time; live drives it with wall-clock time.

This should address most of the previous concerns, but the change is now chunky. Would you mind doing a high-level architecture/readability pass first? I’m also happy to do a quick walkthrough if that’s easier.

PeaBrane · 2026-04-29T18:34:04Z

@dreamtalen can you put this PR back to review and trigger the CIs if needed

dreamtalen · 2026-04-29T18:35:55Z

/ok to test 44b0516

PeaBrane

Approving this iteration. A few follow-ups I would like tracked:

- Please hook G2 KV events into the router/storage-tier event protocol. When blocks land in G2, emit HostPinned-tier Stored events; when they leave G2, emit the matching lower-tier Removed events. This can be a separate PR.
- In the new transfer hot path, consider using FxHashMap for the TransferId-keyed maps instead of std::HashMap.
- cc @rolson for visibility on the network/bandwidth modeling bits; worth coordinating after the velo network math pieces are refactored.

dreamtalen · 2026-04-29T20:15:26Z

/ok to test 79a4777

Signed-off-by: Yongming Ding <yongmingd@nvidia.com>

dreamtalen · 2026-04-29T22:03:22Z

/ok to test 23b4fed

Signed-off-by: Yongming Ding <yongmingd@nvidia.com>

pull-request-size Bot added the size/XXL label Apr 14, 2026

github-actions Bot added the feat label Apr 14, 2026

dreamtalen force-pushed the yongmingd/replay-kvbm-engine-2 branch 2 times, most recently from 88aabb6 to b7c678b Compare April 14, 2026 20:40

dreamtalen marked this pull request as ready for review April 14, 2026 20:56

dreamtalen requested review from a team and PeaBrane as code owners April 14, 2026 20:56

dreamtalen mentioned this pull request Apr 14, 2026

[FEATURE]: Dynamo Mocker Enhancements #6383

Open

coderabbitai Bot reviewed Apr 14, 2026

Comment thread lib/mocker/src/kv_manager/kvbm_offload.rs Outdated

Comment thread lib/mocker/src/scheduler/mod.rs Outdated

dreamtalen mentioned this pull request Apr 14, 2026

[FEATURE]: KVBM-Mocker integration: multi-tier KV cache offload simulation #8190

Open

8 tasks

dreamtalen force-pushed the yongmingd/replay-kvbm-engine-2 branch from b7c678b to 7568efc Compare April 14, 2026 22:27

dreamtalen requested review from jthomson04 and ryanolson April 20, 2026 17:12

PeaBrane reviewed Apr 20, 2026

jthomson04 requested changes Apr 20, 2026

dreamtalen marked this pull request as draft April 20, 2026 21:02

dreamtalen mentioned this pull request Apr 21, 2026

refactor(mocker): replace vllm block manager with kvbm-logical #8451

Merged

dreamtalen force-pushed the yongmingd/replay-kvbm-engine-2 branch 3 times, most recently from ed993d8 to 46fd899 Compare April 28, 2026 20:17

dreamtalen force-pushed the yongmingd/replay-kvbm-engine-2 branch from 46fd899 to 44b0516 Compare April 28, 2026 21:56

dreamtalen marked this pull request as ready for review April 29, 2026 18:35

PeaBrane approved these changes Apr 29, 2026

jthomson04 self-requested a review April 29, 2026 18:53

jthomson04 approved these changes Apr 29, 2026

PeaBrane reviewed Apr 29, 2026

Comment thread lib/mocker/src/kvbm_offload/engine.rs Outdated

Comment thread lib/mocker/src/kvbm_offload/engine.rs Outdated

dreamtalen force-pushed the yongmingd/replay-kvbm-engine-2 branch from 419c4e3 to 79a4777 Compare April 29, 2026 19:32

copy-pr-bot Bot temporarily deployed to GITLAB April 29, 2026 20:15 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 29, 2026 20:48 Inactive

dreamtalen changed the title ~~feat(mocker): KVBM G2 offload for offline replay~~ feat(mocker): KVBM G2 offload for on/offline replay Apr 29, 2026

dreamtalen added 2 commits April 29, 2026 14:32

feat(mocker): KVBM G1↔G2 offload with unified live/offline virtual-clock

d3f93be

Signed-off-by: Yongming Ding <yongmingd@nvidia.com>

Address comments for G2 block manager init

23b4fed

Signed-off-by: Yongming Ding <yongmingd@nvidia.com>

dreamtalen force-pushed the yongmingd/replay-kvbm-engine-2 branch from 79a4777 to 23b4fed Compare April 29, 2026 21:33

copy-pr-bot Bot temporarily deployed to GITLAB April 29, 2026 22:03 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 29, 2026 23:11 Inactive

dreamtalen merged commit f332454 into main Apr 30, 2026
221 of 227 checks passed

dreamtalen deleted the yongmingd/replay-kvbm-engine-2 branch April 30, 2026 00:15

dreamtalen mentioned this pull request Apr 30, 2026

feat(mocker): integrate kvbm-engine for vllm G2 offload #8033

Closed

keivenchang mentioned this pull request Apr 30, 2026

test(revalidate): feat(mocker): KVBM G2 offload for on/offline replay #8184 #8882

Closed

dreamtalen mentioned this pull request May 1, 2026

feat(mocker): publish tiered KV events for G2 offload #8961

Merged

furionw pushed a commit that referenced this pull request May 2, 2026

feat(mocker): KVBM G2 offload for on/offline replay (#8184)

8d2ebd7

Signed-off-by: Yongming Ding <yongmingd@nvidia.com>
