feat(planner/replay): KV reuse awareness in load + throughput scaling by tedzhouhk · Pull Request #8314 · ai-dynamo/dynamo

tedzhouhk · 2026-04-17T18:19:17Z

Summary

The KV router publishes dynamo_component_router_kv_hit_rate (predicted prefix-cache hit rate at routing time), but the planner ignored it — so scaling decisions over-counted prefill compute work on reuse-heavy workloads and scaled prefill up unnecessarily. This PR threads the signal through both scaling paths, through offline replay, and (in the latest commit) through the right tick cadence for each deployment mode.

Discount math

Load planner (reactive, FPM-driven): estimate_next_ttft in the prefill and agg regressions takes an optional kv_hit_rate and scales (queued + avg_isl) * (1 − clamp(hit_rate)). The regression's x-feature is untouched — only the simulation aggregate is discounted, so no double-counting against post-cache sum_prefill_tokens.
Throughput planner (predictive): a fourth load predictor (_kv_hit_rate_predictor) is added alongside num_req/isl/osl. Per requirement it is not warmed from the mooncake trace (no good offline proxy). Predicted hit rate discounts only the prefill portion of capacity sizing; decode KV residency uses raw ISL because cache hits don't shrink the decode footprint. find_best_engine_agg_rps gains a kv_hit_rate param.
Replay integration: mocker's TrafficAccumulator tracks overlap + ISL blocks via a new on_admission(overlap, isl_blocks) call from the prefill-router dispatch path. TrafficStats carries avg_kv_hit_rate = Σoverlap / Σisl_blocks, matching the real router's per-request histogram semantics. Python binding and replay_adapter thread the field into TrafficObservation.

Tick cadence (latest commit)

Two cadence gaps from the initial implementation are now fixed:

Load-only deployments (enable_throughput_scaling=False): the kv-hit-rate scrape now rides on each load tick over the load interval, via a one-query _collect_kv_hit_rate_observation. Previously the scrape was tied to throughput ticks, so load-only mode silently saw _last_kv_hit_rate=None forever and applied no discount.
Mixed mode (both scalings on): _advance_throughput now promotes the _kv_hit_rate_predictor's smoothed forecast to _last_kv_hit_rate. All load ticks between throughput ticks consume the predicted value rather than the raw last-window observation.
Replay needs no Rust change: the same scheduler flag that drives Prometheus scrapes also drains the mocker TrafficAccumulator at the right cadence per mode.

Safety

_clamp_kv_hit_rate caps at 0.95 (a stale 1.0 reading would zero out queued work and mask backlog), and treats None/NaN as 0.0 (no discount — preserves prior behavior exactly). Diagnostics expose the observed kv_hit_rate and the predicted value per tick.

Test plan

17 Python unit tests across test_load_based_scaling, test_state_machine, test_prometheus: regression discount, clamping, None/NaN fallback, warmup exclusion, diagnostics propagation, end-to-end throughput-scaling demand reduction, prefill/agg parity, load-only direct-pass, mixed-mode predictor promotion, scheduler flag in both modes, predictor-feed gating.
3 Rust tests on TrafficAccumulator including the weighted-by-isl-blocks property.
Full planner unit suite: 126 / 126 passing.
Full mocker suite: 216 / 216 passing.
pre-commit (ruff, isort, black, flake8, codespell) clean on changed files; cargo fmt clean.
End-to-end replay validation on shared-prefix-heavy synthetic workload (reviewer-suggested follow-up).

Out of scope

Per-worker hit rate (router histogram is aggregated; per-worker needs new labels).
Tier-aware discount (GPU vs host vs disk) — current metric conflates tiers.
Decode-side reuse (sum_decode_kv_tokens is already directly measured).
FPM-sourced hit rate (would require engine emitter schema extensions).
Predicting hit rate from the offline trace — explicitly skipped per design.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Integrated KV cache hit rate collection from metrics sources and added hit rate prediction capabilities. Updated load scaling models to incorporate hit rate data for improved load estimation and performance prediction accuracy.
Tests
- Added comprehensive unit test coverage for KV cache hit rate integration across regression models and diagnostic tracking.

coderabbitai · 2026-04-17T18:30:18Z

Walkthrough

This PR introduces KV cache hit-rate tracking across the planner system. Changes propagate hit-rate metrics from Prometheus through traffic observations, integrate them into performance model estimations via scaling factors, add KV hit-rate prediction capabilities, and extend offline replay and monitoring infrastructure to capture and report these metrics.

Changes

Cohort / File(s)	Summary
Core Data Structures `components/src/dynamo/planner/core/types.py`, `components/src/dynamo/planner/monitoring/traffic_metrics.py`	Added `kv_hit_rate` fields to `TrafficObservation` and `TickDiagnostics`. Extended `Metrics` dataclass with `kv_hit_rate` field and added `PrometheusAPIClient.get_avg_kv_hit_rate()` method to query router KV hit rate from Prometheus.
Traffic & State Management `components/src/dynamo/planner/core/base.py`, `components/src/dynamo/planner/core/state_machine.py`	Extended traffic metrics collection to fetch `kv_hit_rate` from Prometheus and propagate into `TrafficObservation`. Added `_kv_hit_rate_predictor` and `_last_kv_hit_rate` tracking to state machine; extended diagnostics with `predicted_kv_hit_rate`.
Performance Model Foundation `components/src/dynamo/planner/core/perf_model/base.py`	Added `_clamp_kv_hit_rate()` helper function and `_MAX_KV_HIT_RATE_DISCOUNT` constant to normalize hit-rate values for downstream scaling operations.
Performance Model Implementations `components/src/dynamo/planner/core/perf_model/prefill.py`, `components/src/dynamo/planner/core/perf_model/agg.py`	Updated `estimate_next_ttft()` methods to accept optional `kv_hit_rate` parameter and apply computed scaling factor `1.0 - clamped_hit_rate` to reduce prefill token work. `AggRegressionModel.find_best_engine_agg_rps()` now adjusts effective ISL based on cache hit rate.
Scaling Decision Logic `components/src/dynamo/planner/core/load_scaling.py`, `components/src/dynamo/planner/core/throughput_scaling.py`	Updated load and throughput scaling decisions to pass `kv_hit_rate` into regression estimators. Added KV hit-rate prediction and exception handling in throughput scaling; extended logging to show raw and effective ISL values.
Monitoring & Offline Replay `components/src/dynamo/planner/monitoring/diagnostics_recorder.py`, `components/src/dynamo/planner/offline/replay_adapter.py`	Extended `TickSnapshot` to capture `observed_kv_hit_rate` and `predicted_kv_hit_rate`. Updated replay adapter to bridge `avg_kv_hit_rate` from trace data into planner observations.
Mocker Runtime Components `lib/mocker/src/replay/offline/components/types.rs`, `lib/mocker/src/replay/offline/components/router.rs`	Added `overlap_blocks` and `isl_blocks` fields to `WorkerAdmission`. Introduced `TrafficAccumulator.on_admission()` method and updated `drain()` signature to compute and return `avg_kv_hit_rate` as `total_overlap_blocks / total_isl_blocks`. Router now captures admission metadata in `AdmitOutcome`.
Mocker Aggregated & Disaggregated Runtimes `lib/mocker/src/replay/offline/agg.rs`, `lib/mocker/src/replay/offline/disagg.rs`, `lib/mocker/src/replay/planner_handle.rs`	Updated `drain_traffic()` return type from 4-tuple to 5-tuple to include `avg_kv_hit_rate`. Both runtimes now call `traffic.on_admission()` when dispatching admissions. PlannerReplayHandle delegates to runtime's drain method without change in logic.
Python Bindings `lib/bindings/python/rust/llm/replay.rs`, `lib/bindings/python/src/dynamo/prometheus_names.py`	Updated `PlannerReplayBridge.drain_traffic()` return to include `avg_kv_hit_rate`. Added `router.KV_HIT_RATE` Prometheus metric name constant.
Test Coverage `components/src/dynamo/planner/tests/unit/test_load_based_scaling.py`, `components/src/dynamo/planner/tests/unit/test_prometheus.py`, `components/src/dynamo/planner/tests/unit/test_state_machine.py`	Added comprehensive unit tests covering `kv_hit_rate` parameter handling in regression models (clamping, scaling behavior), Prometheus client method for both router and non-router sources, and end-to-end planner plumbing with prediction and diagnostics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 95.77% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	PR description comprehensively covers overview, changes, discount math, tick cadence fixes, safety mechanisms, test coverage, and out-of-scope items.
Title check	✅ Passed	The PR title clearly describes the main feature: KV reuse awareness in load and throughput scaling. It accurately summarizes the primary change across the changeset.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

components/src/dynamo/planner/core/perf_model/agg.py (1)
181-194: ⚠️ Potential issue | 🟠 Major

Inner estimate_next_ttft call drops kv_hit_rate — TTFT SLA check is inconsistent with the discount applied everywhere else.

find_best_engine_agg_rps builds effective_isl = isl * (1 - clamp(kv_hit_rate)) and uses it to scale prefill_per_iter, but then calls self.estimate_next_ttft(...) without forwarding kv_hit_rate. Inside that call, scale defaults to 1.0, so the hypothetical next request’s avg_isl contribution is added back at full cost. The net effect: the TTFT SLA gate in the bs sweep sees an undiscounted TTFT even when cache reuse is high, which can cause the sweep to break early and under-report achievable RPS — the exact behavior this PR is trying to avoid. The docstring on Lines 150-154 also explicitly says the prefill portion is discounted, so this is a behavioral/docstring mismatch.
🔧 Proposed fix
             est_ttft = self.estimate_next_ttft(
                 queued_prefill_tokens=int(prefill_per_iter),
                 max_num_batched_tokens=max_num_batched_tokens,
                 current_decode_kv=int(decode_kv),
+                kv_hit_rate=kv_hit_rate,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/perf_model/agg.py` around lines 181 - 194,
The TTFT calculation in find_best_engine_agg_rps is using a discounted
effective_isl for prefill_per_iter but calls self.estimate_next_ttft(...)
without forwarding kv_hit_rate (or equivalent scale), causing the TTFT gate to
use an undiscounted avg_isl; fix by passing the same kv_hit_rate (or compute
scale = 1.0 - clamp(kv_hit_rate)) into estimate_next_ttft so its scale parameter
reflects the cache hit discount used for prefill_per_iter, e.g., update the call
site in find_best_engine_agg_rps to include kv_hit_rate or scale so
estimate_next_ttft uses the same discounting logic.
components/src/dynamo/planner/core/throughput_scaling.py (1)
212-243: ⚠️ Potential issue | 🟠 Major

Verify asymmetry: agg throughput's TTFT estimate does not receive the kv_hit_rate discount.

In find_best_engine_agg_rps (agg.py line 188–193), the call to estimate_next_ttft omits the kv_hit_rate parameter entirely. This means the TTFT estimate computes scale = 1.0 (no discount), while prefill_per_iter was already discounted via effective_isl = isl * (1.0 - clamp(kv_hit_rate)). The prefill throughput path pre-discounts isl before calling find_best_engine_prefill_rps, applying the discount once in a single regression call. However:

Prefill path: Discount applied upfront → wall_time(discounted_isl)

Agg path: Discount applied to prefill_per_iter for _predict_2d call, but estimate_next_ttft receives no discount and treats aggregate as full-scale

Result: Different regression kernels (_predict_2d vs. _predict_wall_time, 2D vs. 1D) receive inconsistent inputs under cache hit conditions, yielding mismatched TTFT and ITL estimates

Align by passing kv_hit_rate to the estimate_next_ttft call in the agg loop, or document why the TTFT estimate intentionally omits the discount.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/throughput_scaling.py` around lines 212 -
243, The agg path omits the kv_hit_rate when calling estimate_next_ttft in
find_best_engine_agg_rps, causing TTFT to be computed without the same KV-hit
discount applied elsewhere; fix by passing the kv_hit_rate (after clamping with
_clamp_kv_hit_rate) into estimate_next_ttft so the same effective ISL/discount
used for prefill_per_iter and _predict_2d is also applied to the
_predict_wall_time TTFT estimate (adjust inputs to estimate_next_ttft in
find_best_engine_agg_rps to include kv_hit_rate), or add a clear comment in
find_best_engine_agg_rps explaining intentional asymmetry if that was
deliberate.

🧹 Nitpick comments (4)

components/src/dynamo/planner/core/base.py (1)
465-469: Minor: kv_hit_rate may be NaN, not just None.

The is None guard on Line 465 only handles None; if Prometheus returns NaN (e.g., empty series converted to NaN by the client), the log will render "nan" and downstream consumers rely on _clamp_kv_hit_rate to normalize it. Either NaN-guard here for consistency with _clamp_kv_hit_rate, or leave as-is and rely on the clamp — just make sure it’s intentional.
-        hit_rate_str = f"{m.kv_hit_rate:.3f}" if m.kv_hit_rate is not None else "n/a"
+        hit_rate_str = (
+            f"{m.kv_hit_rate:.3f}"
+            if m.kv_hit_rate is not None and not math.isnan(m.kv_hit_rate)
+            else "n/a"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/base.py` around lines 465 - 469, The log
currently treats m.kv_hit_rate only for None and will print "nan" if
m.kv_hit_rate is NaN; update the logger formatting around m.kv_hit_rate in the
logger.info block (the hit_rate_str assignment used above the logger) to guard
against NaN the same way _clamp_kv_hit_rate does (e.g., treat NaN like None and
render "n/a" or the normalized value), so that hit_rate_str is never the literal
"nan" when passed into logger.info.
components/src/dynamo/planner/monitoring/diagnostics_recorder.py (1)
58-66: LGTM.

Observed vs predicted hit-rate are surfaced symmetrically; both default to None so existing snapshots/HTML reports handle missing data gracefully. Consider adding a subplot for them in _build_report_html in a follow-up so the new signal is actually visible in the generated report.

Also applies to: 169-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/monitoring/diagnostics_recorder.py` around
lines 58 - 66, Add a new subplot in the _build_report_html function to visualize
observed_kv_hit_rate alongside predicted_kv_hit_rate so the new signal is
visible in generated reports; locate the _build_report_html method and add a
plot/trace that reads DiagnosticsRecorder.observed_kv_hit_rate and
DiagnosticsRecorder.predicted_kv_hit_rate (or the local variables named
observed_kv_hit_rate / predicted_kv_hit_rate) with appropriate labels/legend and
handle None values gracefully (skip or show gaps) to match existing plotting
patterns used for the other predicted_* vs observed_* signals.
lib/mocker/src/replay/offline/components/router.rs (1)
487-565: Clean refactor; AdmitOutcome threads the two overlap counters through both admit paths consistently.

A couple of small observations (non-blocking):

Line 516-517: u32::try_from(...).unwrap_or(u32::MAX) silently saturates. For realistic ISL this can't trigger, but if it ever did, downstream ratios would skew. A debug_assert! (or tracing::warn! with the ISL) would make the saturation observable without affecting release behavior.

Line 265-267: request.uuid.expect(...) is redundant with the uuid extraction already performed inside build_pending_request (line 438-440). You could pull uuid from pending.uuid before admit_request consumes it and drop the panic path entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/mocker/src/replay/offline/components/router.rs` around lines 487 - 565,
The admit_request path currently silently saturates isl_blocks with
u32::try_from(...).unwrap_or(u32::MAX); add a debug_assert!(isl_blocks !=
u32::MAX, "saturated isl_blocks: {}", request.isl_tokens) or emit a
tracing::warn! with the original ISL value right after computing isl_blocks to
surface unexpected saturation in debug/tracing builds; and in drain_pending,
avoid the redundant expect by extracting uuid from the QueueEntry before calling
admit_request (use the uuid variable from the popped QueueEntry instead of
relying on request.uuid inside admit_request) so you don't need the panic path
in build_pending_request and to keep admission consumption semantics correct.
components/src/dynamo/planner/core/throughput_scaling.py (1)
84-99: Broad except Exception without re-raise.

Per the repo's Python guidelines ("Prefer failing fast … if catching Exception, log then re-raise"), and per Ruff BLE001. The cold-start-fallback intent is clear from the docstring, but swallowing any exception here will mask genuine predictor bugs (shape mismatches, serialization errors) as "no prediction available." If a None-on-cold-start contract is really what you want, consider catching the predictor's specific "not enough data" exception type instead of bare Exception.

As per coding guidelines: "Prefer failing fast (avoid broad excepts; if catching Exception, log then re-raise)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/throughput_scaling.py` around lines 84 -
99, The method _predict_kv_hit_rate currently swallows all exceptions from
self._kv_hit_rate_predictor.predict_next; change this to only treat the
predictor's "not ready" exception as a benign cold-start (e.g., catch the
predictor-specific exception such as PredictorNotReady / NotEnoughDataError
thrown by the predictor and in that branch set self._diag_predicted_kv_hit_rate
= None and return None), and for any other Exception from predict_next log the
error and re-raise so real bugs surface; update the except block around
self._kv_hit_rate_predictor.predict_next accordingly and reference
_diag_predicted_kv_hit_rate and _predict_kv_hit_rate when implementing the
narrower handling.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/planner/core/state_machine.py`:
- Around line 304-306: The code only updates _last_kv_hit_rate and calls
_kv_hit_rate_predictor.add_data_point inside _observe_traffic(), which on_tick()
only invokes in the throughput-scaling branch, so load-only deployments never
learn kv_hit_rate; to fix, move or duplicate the KV-hit observation so it runs
on load ticks when SLA load scaling is enabled: call the same update logic that
sets self._last_kv_hit_rate and calls
self._kv_hit_rate_predictor.add_data_point(traffic.kv_hit_rate) from on_tick()’s
load-scaling path (or extract it into a helper method e.g.,
_record_kv_hit_rate(traffic) and invoke it from both _observe_traffic() and the
load-scaling branch), ensuring you check for traffic.kv_hit_rate is not None and
not math.isnan before updating; also ensure load_scaling.py’s TTFT discount
calculation reads self._last_kv_hit_rate after this change so it no longer
remains None.

In `@components/src/dynamo/planner/monitoring/traffic_metrics.py`:
- Around line 305-321: get_avg_kv_hit_rate currently lets _get_average_metric
collapse “no data / query failed” into 0.0 which downstream
(PlannerStateMachine._observe_traffic) treats as a valid observation; change the
code so missing router data is preserved as None. Update _get_average_metric to
return Optional[float] and avoid coercing empty/no-result Prometheus responses
to 0.0 (return None instead), then have get_avg_kv_hit_rate call
_get_average_metric and directly return its Optional[float] result (no
defaulting to 0.0); this preserves tri-state semantics (real float or None) so
PlannerStateMachine._observe_traffic can distinguish missing data.

In `@lib/bindings/python/src/dynamo/prometheus_names.py`:
- Around line 304-305: The KV_HIT_RATE metric name is inconsistent between
Python (KV_HIT_RATE = "router_kv_hit_rate" in prometheus_names.py) and Rust
(KV_HIT_RATE = "kv_hit_rate" in lib/runtime/src/metrics/prometheus_names.rs);
decide which source is authoritative and make them match: either move the Rust
constant into the router module and rename it to "router_kv_hit_rate" (so the
router module follows the "router_" prefix pattern), or change the Python
constant value to "kv_hit_rate" and then regenerate the Python bindings; after
updating the authoritative source (modify the KV_HIT_RATE definition in
prometheus_names.rs or the Python constant), run the generator to refresh
lib/bindings/python/src/dynamo/prometheus_names.py so both sides use the
identical KV_HIT_RATE value.

---

Outside diff comments:
In `@components/src/dynamo/planner/core/perf_model/agg.py`:
- Around line 181-194: The TTFT calculation in find_best_engine_agg_rps is using
a discounted effective_isl for prefill_per_iter but calls
self.estimate_next_ttft(...) without forwarding kv_hit_rate (or equivalent
scale), causing the TTFT gate to use an undiscounted avg_isl; fix by passing the
same kv_hit_rate (or compute scale = 1.0 - clamp(kv_hit_rate)) into
estimate_next_ttft so its scale parameter reflects the cache hit discount used
for prefill_per_iter, e.g., update the call site in find_best_engine_agg_rps to
include kv_hit_rate or scale so estimate_next_ttft uses the same discounting
logic.

In `@components/src/dynamo/planner/core/throughput_scaling.py`:
- Around line 212-243: The agg path omits the kv_hit_rate when calling
estimate_next_ttft in find_best_engine_agg_rps, causing TTFT to be computed
without the same KV-hit discount applied elsewhere; fix by passing the
kv_hit_rate (after clamping with _clamp_kv_hit_rate) into estimate_next_ttft so
the same effective ISL/discount used for prefill_per_iter and _predict_2d is
also applied to the _predict_wall_time TTFT estimate (adjust inputs to
estimate_next_ttft in find_best_engine_agg_rps to include kv_hit_rate), or add a
clear comment in find_best_engine_agg_rps explaining intentional asymmetry if
that was deliberate.

---

Nitpick comments:
In `@components/src/dynamo/planner/core/base.py`:
- Around line 465-469: The log currently treats m.kv_hit_rate only for None and
will print "nan" if m.kv_hit_rate is NaN; update the logger formatting around
m.kv_hit_rate in the logger.info block (the hit_rate_str assignment used above
the logger) to guard against NaN the same way _clamp_kv_hit_rate does (e.g.,
treat NaN like None and render "n/a" or the normalized value), so that
hit_rate_str is never the literal "nan" when passed into logger.info.

In `@components/src/dynamo/planner/core/throughput_scaling.py`:
- Around line 84-99: The method _predict_kv_hit_rate currently swallows all
exceptions from self._kv_hit_rate_predictor.predict_next; change this to only
treat the predictor's "not ready" exception as a benign cold-start (e.g., catch
the predictor-specific exception such as PredictorNotReady / NotEnoughDataError
thrown by the predictor and in that branch set self._diag_predicted_kv_hit_rate
= None and return None), and for any other Exception from predict_next log the
error and re-raise so real bugs surface; update the except block around
self._kv_hit_rate_predictor.predict_next accordingly and reference
_diag_predicted_kv_hit_rate and _predict_kv_hit_rate when implementing the
narrower handling.

In `@components/src/dynamo/planner/monitoring/diagnostics_recorder.py`:
- Around line 58-66: Add a new subplot in the _build_report_html function to
visualize observed_kv_hit_rate alongside predicted_kv_hit_rate so the new signal
is visible in generated reports; locate the _build_report_html method and add a
plot/trace that reads DiagnosticsRecorder.observed_kv_hit_rate and
DiagnosticsRecorder.predicted_kv_hit_rate (or the local variables named
observed_kv_hit_rate / predicted_kv_hit_rate) with appropriate labels/legend and
handle None values gracefully (skip or show gaps) to match existing plotting
patterns used for the other predicted_* vs observed_* signals.

In `@lib/mocker/src/replay/offline/components/router.rs`:
- Around line 487-565: The admit_request path currently silently saturates
isl_blocks with u32::try_from(...).unwrap_or(u32::MAX); add a
debug_assert!(isl_blocks != u32::MAX, "saturated isl_blocks: {}",
request.isl_tokens) or emit a tracing::warn! with the original ISL value right
after computing isl_blocks to surface unexpected saturation in debug/tracing
builds; and in drain_pending, avoid the redundant expect by extracting uuid from
the QueueEntry before calling admit_request (use the uuid variable from the
popped QueueEntry instead of relying on request.uuid inside admit_request) so
you don't need the panic path in build_pending_request and to keep admission
consumption semantics correct.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01428679-eef3-47aa-9514-05dfb563627a

📥 Commits

Reviewing files that changed from the base of the PR and between 5b03a59 and c88c209.

📒 Files selected for processing (21)

components/src/dynamo/planner/core/base.py
components/src/dynamo/planner/core/load_scaling.py
components/src/dynamo/planner/core/perf_model/agg.py
components/src/dynamo/planner/core/perf_model/base.py
components/src/dynamo/planner/core/perf_model/prefill.py
components/src/dynamo/planner/core/state_machine.py
components/src/dynamo/planner/core/throughput_scaling.py
components/src/dynamo/planner/core/types.py
components/src/dynamo/planner/monitoring/diagnostics_recorder.py
components/src/dynamo/planner/monitoring/traffic_metrics.py
components/src/dynamo/planner/offline/replay_adapter.py
components/src/dynamo/planner/tests/unit/test_load_based_scaling.py
components/src/dynamo/planner/tests/unit/test_prometheus.py
components/src/dynamo/planner/tests/unit/test_state_machine.py
lib/bindings/python/rust/llm/replay.rs
lib/bindings/python/src/dynamo/prometheus_names.py
lib/mocker/src/replay/offline/agg.rs
lib/mocker/src/replay/offline/components/router.rs
lib/mocker/src/replay/offline/components/types.rs
lib/mocker/src/replay/offline/disagg.rs
lib/mocker/src/replay/planner_handle.rs

The KV router already publishes `dynamo_component_router_kv_hit_rate` (predicted prefix-cache hit rate at routing time), but the planner ignored it -- so scaling decisions over-counted prefill compute work on reuse-heavy workloads and scaled prefill up unnecessarily. This threads the signal through both scaling paths. Load planner: `estimate_next_ttft` in the prefill and agg regression models takes an optional `kv_hit_rate` and scales `(queued + avg_isl)` by `(1 - clamp(hit_rate))`. The regression's x-feature (per-iter chunk size) is untouched, so the fit stays valid -- only the simulation aggregate is discounted. `load_scaling` passes the sticky `state._last_kv_hit_rate` into both call sites. Throughput planner: a fourth load predictor (`_kv_hit_rate_predictor`) is added alongside num_req/isl/osl. Per requirement it is NOT warmed from the mooncake trace (no good offline proxy). At prediction time the predicted hit rate discounts only the prefill portion -- decode KV residency uses the raw ISL because cache hits reduce prefill compute but not the decode-time KV footprint. `find_best_engine_agg_rps` gains a `kv_hit_rate` param that applies the discount to its internal prefill-per-iter simulation. Replay: mocker's `TrafficAccumulator` now tracks overlap + ISL blocks via `on_admission(overlap_blocks, isl_blocks)` called from the prefill-router dispatch path (disagg) and the router admission path (agg). `drain_traffic` returns a 5-tuple including `avg_kv_hit_rate` as `Σoverlap / Σisl_blocks`, matching the real router's per-request histogram semantics. Python binding and `replay_adapter` thread the field into `TrafficObservation` so offline replay can exercise the same discount logic end-to-end. Safety: `_clamp_kv_hit_rate` caps at 0.95 (a stale 1.0 reading would otherwise zero out queued work), and treats None/NaN as 0.0 (no discount, preserving prior behavior exactly). Diagnostics expose the observed `kv_hit_rate` and the predicted value on each tick. Tests: 10 new Python tests covering regression discount, clamping, None/NaN fallback, warmup exclusion, diagnostics propagation, and end-to-end throughput-scaling demand reduction; 3 Rust tests on `TrafficAccumulator` including the weighted-by-isl-blocks property. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

devin-ai-integration

Devin Review found 2 new potential issues.

View 5 additional findings in Devin Review.

… for load scaling Plug two cadence gaps in the original KV-reuse-aware planner: 1. **Load-only deployments now scrape kv_hit_rate.** Previously the Prometheus scrape was tied to the throughput tick, so deployments with `enable_throughput_scaling=False` silently saw `_last_kv_hit_rate = None` forever and applied no discount. The scheduler now requests a kv-hit-rate-only scrape on each load tick (window = load_adjustment_interval) when throughput scaling is disabled. 2. **Mixed-mode load scaling now reads the predicted value.** When throughput scaling is enabled, the kv_hit_rate predictor produces a smoothed forecast on each throughput tick. `_advance_throughput` now promotes that predicted value to `_last_kv_hit_rate`, so all subsequent load ticks (between throughput ticks) consume the forecast rather than the raw last-window observation. Implementation: - `state_machine._next_scheduled_tick`: in load-only mode, set `need_traffic_metrics=True` and `traffic_metrics_duration_s = load_adjustment_interval` on load ticks. - `state_machine._observe_traffic`: gate num_req/isl/osl predictor feeds on `enable_throughput_scaling` (avoid polluting unused predictors with the load-only path's placeholder zeros). Only write directly to `_last_kv_hit_rate` in load-only mode; in mixed mode feed the predictor and let the throughput tick do the promotion. - `state_machine.on_tick`: parallel branch in `run_load_scaling` calls `_observe_traffic` for pure load ticks (skipped in mixed mode where the throughput branch already consumed it). - `throughput_scaling._advance_throughput`: after `_predict_kv_hit_rate`, store the predicted value in `_last_kv_hit_rate` for cross-cadence consumption. - `base.py`: add `_collect_kv_hit_rate_observation(duration_s)` — a cheap one-query scrape used in load-only mode to avoid issuing the six unused traffic queries per load tick. Dispatch in `_gather_tick_input` based on `tick.run_throughput_scaling`. Replay needs no code change: `TrafficStats.avg_kv_hit_rate` already exists, and the scheduler change drains the mocker's TrafficAccumulator at load cadence in load-only mode automatically. Tests: 7 new state-machine tests covering load-only direct-pass, mixed-mode predictor promotion, scheduler flag in both modes, and predictor-feed gating. Updated the existing `test_load_only` initial tick test to reflect the new `need_traffic_metrics=True` behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Three issues from CodeRabbit / Devin reviews on PR #8314: 1. **prometheus_names: register router KV_HIT_RATE in Rust source.** The Python constant `router.KV_HIT_RATE = "router_kv_hit_rate"` was added directly to the auto-generated file with no matching Rust entry, so the next codegen run would clobber it. Add `pub const KV_HIT_RATE` to the `router` module in `lib/runtime/src/metrics/prometheus_names.rs` and regenerate the Python module via `dynamo-codegen gen-python-prometheus-names`. 2. **traffic_metrics: preserve None for missing kv_hit_rate.** Routing through `_get_average_metric` collapsed Prometheus scrape gaps / NaN into a real `0.0`, which downstream is treated as a valid observation and would drag the predictor / sticky value down to zero on every failed scrape. Inline a custom query in `get_avg_kv_hit_rate` that returns `None` on empty/NaN/error so `_clamp_kv_hit_rate(None)` falls back to no-discount behavior. `_get_average_metric` is left unchanged (other metrics intentionally prefer 0 on missing data). 3. **agg.find_best_engine_agg_rps: uniform kv_hit_rate discount in TTFT estimate.** The batch-size sweep computed `prefill_per_iter` with `effective_isl` (already discounted) and then passed it to `estimate_next_ttft` *without* forwarding `kv_hit_rate`. Inside `estimate_next_ttft` the discount defaulted to 1.0, so the `avg_isl` term (the hypothetical next request's ISL from the regression model) was *not* discounted, inflating the predicted TTFT and over-provisioning replicas at high hit rate. Fix by passing the *raw* `prefill_per_iter = bs * isl / osl` together with `kv_hit_rate`, letting the function apply the discount uniformly to both the queued portion and `avg_isl`. Also switch the `_prefill_balanced` constraint to use `effective_isl` so the prefill admission rate reflects the cache discount and the batch-size cap widens correspondingly. Tests: new regression test `test_agg_find_best_engine_rps_uniform_discount_in_ttft_estimate` asserts that `engine_rps` strictly grows when `kv_hit_rate` jumps from 0.0 to 0.8 under permissive SLAs (the bug would have left it suppressed by the still-full `avg_isl` term). 127 / 127 planner unit tests pass; 216 / 216 mocker tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

@jthomson04

Per @jthomson04's review on PR #8314: the mocker's TrafficAccumulator was computing avg_kv_hit_rate as a block-weighted aggregate ratio ``Σoverlap_blocks / Σisl_blocks``, which gives larger requests more weight than smaller ones. The real router's Prometheus histogram observes one ``overlap/isl`` sample per request (see RequestTracker::kv_hit_rate in timing.rs:308-314), and the planner's PromQL ``sum(increase(_sum)) / sum(increase(_count))`` returns the arithmetic mean of those per-request samples. The two diverge when requests have variable ISL: Example: request A (4 ISL blocks, 3 overlap) + request B (12 ISL blocks, 0 overlap). Real router Prometheus: (0.75 + 0.0) / 2 = 0.375 Old mocker: 3 / 16 = 0.1875 Switch TrafficAccumulator to track a running sum of per-request ``overlap / isl`` ratios and a sample count; drain returns ``total_hit_rate / hit_rate_count``. Admissions with ``isl_blocks == 0`` are skipped (no meaningful ratio), matching ``RequestTracker::kv_hit_rate`` returning ``None`` in that case. Updates: - ``TrafficAccumulator``: replace ``total_overlap_blocks: u64``, ``total_isl_blocks: u64`` with ``total_hit_rate: f64``, ``hit_rate_count: usize``. ``on_admission`` now computes the per-request ratio up front. - ``TrafficStats`` and adapter-layer docstrings (planner_handle.rs, bindings/replay.rs) updated to describe the per-request-mean semantics. - Rename the Rust regression test ``traffic_accumulator_hit_rate_is_weighted_by_isl_blocks`` → ``traffic_accumulator_hit_rate_is_mean_of_per_request_ratios``, flip its expected value (0.375 instead of 0.1875), and add a new ``traffic_accumulator_skips_admissions_with_zero_isl_blocks`` test. Tests: 217 / 217 mocker tests pass (216 + 1 new); 127 / 127 planner tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

tedzhouhk requested review from a team as code owners April 17, 2026 18:19

pull-request-size Bot added the size/XL label Apr 17, 2026

github-actions Bot added feat planner labels Apr 17, 2026

This comment was marked as resolved.

Sign in to view

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread components/src/dynamo/planner/core/state_machine.py Outdated

Comment thread components/src/dynamo/planner/monitoring/traffic_metrics.py Outdated

Comment thread lib/bindings/python/src/dynamo/prometheus_names.py

tedzhouhk force-pushed the hzhou/planner-kv-reuse-awareness branch from c88c209 to 518a702 Compare April 17, 2026 20:31

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 20:31 Inactive

devin-ai-integration Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread components/src/dynamo/planner/core/perf_model/agg.py

Comment thread components/src/dynamo/planner/core/perf_model/agg.py

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 20:49 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 21:56 Inactive

tedzhouhk changed the title ~~feat(planner): KV reuse awareness in load + throughput scaling~~ feat(planner/replay): KV reuse awareness in load + throughput scaling Apr 17, 2026

pull-request-size Bot added size/XXL and removed size/XL labels Apr 17, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 22:10 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 22:29 Inactive

jthomson04 reviewed Apr 17, 2026

View reviewed changes

Comment thread components/src/dynamo/planner/monitoring/traffic_metrics.py

copy-pr-bot Bot temporarily deployed to GITLAB April 18, 2026 00:40 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 18, 2026 00:41 Inactive

nv-yna mentioned this pull request Apr 18, 2026

[test] fix(trtllm): DYN-2715 on release/1.1.0 with #8324 IRSA cherry-pick #8338

Merged

4 tasks

Merge branch 'main' into hzhou/planner-kv-reuse-awareness

db77afb

copy-pr-bot Bot temporarily deployed to GITLAB April 20, 2026 16:08 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 20, 2026 16:11 Inactive

Merge branch 'main' into hzhou/planner-kv-reuse-awareness

81c732b

copy-pr-bot Bot temporarily deployed to GITLAB April 20, 2026 16:28 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 20, 2026 16:48 Inactive

PeaBrane approved these changes Apr 20, 2026

View reviewed changes

tedzhouhk merged commit c388483 into main Apr 20, 2026
90 checks passed

tedzhouhk deleted the hzhou/planner-kv-reuse-awareness branch April 20, 2026 17:24

dmitry-tokarev-nv mentioned this pull request Apr 20, 2026

fix(planner/tests): scope test_advisory_mode stubs so they don't leak #8418

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(planner/replay): KV reuse awareness in load + throughput scaling#8314

feat(planner/replay): KV reuse awareness in load + throughput scaling#8314
tedzhouhk merged 6 commits into
mainfrom
hzhou/planner-kv-reuse-awareness

tedzhouhk commented Apr 17, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tedzhouhk commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Discount math

Tick cadence (latest commit)

Safety

Test plan

Out of scope

Summary by CodeRabbit

Release Notes

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tedzhouhk commented Apr 17, 2026 •

edited

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading