
feat(planner/replay): KV reuse awareness in load + throughput scaling#8314

Merged
tedzhouhk merged 6 commits into main from hzhou/planner-kv-reuse-awareness on Apr 20, 2026

Conversation


@tedzhouhk tedzhouhk commented Apr 17, 2026

Summary

The KV router publishes dynamo_component_router_kv_hit_rate (predicted prefix-cache hit rate at routing time), but the planner ignored it — so scaling decisions over-counted prefill compute work on reuse-heavy workloads and scaled prefill up unnecessarily. This PR threads the signal through both scaling paths, through offline replay, and (in the latest commit) through the right tick cadence for each deployment mode.

Discount math

  • Load planner (reactive, FPM-driven): estimate_next_ttft in the prefill and agg regressions takes an optional kv_hit_rate and scales (queued + avg_isl) by (1 − clamp(hit_rate)). The regression's x-feature is untouched; only the simulation aggregate is discounted, so there is no double-counting against post-cache sum_prefill_tokens.
  • Throughput planner (predictive): a fourth load predictor (_kv_hit_rate_predictor) is added alongside num_req/isl/osl. Per requirement it is not warmed from the mooncake trace (no good offline proxy). Predicted hit rate discounts only the prefill portion of capacity sizing; decode KV residency uses raw ISL because cache hits don't shrink the decode footprint. find_best_engine_agg_rps gains a kv_hit_rate param.
  • Replay integration: mocker's TrafficAccumulator tracks overlap + ISL blocks via a new on_admission(overlap, isl_blocks) call from the prefill-router dispatch path. TrafficStats carries avg_kv_hit_rate = Σoverlap / Σisl_blocks, matching the real router's per-request histogram semantics. Python binding and replay_adapter thread the field into TrafficObservation.
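
The discount described in the bullets above can be sketched as follows. This is a minimal illustration rather than the planner's code: the function names are hypothetical, and the lower clamp bound of 0 is an assumption (only the 0.95 cap is stated in this PR).

```python
from typing import Tuple


def discounted_prefill_tokens(queued: float, avg_isl: float, hit_rate: float) -> float:
    """Prefill work the TTFT simulation accounts for after cache reuse."""
    # Assumption: clamp to [0, 0.95]. The 0.95 cap comes from the Safety
    # section; the lower bound of 0 is a guess made for this sketch.
    clamped = min(max(hit_rate, 0.0), 0.95)
    return (queued + avg_isl) * (1.0 - clamped)


def sizing_isl(isl: float, hit_rate: float) -> Tuple[float, float]:
    """Return (prefill_isl, decode_isl) for throughput capacity sizing."""
    clamped = min(max(hit_rate, 0.0), 0.95)
    # Only prefill compute shrinks; decode KV residency keeps the raw ISL.
    return isl * (1.0 - clamped), isl


print(discounted_prefill_tokens(queued=1000, avg_isl=1000, hit_rate=0.5))
print(sizing_isl(isl=4000, hit_rate=0.25))
```

With a 50% hit rate, half of the queued-plus-incoming prefill tokens are treated as already cached, while the decode-side ISL is passed through unchanged.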

Tick cadence (latest commit)

Two cadence gaps from the initial implementation are now fixed:

  • Load-only deployments (enable_throughput_scaling=False): the kv-hit-rate scrape now rides on each load tick over the load interval, via a one-query _collect_kv_hit_rate_observation. Previously the scrape was tied to throughput ticks, so load-only mode silently saw _last_kv_hit_rate=None forever and applied no discount.
  • Mixed mode (both scalings on): _advance_throughput now promotes the _kv_hit_rate_predictor's smoothed forecast to _last_kv_hit_rate. All load ticks between throughput ticks consume the predicted value rather than the raw last-window observation.
  • Replay needs no Rust change: the same scheduler flag that drives Prometheus scrapes also drains the mocker TrafficAccumulator at the right cadence per mode.
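
A rough sketch of the two cadence rules above, assuming a simplified state object. `PlannerState`, `on_load_tick_observation`, and `on_throughput_tick` are hypothetical names for illustration; ignoring a `None` forecast is also an assumption, not something this PR states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannerState:
    enable_throughput_scaling: bool
    last_kv_hit_rate: Optional[float] = None  # value consumed by load ticks


def on_load_tick_observation(state: PlannerState, observed: Optional[float]) -> None:
    """Load-only mode: the per-load-tick scrape is used directly."""
    if not state.enable_throughput_scaling and observed is not None:
        state.last_kv_hit_rate = observed


def on_throughput_tick(state: PlannerState, predicted: Optional[float]) -> None:
    """Mixed mode: promote the predictor's forecast for later load ticks."""
    if state.enable_throughput_scaling and predicted is not None:
        state.last_kv_hit_rate = predicted


load_only = PlannerState(enable_throughput_scaling=False)
on_load_tick_observation(load_only, 0.4)   # raw observation is used as-is

mixed = PlannerState(enable_throughput_scaling=True)
on_load_tick_observation(mixed, 0.4)       # ignored: predictor owns the value
on_throughput_tick(mixed, 0.3)             # forecast is promoted
print(load_only.last_kv_hit_rate, mixed.last_kv_hit_rate)
```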

Safety

_clamp_kv_hit_rate caps at 0.95 (a stale 1.0 reading would zero out queued work and mask backlog), and treats None/NaN as 0.0 (no discount — preserves prior behavior exactly). Diagnostics expose the observed kv_hit_rate and the predicted value per tick.
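
A minimal sketch of the clamping behavior described above. The helper is written without the leading underscore to mark it as an illustration rather than the PR's `_clamp_kv_hit_rate`; flooring negative readings at 0.0 is an assumption.

```python
import math
from typing import Optional

MAX_KV_HIT_RATE_DISCOUNT = 0.95  # cap stated in this PR


def clamp_kv_hit_rate(hit_rate: Optional[float]) -> float:
    """Normalize a hit-rate reading into a safe discount in [0, 0.95]."""
    # Missing or NaN readings mean "no discount", preserving prior behavior.
    if hit_rate is None or math.isnan(hit_rate):
        return 0.0
    # Assumption: negative readings are floored at 0.0.
    return min(max(hit_rate, 0.0), MAX_KV_HIT_RATE_DISCOUNT)


# A stale 1.0 reading still leaves 5% of queued work visible to the planner.
queued_tokens = 10_000
remaining = queued_tokens * (1.0 - clamp_kv_hit_rate(1.0))
print(round(remaining))
```

Without the cap, the same stale reading would multiply the queue by zero and hide the backlog entirely.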

Test plan

  • 17 Python unit tests across test_load_based_scaling, test_state_machine, test_prometheus: regression discount, clamping, None/NaN fallback, warmup exclusion, diagnostics propagation, end-to-end throughput-scaling demand reduction, prefill/agg parity, load-only direct-pass, mixed-mode predictor promotion, scheduler flag in both modes, predictor-feed gating.
  • 3 Rust tests on TrafficAccumulator including the weighted-by-isl-blocks property.
  • Full planner unit suite: 126 / 126 passing.
  • Full mocker suite: 216 / 216 passing.
  • pre-commit (ruff, isort, black, flake8, codespell) clean on changed files; cargo fmt clean.
  • End-to-end replay validation on shared-prefix-heavy synthetic workload (reviewer-suggested follow-up).
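
The weighted-by-isl-blocks property named in the test plan can be illustrated with a Python stand-in for the Rust TrafficAccumulator. This is a sketch under assumptions: the class name, the `drain_avg_kv_hit_rate` method, and returning `None` for an empty window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrafficAccumulatorSketch:
    total_overlap_blocks: int = 0
    total_isl_blocks: int = 0

    def on_admission(self, overlap_blocks: int, isl_blocks: int) -> None:
        """Record one admitted request's cached and total input blocks."""
        self.total_overlap_blocks += overlap_blocks
        self.total_isl_blocks += isl_blocks

    def drain_avg_kv_hit_rate(self) -> Optional[float]:
        """Return sum(overlap) / sum(isl_blocks) and reset the window."""
        if self.total_isl_blocks == 0:
            rate = None  # assumption: an empty window yields no observation
        else:
            rate = self.total_overlap_blocks / self.total_isl_blocks
        self.total_overlap_blocks = 0
        self.total_isl_blocks = 0
        return rate


acc = TrafficAccumulatorSketch()
acc.on_admission(overlap_blocks=90, isl_blocks=100)  # long, mostly cached
acc.on_admission(overlap_blocks=0, isl_blocks=10)    # short, uncached
weighted = acc.drain_avg_kv_hit_rate()
unweighted = (90 / 100 + 0 / 10) / 2
print(weighted, unweighted)
```

The block-weighted ratio is 90/110, whereas averaging the two per-request ratios would give 0.45; the long request dominates because it accounts for most of the prefill work.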

Out of scope

  • Per-worker hit rate (router histogram is aggregated; per-worker needs new labels).
  • Tier-aware discount (GPU vs host vs disk) — current metric conflates tiers.
  • Decode-side reuse (sum_decode_kv_tokens is already directly measured).
  • FPM-sourced hit rate (would require engine emitter schema extensions).
  • Predicting hit rate from the offline trace — explicitly skipped per design.

🤖 Generated with Claude Code



Summary by CodeRabbit

Release Notes

  • New Features

    • Integrated KV cache hit rate collection from metrics sources and added hit rate prediction capabilities. Updated load scaling models to incorporate hit rate data for improved load estimation and performance prediction accuracy.
  • Tests

    • Added comprehensive unit test coverage for KV cache hit rate integration across regression models and diagnostic tracking.

devin-ai-integration[bot]

This comment was marked as resolved.


coderabbitai Bot commented Apr 17, 2026

Walkthrough

This PR introduces KV cache hit-rate tracking across the planner system. Changes propagate hit-rate metrics from Prometheus through traffic observations, integrate them into performance model estimations via scaling factors, add KV hit-rate prediction capabilities, and extend offline replay and monitoring infrastructure to capture and report these metrics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core Data Structures: components/src/dynamo/planner/core/types.py, components/src/dynamo/planner/monitoring/traffic_metrics.py | Added kv_hit_rate fields to TrafficObservation and TickDiagnostics. Extended Metrics dataclass with kv_hit_rate field and added PrometheusAPIClient.get_avg_kv_hit_rate() method to query router KV hit rate from Prometheus. |
| Traffic & State Management: components/src/dynamo/planner/core/base.py, components/src/dynamo/planner/core/state_machine.py | Extended traffic metrics collection to fetch kv_hit_rate from Prometheus and propagate into TrafficObservation. Added _kv_hit_rate_predictor and _last_kv_hit_rate tracking to state machine; extended diagnostics with predicted_kv_hit_rate. |
| Performance Model Foundation: components/src/dynamo/planner/core/perf_model/base.py | Added _clamp_kv_hit_rate() helper function and _MAX_KV_HIT_RATE_DISCOUNT constant to normalize hit-rate values for downstream scaling operations. |
| Performance Model Implementations: components/src/dynamo/planner/core/perf_model/prefill.py, components/src/dynamo/planner/core/perf_model/agg.py | Updated estimate_next_ttft() methods to accept optional kv_hit_rate parameter and apply computed scaling factor 1.0 - clamped_hit_rate to reduce prefill token work. AggRegressionModel.find_best_engine_agg_rps() now adjusts effective ISL based on cache hit rate. |
| Scaling Decision Logic: components/src/dynamo/planner/core/load_scaling.py, components/src/dynamo/planner/core/throughput_scaling.py | Updated load and throughput scaling decisions to pass kv_hit_rate into regression estimators. Added KV hit-rate prediction and exception handling in throughput scaling; extended logging to show raw and effective ISL values. |
| Monitoring & Offline Replay: components/src/dynamo/planner/monitoring/diagnostics_recorder.py, components/src/dynamo/planner/offline/replay_adapter.py | Extended TickSnapshot to capture observed_kv_hit_rate and predicted_kv_hit_rate. Updated replay adapter to bridge avg_kv_hit_rate from trace data into planner observations. |
| Mocker Runtime Components: lib/mocker/src/replay/offline/components/types.rs, lib/mocker/src/replay/offline/components/router.rs | Added overlap_blocks and isl_blocks fields to WorkerAdmission. Introduced TrafficAccumulator.on_admission() method and updated drain() signature to compute and return avg_kv_hit_rate as total_overlap_blocks / total_isl_blocks. Router now captures admission metadata in AdmitOutcome. |
| Mocker Aggregated & Disaggregated Runtimes: lib/mocker/src/replay/offline/agg.rs, lib/mocker/src/replay/offline/disagg.rs, lib/mocker/src/replay/planner_handle.rs | Updated drain_traffic() return type from 4-tuple to 5-tuple to include avg_kv_hit_rate. Both runtimes now call traffic.on_admission() when dispatching admissions. PlannerReplayHandle delegates to runtime's drain method without change in logic. |
| Python Bindings: lib/bindings/python/rust/llm/replay.rs, lib/bindings/python/src/dynamo/prometheus_names.py | Updated PlannerReplayBridge.drain_traffic() return to include avg_kv_hit_rate. Added router.KV_HIT_RATE Prometheus metric name constant. |
| Test Coverage: components/src/dynamo/planner/tests/unit/test_load_based_scaling.py, components/src/dynamo/planner/tests/unit/test_prometheus.py, components/src/dynamo/planner/tests/unit/test_state_machine.py | Added comprehensive unit tests covering kv_hit_rate parameter handling in regression models (clamping, scaling behavior), Prometheus client method for both router and non-router sources, and end-to-end planner plumbing with prediction and diagnostics. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.77% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | PR description comprehensively covers overview, changes, discount math, tick cadence fixes, safety mechanisms, test coverage, and out-of-scope items. |
| Title check | ✅ Passed | The PR title clearly describes the main feature: KV reuse awareness in load and throughput scaling. It accurately summarizes the primary change across the changeset. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/src/dynamo/planner/core/perf_model/agg.py (1)

181-194: ⚠️ Potential issue | 🟠 Major

Inner estimate_next_ttft call drops kv_hit_rate — TTFT SLA check is inconsistent with the discount applied everywhere else.

find_best_engine_agg_rps builds effective_isl = isl * (1 - clamp(kv_hit_rate)) and uses it to scale prefill_per_iter, but then calls self.estimate_next_ttft(...) without forwarding kv_hit_rate. Inside that call, scale defaults to 1.0, so the hypothetical next request’s avg_isl contribution is added back at full cost. The net effect: the TTFT SLA gate in the bs sweep sees an undiscounted TTFT even when cache reuse is high, which can cause the sweep to break early and under-report achievable RPS — the exact behavior this PR is trying to avoid. The docstring on Lines 150-154 also explicitly says the prefill portion is discounted, so this is a behavioral/docstring mismatch.

🔧 Proposed fix
             est_ttft = self.estimate_next_ttft(
                 queued_prefill_tokens=int(prefill_per_iter),
                 max_num_batched_tokens=max_num_batched_tokens,
                 current_decode_kv=int(decode_kv),
+                kv_hit_rate=kv_hit_rate,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/perf_model/agg.py` around lines 181 - 194,
The TTFT calculation in find_best_engine_agg_rps is using a discounted
effective_isl for prefill_per_iter but calls self.estimate_next_ttft(...)
without forwarding kv_hit_rate (or equivalent scale), causing the TTFT gate to
use an undiscounted avg_isl; fix by passing the same kv_hit_rate (or compute
scale = 1.0 - clamp(kv_hit_rate)) into estimate_next_ttft so its scale parameter
reflects the cache hit discount used for prefill_per_iter, e.g., update the call
site in find_best_engine_agg_rps to include kv_hit_rate or scale so
estimate_next_ttft uses the same discounting logic.
components/src/dynamo/planner/core/throughput_scaling.py (1)

212-243: ⚠️ Potential issue | 🟠 Major

Verify asymmetry: agg throughput's TTFT estimate does not receive the kv_hit_rate discount.

In find_best_engine_agg_rps (agg.py line 188–193), the call to estimate_next_ttft omits the kv_hit_rate parameter entirely. This means the TTFT estimate computes scale = 1.0 (no discount), while prefill_per_iter was already discounted via effective_isl = isl * (1.0 - clamp(kv_hit_rate)). The prefill throughput path pre-discounts isl before calling find_best_engine_prefill_rps, applying the discount once in a single regression call. However:

  • Prefill path: Discount applied upfront → wall_time(discounted_isl)
  • Agg path: Discount applied to prefill_per_iter for _predict_2d call, but estimate_next_ttft receives no discount and treats aggregate as full-scale
  • Result: Different regression kernels (_predict_2d vs. _predict_wall_time, 2D vs. 1D) receive inconsistent inputs under cache hit conditions, yielding mismatched TTFT and ITL estimates

Align by passing kv_hit_rate to the estimate_next_ttft call in the agg loop, or document why the TTFT estimate intentionally omits the discount.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/throughput_scaling.py` around lines 212 -
243, The agg path omits the kv_hit_rate when calling estimate_next_ttft in
find_best_engine_agg_rps, causing TTFT to be computed without the same KV-hit
discount applied elsewhere; fix by passing the kv_hit_rate (after clamping with
_clamp_kv_hit_rate) into estimate_next_ttft so the same effective ISL/discount
used for prefill_per_iter and _predict_2d is also applied to the
_predict_wall_time TTFT estimate (adjust inputs to estimate_next_ttft in
find_best_engine_agg_rps to include kv_hit_rate), or add a clear comment in
find_best_engine_agg_rps explaining intentional asymmetry if that was
deliberate.
🧹 Nitpick comments (4)
components/src/dynamo/planner/core/base.py (1)

465-469: Minor: kv_hit_rate may be NaN, not just None.

The is None guard on Line 465 only handles None; if Prometheus returns NaN (e.g., empty series converted to NaN by the client), the log will render "nan" and downstream consumers rely on _clamp_kv_hit_rate to normalize it. Either NaN-guard here for consistency with _clamp_kv_hit_rate, or leave as-is and rely on the clamp — just make sure it’s intentional.

-        hit_rate_str = f"{m.kv_hit_rate:.3f}" if m.kv_hit_rate is not None else "n/a"
+        hit_rate_str = (
+            f"{m.kv_hit_rate:.3f}"
+            if m.kv_hit_rate is not None and not math.isnan(m.kv_hit_rate)
+            else "n/a"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/base.py` around lines 465 - 469, The log
currently treats m.kv_hit_rate only for None and will print "nan" if
m.kv_hit_rate is NaN; update the logger formatting around m.kv_hit_rate in the
logger.info block (the hit_rate_str assignment used above the logger) to guard
against NaN the same way _clamp_kv_hit_rate does (e.g., treat NaN like None and
render "n/a" or the normalized value), so that hit_rate_str is never the literal
"nan" when passed into logger.info.
components/src/dynamo/planner/monitoring/diagnostics_recorder.py (1)

58-66: LGTM.

Observed vs predicted hit-rate are surfaced symmetrically; both default to None so existing snapshots/HTML reports handle missing data gracefully. Consider adding a subplot for them in _build_report_html in a follow-up so the new signal is actually visible in the generated report.

Also applies to: 169-179

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/monitoring/diagnostics_recorder.py` around
lines 58 - 66, Add a new subplot in the _build_report_html function to visualize
observed_kv_hit_rate alongside predicted_kv_hit_rate so the new signal is
visible in generated reports; locate the _build_report_html method and add a
plot/trace that reads DiagnosticsRecorder.observed_kv_hit_rate and
DiagnosticsRecorder.predicted_kv_hit_rate (or the local variables named
observed_kv_hit_rate / predicted_kv_hit_rate) with appropriate labels/legend and
handle None values gracefully (skip or show gaps) to match existing plotting
patterns used for the other predicted_* vs observed_* signals.
lib/mocker/src/replay/offline/components/router.rs (1)

487-565: Clean refactor; AdmitOutcome threads the two overlap counters through both admit paths consistently.

A couple of small observations (non-blocking):

  • Line 516-517: u32::try_from(...).unwrap_or(u32::MAX) silently saturates. For realistic ISL this can't trigger, but if it ever did, downstream ratios would skew. A debug_assert! (or tracing::warn! with the ISL) would make the saturation observable without affecting release behavior.
  • Line 265-267: request.uuid.expect(...) is redundant with the uuid extraction already performed inside build_pending_request (line 438-440). You could pull uuid from pending.uuid before admit_request consumes it and drop the panic path entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/mocker/src/replay/offline/components/router.rs` around lines 487 - 565,
The admit_request path currently silently saturates isl_blocks with
u32::try_from(...).unwrap_or(u32::MAX); add a debug_assert!(isl_blocks !=
u32::MAX, "saturated isl_blocks: {}", request.isl_tokens) or emit a
tracing::warn! with the original ISL value right after computing isl_blocks to
surface unexpected saturation in debug/tracing builds; and in drain_pending,
avoid the redundant expect by extracting uuid from the QueueEntry before calling
admit_request (use the uuid variable from the popped QueueEntry instead of
relying on request.uuid inside admit_request) so you don't need the panic path
in build_pending_request and to keep admission consumption semantics correct.
components/src/dynamo/planner/core/throughput_scaling.py (1)

84-99: Broad except Exception without re-raise.

Per the repo's Python guidelines ("Prefer failing fast … if catching Exception, log then re-raise"), and per Ruff BLE001. The cold-start-fallback intent is clear from the docstring, but swallowing any exception here will mask genuine predictor bugs (shape mismatches, serialization errors) as "no prediction available." If a None-on-cold-start contract is really what you want, consider catching the predictor's specific "not enough data" exception type instead of bare Exception.

As per coding guidelines: "Prefer failing fast (avoid broad excepts; if catching Exception, log then re-raise)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/throughput_scaling.py` around lines 84 -
99, The method _predict_kv_hit_rate currently swallows all exceptions from
self._kv_hit_rate_predictor.predict_next; change this to only treat the
predictor's "not ready" exception as a benign cold-start (e.g., catch the
predictor-specific exception such as PredictorNotReady / NotEnoughDataError
thrown by the predictor and in that branch set self._diag_predicted_kv_hit_rate
= None and return None), and for any other Exception from predict_next log the
error and re-raise so real bugs surface; update the except block around
self._kv_hit_rate_predictor.predict_next accordingly and reference
_diag_predicted_kv_hit_rate and _predict_kv_hit_rate when implementing the
narrower handling.
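
The narrower handling suggested in this nitpick could look roughly like the sketch below. `NotEnoughDataError` is a hypothetical exception type (the review itself only offers it as an example name), and the stub predictors exist purely to exercise both branches.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class NotEnoughDataError(Exception):
    """Hypothetical cold-start signal raised by a predictor."""


def predict_kv_hit_rate(predictor) -> Optional[float]:
    """Return the forecast, None on cold start, and re-raise real bugs."""
    try:
        return predictor.predict_next()
    except NotEnoughDataError:
        # Benign: the predictor has not seen enough points yet.
        return None
    except Exception:
        # Anything else is a genuine failure: log it, then fail fast.
        logger.exception("kv_hit_rate prediction failed")
        raise


class ColdPredictor:
    def predict_next(self) -> float:
        raise NotEnoughDataError("warming up")


class ReadyPredictor:
    def predict_next(self) -> float:
        return 0.42


print(predict_kv_hit_rate(ColdPredictor()), predict_kv_hit_rate(ReadyPredictor()))
```

Only the cold-start case maps to "no prediction available"; shape or serialization errors propagate to the caller instead of being reported as a missing forecast.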
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/planner/core/state_machine.py`:
- Around line 304-306: The code only updates _last_kv_hit_rate and calls
_kv_hit_rate_predictor.add_data_point inside _observe_traffic(), which on_tick()
only invokes in the throughput-scaling branch, so load-only deployments never
learn kv_hit_rate; to fix, move or duplicate the KV-hit observation so it runs
on load ticks when SLA load scaling is enabled: call the same update logic that
sets self._last_kv_hit_rate and calls
self._kv_hit_rate_predictor.add_data_point(traffic.kv_hit_rate) from on_tick()’s
load-scaling path (or extract it into a helper method e.g.,
_record_kv_hit_rate(traffic) and invoke it from both _observe_traffic() and the
load-scaling branch), ensuring you check for traffic.kv_hit_rate is not None and
not math.isnan before updating; also ensure load_scaling.py’s TTFT discount
calculation reads self._last_kv_hit_rate after this change so it no longer
remains None.

In `@components/src/dynamo/planner/monitoring/traffic_metrics.py`:
- Around line 305-321: get_avg_kv_hit_rate currently lets _get_average_metric
collapse “no data / query failed” into 0.0 which downstream
(PlannerStateMachine._observe_traffic) treats as a valid observation; change the
code so missing router data is preserved as None. Update _get_average_metric to
return Optional[float] and avoid coercing empty/no-result Prometheus responses
to 0.0 (return None instead), then have get_avg_kv_hit_rate call
_get_average_metric and directly return its Optional[float] result (no
defaulting to 0.0); this preserves tri-state semantics (real float or None) so
PlannerStateMachine._observe_traffic can distinguish missing data.

In `@lib/bindings/python/src/dynamo/prometheus_names.py`:
- Around line 304-305: The KV_HIT_RATE metric name is inconsistent between
Python (KV_HIT_RATE = "router_kv_hit_rate" in prometheus_names.py) and Rust
(KV_HIT_RATE = "kv_hit_rate" in lib/runtime/src/metrics/prometheus_names.rs);
decide which source is authoritative and make them match: either move the Rust
constant into the router module and rename it to "router_kv_hit_rate" (so the
router module follows the "router_" prefix pattern), or change the Python
constant value to "kv_hit_rate" and then regenerate the Python bindings; after
updating the authoritative source (modify the KV_HIT_RATE definition in
prometheus_names.rs or the Python constant), run the generator to refresh
lib/bindings/python/src/dynamo/prometheus_names.py so both sides use the
identical KV_HIT_RATE value.
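
The tri-state contract requested for get_avg_kv_hit_rate in the inline comments above (a real float, or None when there is no data) can be sketched as follows. Both helper names are hypothetical, and the list-based history merely stands in for the predictor feed.

```python
from typing import List, Optional, Sequence


def average_or_none(samples: Sequence[float]) -> Optional[float]:
    """Average the samples, or return None when there is nothing to average."""
    if not samples:
        return None  # "no data" stays distinguishable from a real 0.0
    return sum(samples) / len(samples)


def record_hit_rate(history: List[float], value: Optional[float]) -> None:
    """Feed only real observations into the (hypothetical) predictor history."""
    if value is not None:
        history.append(value)


history: List[float] = []
record_hit_rate(history, average_or_none([]))          # missing data: skipped
record_hit_rate(history, average_or_none([0.0, 0.0]))  # genuine zero: kept
print(history)
```

Coercing the empty case to 0.0 would make the two calls above indistinguishable, which is the failure mode the comment describes.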


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01428679-eef3-47aa-9514-05dfb563627a

📥 Commits

Reviewing files that changed from the base of the PR and between 5b03a59 and c88c209.

📒 Files selected for processing (21)
  • components/src/dynamo/planner/core/base.py
  • components/src/dynamo/planner/core/load_scaling.py
  • components/src/dynamo/planner/core/perf_model/agg.py
  • components/src/dynamo/planner/core/perf_model/base.py
  • components/src/dynamo/planner/core/perf_model/prefill.py
  • components/src/dynamo/planner/core/state_machine.py
  • components/src/dynamo/planner/core/throughput_scaling.py
  • components/src/dynamo/planner/core/types.py
  • components/src/dynamo/planner/monitoring/diagnostics_recorder.py
  • components/src/dynamo/planner/monitoring/traffic_metrics.py
  • components/src/dynamo/planner/offline/replay_adapter.py
  • components/src/dynamo/planner/tests/unit/test_load_based_scaling.py
  • components/src/dynamo/planner/tests/unit/test_prometheus.py
  • components/src/dynamo/planner/tests/unit/test_state_machine.py
  • lib/bindings/python/rust/llm/replay.rs
  • lib/bindings/python/src/dynamo/prometheus_names.py
  • lib/mocker/src/replay/offline/agg.rs
  • lib/mocker/src/replay/offline/components/router.rs
  • lib/mocker/src/replay/offline/components/types.rs
  • lib/mocker/src/replay/offline/disagg.rs
  • lib/mocker/src/replay/planner_handle.rs

Comment thread components/src/dynamo/planner/core/state_machine.py Outdated
Comment thread components/src/dynamo/planner/monitoring/traffic_metrics.py Outdated
Comment thread lib/bindings/python/src/dynamo/prometheus_names.py
The KV router already publishes `dynamo_component_router_kv_hit_rate`
(predicted prefix-cache hit rate at routing time), but the planner
ignored it -- so scaling decisions over-counted prefill compute work
on reuse-heavy workloads and scaled prefill up unnecessarily. This
threads the signal through both scaling paths.

Load planner: `estimate_next_ttft` in the prefill and agg regression
models takes an optional `kv_hit_rate` and scales `(queued + avg_isl)`
by `(1 - clamp(hit_rate))`. The regression's x-feature (per-iter chunk
size) is untouched, so the fit stays valid -- only the simulation
aggregate is discounted. `load_scaling` passes the sticky
`state._last_kv_hit_rate` into both call sites.

Throughput planner: a fourth load predictor (`_kv_hit_rate_predictor`)
is added alongside num_req/isl/osl. Per requirement it is NOT warmed
from the mooncake trace (no good offline proxy). At prediction time
the predicted hit rate discounts only the prefill portion -- decode
KV residency uses the raw ISL because cache hits reduce prefill
compute but not the decode-time KV footprint. `find_best_engine_agg_rps`
gains a `kv_hit_rate` param that applies the discount to its internal
prefill-per-iter simulation.

Replay: mocker's `TrafficAccumulator` now tracks overlap + ISL blocks
via `on_admission(overlap_blocks, isl_blocks)` called from the
prefill-router dispatch path (disagg) and the router admission path
(agg). `drain_traffic` returns a 5-tuple including `avg_kv_hit_rate`
as `Σoverlap / Σisl_blocks`, matching the real router's per-request
histogram semantics. Python binding and `replay_adapter` thread the
field into `TrafficObservation` so offline replay can exercise the
same discount logic end-to-end.

Safety: `_clamp_kv_hit_rate` caps at 0.95 (a stale 1.0 reading would
otherwise zero out queued work), and treats None/NaN as 0.0 (no
discount, preserving prior behavior exactly). Diagnostics expose the
observed `kv_hit_rate` and the predicted value on each tick.

Tests: 10 new Python tests covering regression discount, clamping,
None/NaN fallback, warmup exclusion, diagnostics propagation, and
end-to-end throughput-scaling demand reduction; 3 Rust tests on
`TrafficAccumulator` including the weighted-by-isl-blocks property.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk force-pushed the hzhou/planner-kv-reuse-awareness branch from c88c209 to 518a702 Compare April 17, 2026 20:31

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 2 new potential issues.


Comment thread components/src/dynamo/planner/core/perf_model/agg.py
Comment thread components/src/dynamo/planner/core/perf_model/agg.py
… for load scaling

Plug two cadence gaps in the original KV-reuse-aware planner:

1. **Load-only deployments now scrape kv_hit_rate.** Previously the
   Prometheus scrape was tied to the throughput tick, so deployments
   with `enable_throughput_scaling=False` silently saw `_last_kv_hit_rate
   = None` forever and applied no discount. The scheduler now requests
   a kv-hit-rate-only scrape on each load tick (window =
   load_adjustment_interval) when throughput scaling is disabled.

2. **Mixed-mode load scaling now reads the predicted value.** When
   throughput scaling is enabled, the kv_hit_rate predictor produces a
   smoothed forecast on each throughput tick. `_advance_throughput`
   now promotes that predicted value to `_last_kv_hit_rate`, so all
   subsequent load ticks (between throughput ticks) consume the
   forecast rather than the raw last-window observation.

Implementation:

- `state_machine._next_scheduled_tick`: in load-only mode, set
  `need_traffic_metrics=True` and `traffic_metrics_duration_s =
  load_adjustment_interval` on load ticks.
- `state_machine._observe_traffic`: gate num_req/isl/osl predictor feeds
  on `enable_throughput_scaling` (avoid polluting unused predictors with
  the load-only path's placeholder zeros). Only write directly to
  `_last_kv_hit_rate` in load-only mode; in mixed mode feed the
  predictor and let the throughput tick do the promotion.
- `state_machine.on_tick`: add a parallel branch under
  `run_load_scaling` that calls `_observe_traffic` on pure load ticks
  (skipped on mixed-mode ticks, where the throughput branch has already
  consumed the observation).
- `throughput_scaling._advance_throughput`: after `_predict_kv_hit_rate`,
  store the predicted value in `_last_kv_hit_rate` for cross-cadence
  consumption.
- `base.py`: add `_collect_kv_hit_rate_observation(duration_s)` — a
  cheap one-query scrape used in load-only mode to avoid issuing the
  six unused traffic queries per load tick. Dispatch in
  `_gather_tick_input` based on `tick.run_throughput_scaling`.
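
The two cadences above can be sketched roughly as below. This is a simplified model under stated assumptions, not the state machine itself: the class `KvHitRateCadenceSketch` and its method names are hypothetical, the predictor is replaced by a plain moving average, and ignoring `None` observations is an assumption about how scrape gaps are handled.

```python
from collections import deque
from typing import Deque, Optional


class KvHitRateCadenceSketch:
    """Toy model of who writes the value that load ticks consume."""

    def __init__(self, enable_throughput_scaling: bool, window: int = 4) -> None:
        self.enable_throughput_scaling = enable_throughput_scaling
        self.last_kv_hit_rate: Optional[float] = None
        # Stand-in for the real predictor: moving average (assumption).
        self._history: Deque[float] = deque(maxlen=window)

    def observe_traffic(self, kv_hit_rate: Optional[float]) -> None:
        if kv_hit_rate is None:
            return  # assumption: a scrape gap leaves state untouched
        if self.enable_throughput_scaling:
            # Mixed mode: only feed the predictor; promotion happens later.
            self._history.append(kv_hit_rate)
        else:
            # Load-only mode: the observation is used directly.
            self.last_kv_hit_rate = kv_hit_rate

    def advance_throughput(self) -> None:
        # Throughput tick: promote the forecast for subsequent load ticks.
        if self._history:
            self.last_kv_hit_rate = sum(self._history) / len(self._history)
```

In load-only mode the value changes on every load tick; in mixed mode it changes only when a throughput tick promotes the forecast.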

Replay needs no code change: `TrafficStats.avg_kv_hit_rate` already
exists, and the scheduler change drains the mocker's TrafficAccumulator
at load cadence in load-only mode automatically.

Tests: 7 new state-machine tests covering load-only direct-pass,
mixed-mode predictor promotion, scheduler flag in both modes, and
predictor-feed gating. Updated the existing `test_load_only` initial
tick test to reflect the new `need_traffic_metrics=True` behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk changed the title feat(planner): KV reuse awareness in load + throughput scaling feat(planner/replay): KV reuse awareness in load + throughput scaling Apr 17, 2026
Three issues from CodeRabbit / Devin reviews on PR #8314:

1. **prometheus_names: register router KV_HIT_RATE in Rust source.**
   The Python constant `router.KV_HIT_RATE = "router_kv_hit_rate"` was
   added directly to the auto-generated file with no matching Rust
   entry, so the next codegen run would clobber it. Add
   `pub const KV_HIT_RATE` to the `router` module in
   `lib/runtime/src/metrics/prometheus_names.rs` and regenerate the
   Python module via `dynamo-codegen gen-python-prometheus-names`.

2. **traffic_metrics: preserve None for missing kv_hit_rate.**
   Routing through `_get_average_metric` collapsed Prometheus scrape
   gaps / NaN into a real `0.0`, which downstream is treated as a
   valid observation and would drag the predictor / sticky value down
   to zero on every failed scrape. Inline a custom query in
   `get_avg_kv_hit_rate` that returns `None` on empty/NaN/error so
   `_clamp_kv_hit_rate(None)` falls back to no-discount behavior.
   `_get_average_metric` is left unchanged (other metrics intentionally
   prefer 0 on missing data).

3. **agg.find_best_engine_agg_rps: uniform kv_hit_rate discount in TTFT
   estimate.** The batch-size sweep computed `prefill_per_iter` with
   `effective_isl` (already discounted) and then passed it to
   `estimate_next_ttft` *without* forwarding `kv_hit_rate`. Inside
   `estimate_next_ttft` the discount factor defaulted to 1.0 (no
   discount), so the
   `avg_isl` term (the hypothetical next request's ISL from the
   regression model) was *not* discounted, inflating the predicted
   TTFT and over-provisioning replicas at high hit rate. Fix by
   passing the *raw* `prefill_per_iter = bs * isl / osl` together with
   `kv_hit_rate`, letting the function apply the discount uniformly to
   both the queued portion and `avg_isl`. Also switch the
   `_prefill_balanced` constraint to use `effective_isl` so the
   prefill admission rate reflects the cache discount and the
   batch-size cap widens correspondingly.
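
The None-preserving behavior from item 2 can be sketched like this. It is an illustration only: `avg_kv_hit_rate_or_none` and the `run_query` callable are hypothetical stand-ins for the Prometheus client call, which is not shown in this PR description.

```python
import math
from typing import Callable, Optional, Sequence


def avg_kv_hit_rate_or_none(
    run_query: Callable[[], Sequence[float]],
) -> Optional[float]:
    """Return the scraped hit rate, or None on empty/NaN/error.

    `run_query` is a hypothetical callable returning the query's sample
    values; a real 0.0 is preserved as a valid observation.
    """
    try:
        values = run_query()
    except Exception:
        return None  # query failed: report "no observation"
    if not values:
        return None  # scrape gap: no samples in the window
    value = float(values[0])
    return None if math.isnan(value) else value
```

Keeping `None` distinct from `0.0` lets the caller tell "nothing was cached" apart from "nothing was measured", so a failed scrape does not pull the predictor toward zero.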

Tests: new regression test
`test_agg_find_best_engine_rps_uniform_discount_in_ttft_estimate`
asserts that `engine_rps` strictly grows when `kv_hit_rate` jumps from
0.0 to 0.8 under permissive SLAs (the bug would have left it
suppressed by the still-full `avg_isl` term).
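
The mismatch this test guards against can be illustrated with the token aggregate alone. The sketch below is an assumption-laden simplification: the function names are hypothetical, the regression model is omitted, and the input numbers are made up for illustration.

```python
def ttft_tokens_pre_fix(
    bs: float, isl: float, osl: float, avg_isl: float, hit_rate: float
) -> float:
    """Pre-fix shape: queued part discounted, avg_isl left at full size."""
    effective_isl = isl * (1.0 - hit_rate)
    prefill_per_iter = bs * effective_isl / osl
    return prefill_per_iter + avg_isl


def ttft_tokens_post_fix(
    bs: float, isl: float, osl: float, avg_isl: float, hit_rate: float
) -> float:
    """Post-fix shape: raw prefill_per_iter, discount applied to both terms."""
    prefill_per_iter = bs * isl / osl
    return (prefill_per_iter + avg_isl) * (1.0 - hit_rate)


# Illustrative inputs only (not taken from the PR).
pre = ttft_tokens_pre_fix(bs=8, isl=4000, osl=500, avg_isl=4000, hit_rate=0.8)
post = ttft_tokens_post_fix(bs=8, isl=4000, osl=500, avg_isl=4000, hit_rate=0.8)
```

With these inputs the pre-fix aggregate stays near 4012.8 tokens while the post-fix aggregate drops to about 812.8, which is why the still-full `avg_isl` term kept the predicted TTFT high.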

127 / 127 planner unit tests pass; 216 / 216 mocker tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Comment thread components/src/dynamo/planner/monitoring/traffic_metrics.py
Per @jthomson04's review on PR #8314: the mocker's TrafficAccumulator
was computing avg_kv_hit_rate as a block-weighted aggregate ratio
``Σoverlap_blocks / Σisl_blocks``, which gives larger requests more
weight than smaller ones. The real router's Prometheus histogram
observes one ``overlap/isl`` sample per request (see
RequestTracker::kv_hit_rate in timing.rs:308-314), and the planner's
PromQL ``sum(increase(_sum)) / sum(increase(_count))`` returns the
arithmetic mean of those per-request samples. The two diverge when
requests have variable ISL:

  Example: request A (4 ISL blocks, 3 overlap) + request B (12 ISL
  blocks, 0 overlap).
    Real router Prometheus: (0.75 + 0.0) / 2 = 0.375
    Old mocker:             3 / 16       = 0.1875

Switch TrafficAccumulator to track a running sum of per-request
``overlap / isl`` ratios and a sample count; drain returns
``total_hit_rate / hit_rate_count``. Admissions with ``isl_blocks == 0``
are skipped (no meaningful ratio), matching
``RequestTracker::kv_hit_rate`` returning ``None`` in that case.
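
The per-request-mean accumulation can be sketched in Python as below. The real ``TrafficAccumulator`` is Rust; the class name ``TrafficAccumulatorSketch``, the method ``drain_avg_kv_hit_rate``, and returning ``None`` when no samples were recorded are assumptions made for illustration.

```python
from typing import Optional


class TrafficAccumulatorSketch:
    """Tracks the arithmetic mean of per-request overlap/isl ratios."""

    def __init__(self) -> None:
        self.total_hit_rate = 0.0
        self.hit_rate_count = 0

    def on_admission(self, overlap_blocks: int, isl_blocks: int) -> None:
        if isl_blocks == 0:
            return  # no meaningful ratio, so the sample is skipped
        self.total_hit_rate += overlap_blocks / isl_blocks
        self.hit_rate_count += 1

    def drain_avg_kv_hit_rate(self) -> Optional[float]:
        if self.hit_rate_count == 0:
            return None  # assumption: empty window yields no observation
        avg = self.total_hit_rate / self.hit_rate_count
        # Reset so the next drain covers only the next window.
        self.total_hit_rate = 0.0
        self.hit_rate_count = 0
        return avg


acc = TrafficAccumulatorSketch()
acc.on_admission(3, 4)    # request A: ratio 0.75
acc.on_admission(0, 12)   # request B: ratio 0.0
acc.on_admission(5, 0)    # skipped: zero ISL blocks
window_avg = acc.drain_avg_kv_hit_rate()
```

For requests A and B from the example, ``window_avg`` is 0.375, whereas the block-weighted ratio 3 / 16 would have given 0.1875.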

Updates:
- ``TrafficAccumulator``: replace ``total_overlap_blocks: u64``,
  ``total_isl_blocks: u64`` with ``total_hit_rate: f64``,
  ``hit_rate_count: usize``. ``on_admission`` now computes the
  per-request ratio up front.
- ``TrafficStats`` and adapter-layer docstrings (planner_handle.rs,
  bindings/replay.rs) updated to describe the per-request-mean
  semantics.
- Rename the Rust regression test
  ``traffic_accumulator_hit_rate_is_weighted_by_isl_blocks`` →
  ``traffic_accumulator_hit_rate_is_mean_of_per_request_ratios``,
  flip its expected value (0.375 instead of 0.1875), and add a new
  ``traffic_accumulator_skips_admissions_with_zero_isl_blocks`` test.

Tests: 217 / 217 mocker tests pass (216 + 1 new); 127 / 127 planner
tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk merged commit c388483 into main Apr 20, 2026
90 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/planner-kv-reuse-awareness branch April 20, 2026 17:24

3 participants