feat(planner/replay): KV reuse awareness in load + throughput scaling#8314
Conversation
WalkthroughThis PR introduces KV cache hit-rate tracking across the planner system. Changes propagate hit-rate metrics from Prometheus through traffic observations, integrate them into performance model estimations via scaling factors, add KV hit-rate prediction capabilities, and extend offline replay and monitoring infrastructure to capture and report these metrics. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/src/dynamo/planner/core/perf_model/agg.py (1)
181-194:⚠️ Potential issue | 🟠 MajorInner
estimate_next_ttftcall dropskv_hit_rate— TTFT SLA check is inconsistent with the discount applied everywhere else.
find_best_engine_agg_rpsbuildseffective_isl = isl * (1 - clamp(kv_hit_rate))and uses it to scaleprefill_per_iter, but then callsself.estimate_next_ttft(...)without forwardingkv_hit_rate. Inside that call,scaledefaults to1.0, so the hypothetical next request’savg_islcontribution is added back at full cost. The net effect: the TTFT SLA gate in the bs sweep sees an undiscounted TTFT even when cache reuse is high, which can cause the sweep to break early and under-report achievable RPS — the exact behavior this PR is trying to avoid. The docstring on Lines 150-154 also explicitly says the prefill portion is discounted, so this is a behavioral/docstring mismatch.🔧 Proposed fix
est_ttft = self.estimate_next_ttft( queued_prefill_tokens=int(prefill_per_iter), max_num_batched_tokens=max_num_batched_tokens, current_decode_kv=int(decode_kv), + kv_hit_rate=kv_hit_rate, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/core/perf_model/agg.py` around lines 181 - 194, The TTFT calculation in find_best_engine_agg_rps is using a discounted effective_isl for prefill_per_iter but calls self.estimate_next_ttft(...) without forwarding kv_hit_rate (or equivalent scale), causing the TTFT gate to use an undiscounted avg_isl; fix by passing the same kv_hit_rate (or compute scale = 1.0 - clamp(kv_hit_rate)) into estimate_next_ttft so its scale parameter reflects the cache hit discount used for prefill_per_iter, e.g., update the call site in find_best_engine_agg_rps to include kv_hit_rate or scale so estimate_next_ttft uses the same discounting logic.components/src/dynamo/planner/core/throughput_scaling.py (1)
212-243:⚠️ Potential issue | 🟠 MajorVerify asymmetry: agg throughput's TTFT estimate does not receive the kv_hit_rate discount.
In
find_best_engine_agg_rps(agg.py line 188–193), the call toestimate_next_ttftomits thekv_hit_rateparameter entirely. This means the TTFT estimate computesscale = 1.0(no discount), whileprefill_per_iterwas already discounted viaeffective_isl = isl * (1.0 - clamp(kv_hit_rate)). The prefill throughput path pre-discounts isl before callingfind_best_engine_prefill_rps, applying the discount once in a single regression call. However:
- Prefill path: Discount applied upfront →
wall_time(discounted_isl)- Agg path: Discount applied to
prefill_per_iterfor_predict_2dcall, butestimate_next_ttftreceives no discount and treats aggregate as full-scale- Result: Different regression kernels (_predict_2d vs. _predict_wall_time, 2D vs. 1D) receive inconsistent inputs under cache hit conditions, yielding mismatched TTFT and ITL estimates
Align by passing
kv_hit_rateto theestimate_next_ttftcall in the agg loop, or document why the TTFT estimate intentionally omits the discount.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/core/throughput_scaling.py` around lines 212 - 243, The agg path omits the kv_hit_rate when calling estimate_next_ttft in find_best_engine_agg_rps, causing TTFT to be computed without the same KV-hit discount applied elsewhere; fix by passing the kv_hit_rate (after clamping with _clamp_kv_hit_rate) into estimate_next_ttft so the same effective ISL/discount used for prefill_per_iter and _predict_2d is also applied to the _predict_wall_time TTFT estimate (adjust inputs to estimate_next_ttft in find_best_engine_agg_rps to include kv_hit_rate), or add a clear comment in find_best_engine_agg_rps explaining intentional asymmetry if that was deliberate.
🧹 Nitpick comments (4)
components/src/dynamo/planner/core/base.py (1)
465-469: Minor:kv_hit_ratemay be NaN, not justNone.The
is Noneguard on Line 465 only handlesNone; if Prometheus returns NaN (e.g., empty series converted to NaN by the client), the log will render"nan"and downstream consumers rely on_clamp_kv_hit_rateto normalize it. Either NaN-guard here for consistency with_clamp_kv_hit_rate, or leave as-is and rely on the clamp — just make sure it’s intentional.- hit_rate_str = f"{m.kv_hit_rate:.3f}" if m.kv_hit_rate is not None else "n/a" + hit_rate_str = ( + f"{m.kv_hit_rate:.3f}" + if m.kv_hit_rate is not None and not math.isnan(m.kv_hit_rate) + else "n/a" + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/core/base.py` around lines 465 - 469, The log currently treats m.kv_hit_rate only for None and will print "nan" if m.kv_hit_rate is NaN; update the logger formatting around m.kv_hit_rate in the logger.info block (the hit_rate_str assignment used above the logger) to guard against NaN the same way _clamp_kv_hit_rate does (e.g., treat NaN like None and render "n/a" or the normalized value), so that hit_rate_str is never the literal "nan" when passed into logger.info.components/src/dynamo/planner/monitoring/diagnostics_recorder.py (1)
58-66: LGTM.Observed vs predicted hit-rate are surfaced symmetrically; both default to
Noneso existing snapshots/HTML reports handle missing data gracefully. Consider adding a subplot for them in_build_report_htmlin a follow-up so the new signal is actually visible in the generated report.Also applies to: 169-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/monitoring/diagnostics_recorder.py` around lines 58 - 66, Add a new subplot in the _build_report_html function to visualize observed_kv_hit_rate alongside predicted_kv_hit_rate so the new signal is visible in generated reports; locate the _build_report_html method and add a plot/trace that reads DiagnosticsRecorder.observed_kv_hit_rate and DiagnosticsRecorder.predicted_kv_hit_rate (or the local variables named observed_kv_hit_rate / predicted_kv_hit_rate) with appropriate labels/legend and handle None values gracefully (skip or show gaps) to match existing plotting patterns used for the other predicted_* vs observed_* signals.lib/mocker/src/replay/offline/components/router.rs (1)
487-565: Clean refactor;AdmitOutcomethreads the two overlap counters through both admit paths consistently.A couple of small observations (non-blocking):
- Line 516-517:
u32::try_from(...).unwrap_or(u32::MAX)silently saturates. For realistic ISL this can't trigger, but if it ever did, downstream ratios would skew. Adebug_assert!(ortracing::warn!with the ISL) would make the saturation observable without affecting release behavior.- Line 265-267:
request.uuid.expect(...)is redundant with theuuidextraction already performed insidebuild_pending_request(line 438-440). You could pulluuidfrompending.uuidbeforeadmit_requestconsumes it and drop the panic path entirely.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/mocker/src/replay/offline/components/router.rs` around lines 487 - 565, The admit_request path currently silently saturates isl_blocks with u32::try_from(...).unwrap_or(u32::MAX); add a debug_assert!(isl_blocks != u32::MAX, "saturated isl_blocks: {}", request.isl_tokens) or emit a tracing::warn! with the original ISL value right after computing isl_blocks to surface unexpected saturation in debug/tracing builds; and in drain_pending, avoid the redundant expect by extracting uuid from the QueueEntry before calling admit_request (use the uuid variable from the popped QueueEntry instead of relying on request.uuid inside admit_request) so you don't need the panic path in build_pending_request and to keep admission consumption semantics correct.components/src/dynamo/planner/core/throughput_scaling.py (1)
84-99: Broadexcept Exceptionwithout re-raise.Per the repo's Python guidelines ("Prefer failing fast … if catching Exception, log then re-raise"), and per Ruff BLE001. The cold-start-fallback intent is clear from the docstring, but swallowing any exception here will mask genuine predictor bugs (shape mismatches, serialization errors) as "no prediction available." If a None-on-cold-start contract is really what you want, consider catching the predictor's specific "not enough data" exception type instead of bare
Exception.As per coding guidelines: "Prefer failing fast (avoid broad excepts; if catching Exception, log then re-raise)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/core/throughput_scaling.py` around lines 84 - 99, The method _predict_kv_hit_rate currently swallows all exceptions from self._kv_hit_rate_predictor.predict_next; change this to only treat the predictor's "not ready" exception as a benign cold-start (e.g., catch the predictor-specific exception such as PredictorNotReady / NotEnoughDataError thrown by the predictor and in that branch set self._diag_predicted_kv_hit_rate = None and return None), and for any other Exception from predict_next log the error and re-raise so real bugs surface; update the except block around self._kv_hit_rate_predictor.predict_next accordingly and reference _diag_predicted_kv_hit_rate and _predict_kv_hit_rate when implementing the narrower handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/src/dynamo/planner/core/state_machine.py`:
- Around line 304-306: The code only updates _last_kv_hit_rate and calls
_kv_hit_rate_predictor.add_data_point inside _observe_traffic(), which on_tick()
only invokes in the throughput-scaling branch, so load-only deployments never
learn kv_hit_rate; to fix, move or duplicate the KV-hit observation so it runs
on load ticks when SLA load scaling is enabled: call the same update logic that
sets self._last_kv_hit_rate and calls
self._kv_hit_rate_predictor.add_data_point(traffic.kv_hit_rate) from on_tick()’s
load-scaling path (or extract it into a helper method e.g.,
_record_kv_hit_rate(traffic) and invoke it from both _observe_traffic() and the
load-scaling branch), ensuring you check for traffic.kv_hit_rate is not None and
not math.isnan before updating; also ensure load_scaling.py’s TTFT discount
calculation reads self._last_kv_hit_rate after this change so it no longer
remains None.
In `@components/src/dynamo/planner/monitoring/traffic_metrics.py`:
- Around line 305-321: get_avg_kv_hit_rate currently lets _get_average_metric
collapse “no data / query failed” into 0.0 which downstream
(PlannerStateMachine._observe_traffic) treats as a valid observation; change the
code so missing router data is preserved as None. Update _get_average_metric to
return Optional[float] and avoid coercing empty/no-result Prometheus responses
to 0.0 (return None instead), then have get_avg_kv_hit_rate call
_get_average_metric and directly return its Optional[float] result (no
defaulting to 0.0); this preserves tri-state semantics (real float or None) so
PlannerStateMachine._observe_traffic can distinguish missing data.
In `@lib/bindings/python/src/dynamo/prometheus_names.py`:
- Around line 304-305: The KV_HIT_RATE metric name is inconsistent between
Python (KV_HIT_RATE = "router_kv_hit_rate" in prometheus_names.py) and Rust
(KV_HIT_RATE = "kv_hit_rate" in lib/runtime/src/metrics/prometheus_names.rs);
decide which source is authoritative and make them match: either move the Rust
constant into the router module and rename it to "router_kv_hit_rate" (so the
router module follows the "router_" prefix pattern), or change the Python
constant value to "kv_hit_rate" and then regenerate the Python bindings; after
updating the authoritative source (modify the KV_HIT_RATE definition in
prometheus_names.rs or the Python constant), run the generator to refresh
lib/bindings/python/src/dynamo/prometheus_names.py so both sides use the
identical KV_HIT_RATE value.
---
Outside diff comments:
In `@components/src/dynamo/planner/core/perf_model/agg.py`:
- Around line 181-194: The TTFT calculation in find_best_engine_agg_rps is using
a discounted effective_isl for prefill_per_iter but calls
self.estimate_next_ttft(...) without forwarding kv_hit_rate (or equivalent
scale), causing the TTFT gate to use an undiscounted avg_isl; fix by passing the
same kv_hit_rate (or compute scale = 1.0 - clamp(kv_hit_rate)) into
estimate_next_ttft so its scale parameter reflects the cache hit discount used
for prefill_per_iter, e.g., update the call site in find_best_engine_agg_rps to
include kv_hit_rate or scale so estimate_next_ttft uses the same discounting
logic.
In `@components/src/dynamo/planner/core/throughput_scaling.py`:
- Around line 212-243: The agg path omits the kv_hit_rate when calling
estimate_next_ttft in find_best_engine_agg_rps, causing TTFT to be computed
without the same KV-hit discount applied elsewhere; fix by passing the
kv_hit_rate (after clamping with _clamp_kv_hit_rate) into estimate_next_ttft so
the same effective ISL/discount used for prefill_per_iter and _predict_2d is
also applied to the _predict_wall_time TTFT estimate (adjust inputs to
estimate_next_ttft in find_best_engine_agg_rps to include kv_hit_rate), or add a
clear comment in find_best_engine_agg_rps explaining intentional asymmetry if
that was deliberate.
---
Nitpick comments:
In `@components/src/dynamo/planner/core/base.py`:
- Around line 465-469: The log currently treats m.kv_hit_rate only for None and
will print "nan" if m.kv_hit_rate is NaN; update the logger formatting around
m.kv_hit_rate in the logger.info block (the hit_rate_str assignment used above
the logger) to guard against NaN the same way _clamp_kv_hit_rate does (e.g.,
treat NaN like None and render "n/a" or the normalized value), so that
hit_rate_str is never the literal "nan" when passed into logger.info.
In `@components/src/dynamo/planner/core/throughput_scaling.py`:
- Around line 84-99: The method _predict_kv_hit_rate currently swallows all
exceptions from self._kv_hit_rate_predictor.predict_next; change this to only
treat the predictor's "not ready" exception as a benign cold-start (e.g., catch
the predictor-specific exception such as PredictorNotReady / NotEnoughDataError
thrown by the predictor and in that branch set self._diag_predicted_kv_hit_rate
= None and return None), and for any other Exception from predict_next log the
error and re-raise so real bugs surface; update the except block around
self._kv_hit_rate_predictor.predict_next accordingly and reference
_diag_predicted_kv_hit_rate and _predict_kv_hit_rate when implementing the
narrower handling.
In `@components/src/dynamo/planner/monitoring/diagnostics_recorder.py`:
- Around line 58-66: Add a new subplot in the _build_report_html function to
visualize observed_kv_hit_rate alongside predicted_kv_hit_rate so the new signal
is visible in generated reports; locate the _build_report_html method and add a
plot/trace that reads DiagnosticsRecorder.observed_kv_hit_rate and
DiagnosticsRecorder.predicted_kv_hit_rate (or the local variables named
observed_kv_hit_rate / predicted_kv_hit_rate) with appropriate labels/legend and
handle None values gracefully (skip or show gaps) to match existing plotting
patterns used for the other predicted_* vs observed_* signals.
In `@lib/mocker/src/replay/offline/components/router.rs`:
- Around line 487-565: The admit_request path currently silently saturates
isl_blocks with u32::try_from(...).unwrap_or(u32::MAX); add a
debug_assert!(isl_blocks != u32::MAX, "saturated isl_blocks: {}",
request.isl_tokens) or emit a tracing::warn! with the original ISL value right
after computing isl_blocks to surface unexpected saturation in debug/tracing
builds; and in drain_pending, avoid the redundant expect by extracting uuid from
the QueueEntry before calling admit_request (use the uuid variable from the
popped QueueEntry instead of relying on request.uuid inside admit_request) so
you don't need the panic path in build_pending_request and to keep admission
consumption semantics correct.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01428679-eef3-47aa-9514-05dfb563627a
📒 Files selected for processing (21)
components/src/dynamo/planner/core/base.pycomponents/src/dynamo/planner/core/load_scaling.pycomponents/src/dynamo/planner/core/perf_model/agg.pycomponents/src/dynamo/planner/core/perf_model/base.pycomponents/src/dynamo/planner/core/perf_model/prefill.pycomponents/src/dynamo/planner/core/state_machine.pycomponents/src/dynamo/planner/core/throughput_scaling.pycomponents/src/dynamo/planner/core/types.pycomponents/src/dynamo/planner/monitoring/diagnostics_recorder.pycomponents/src/dynamo/planner/monitoring/traffic_metrics.pycomponents/src/dynamo/planner/offline/replay_adapter.pycomponents/src/dynamo/planner/tests/unit/test_load_based_scaling.pycomponents/src/dynamo/planner/tests/unit/test_prometheus.pycomponents/src/dynamo/planner/tests/unit/test_state_machine.pylib/bindings/python/rust/llm/replay.rslib/bindings/python/src/dynamo/prometheus_names.pylib/mocker/src/replay/offline/agg.rslib/mocker/src/replay/offline/components/router.rslib/mocker/src/replay/offline/components/types.rslib/mocker/src/replay/offline/disagg.rslib/mocker/src/replay/planner_handle.rs
The KV router already publishes `dynamo_component_router_kv_hit_rate` (predicted prefix-cache hit rate at routing time), but the planner ignored it -- so scaling decisions over-counted prefill compute work on reuse-heavy workloads and scaled prefill up unnecessarily. This threads the signal through both scaling paths. Load planner: `estimate_next_ttft` in the prefill and agg regression models takes an optional `kv_hit_rate` and scales `(queued + avg_isl)` by `(1 - clamp(hit_rate))`. The regression's x-feature (per-iter chunk size) is untouched, so the fit stays valid -- only the simulation aggregate is discounted. `load_scaling` passes the sticky `state._last_kv_hit_rate` into both call sites. Throughput planner: a fourth load predictor (`_kv_hit_rate_predictor`) is added alongside num_req/isl/osl. Per requirement it is NOT warmed from the mooncake trace (no good offline proxy). At prediction time the predicted hit rate discounts only the prefill portion -- decode KV residency uses the raw ISL because cache hits reduce prefill compute but not the decode-time KV footprint. `find_best_engine_agg_rps` gains a `kv_hit_rate` param that applies the discount to its internal prefill-per-iter simulation. Replay: mocker's `TrafficAccumulator` now tracks overlap + ISL blocks via `on_admission(overlap_blocks, isl_blocks)` called from the prefill-router dispatch path (disagg) and the router admission path (agg). `drain_traffic` returns a 5-tuple including `avg_kv_hit_rate` as `Σoverlap / Σisl_blocks`, matching the real router's per-request histogram semantics. Python binding and `replay_adapter` thread the field into `TrafficObservation` so offline replay can exercise the same discount logic end-to-end. Safety: `_clamp_kv_hit_rate` caps at 0.95 (a stale 1.0 reading would otherwise zero out queued work), and treats None/NaN as 0.0 (no discount, preserving prior behavior exactly). Diagnostics expose the observed `kv_hit_rate` and the predicted value on each tick. Tests: 10 new Python tests covering regression discount, clamping, None/NaN fallback, warmup exclusion, diagnostics propagation, and end-to-end throughput-scaling demand reduction; 3 Rust tests on `TrafficAccumulator` including the weighted-by-isl-blocks property. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
c88c209 to
518a702
Compare
… for load scaling Plug two cadence gaps in the original KV-reuse-aware planner: 1. **Load-only deployments now scrape kv_hit_rate.** Previously the Prometheus scrape was tied to the throughput tick, so deployments with `enable_throughput_scaling=False` silently saw `_last_kv_hit_rate = None` forever and applied no discount. The scheduler now requests a kv-hit-rate-only scrape on each load tick (window = load_adjustment_interval) when throughput scaling is disabled. 2. **Mixed-mode load scaling now reads the predicted value.** When throughput scaling is enabled, the kv_hit_rate predictor produces a smoothed forecast on each throughput tick. `_advance_throughput` now promotes that predicted value to `_last_kv_hit_rate`, so all subsequent load ticks (between throughput ticks) consume the forecast rather than the raw last-window observation. Implementation: - `state_machine._next_scheduled_tick`: in load-only mode, set `need_traffic_metrics=True` and `traffic_metrics_duration_s = load_adjustment_interval` on load ticks. - `state_machine._observe_traffic`: gate num_req/isl/osl predictor feeds on `enable_throughput_scaling` (avoid polluting unused predictors with the load-only path's placeholder zeros). Only write directly to `_last_kv_hit_rate` in load-only mode; in mixed mode feed the predictor and let the throughput tick do the promotion. - `state_machine.on_tick`: parallel branch in `run_load_scaling` calls `_observe_traffic` for pure load ticks (skipped in mixed mode where the throughput branch already consumed it). - `throughput_scaling._advance_throughput`: after `_predict_kv_hit_rate`, store the predicted value in `_last_kv_hit_rate` for cross-cadence consumption. - `base.py`: add `_collect_kv_hit_rate_observation(duration_s)` — a cheap one-query scrape used in load-only mode to avoid issuing the six unused traffic queries per load tick. Dispatch in `_gather_tick_input` based on `tick.run_throughput_scaling`. Replay needs no code change: `TrafficStats.avg_kv_hit_rate` already exists, and the scheduler change drains the mocker's TrafficAccumulator at load cadence in load-only mode automatically. Tests: 7 new state-machine tests covering load-only direct-pass, mixed-mode predictor promotion, scheduler flag in both modes, and predictor-feed gating. Updated the existing `test_load_only` initial tick test to reflect the new `need_traffic_metrics=True` behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Three issues from CodeRabbit / Devin reviews on PR #8314: 1. **prometheus_names: register router KV_HIT_RATE in Rust source.** The Python constant `router.KV_HIT_RATE = "router_kv_hit_rate"` was added directly to the auto-generated file with no matching Rust entry, so the next codegen run would clobber it. Add `pub const KV_HIT_RATE` to the `router` module in `lib/runtime/src/metrics/prometheus_names.rs` and regenerate the Python module via `dynamo-codegen gen-python-prometheus-names`. 2. **traffic_metrics: preserve None for missing kv_hit_rate.** Routing through `_get_average_metric` collapsed Prometheus scrape gaps / NaN into a real `0.0`, which downstream is treated as a valid observation and would drag the predictor / sticky value down to zero on every failed scrape. Inline a custom query in `get_avg_kv_hit_rate` that returns `None` on empty/NaN/error so `_clamp_kv_hit_rate(None)` falls back to no-discount behavior. `_get_average_metric` is left unchanged (other metrics intentionally prefer 0 on missing data). 3. **agg.find_best_engine_agg_rps: uniform kv_hit_rate discount in TTFT estimate.** The batch-size sweep computed `prefill_per_iter` with `effective_isl` (already discounted) and then passed it to `estimate_next_ttft` *without* forwarding `kv_hit_rate`. Inside `estimate_next_ttft` the discount defaulted to 1.0, so the `avg_isl` term (the hypothetical next request's ISL from the regression model) was *not* discounted, inflating the predicted TTFT and over-provisioning replicas at high hit rate. Fix by passing the *raw* `prefill_per_iter = bs * isl / osl` together with `kv_hit_rate`, letting the function apply the discount uniformly to both the queued portion and `avg_isl`. Also switch the `_prefill_balanced` constraint to use `effective_isl` so the prefill admission rate reflects the cache discount and the batch-size cap widens correspondingly. Tests: new regression test `test_agg_find_best_engine_rps_uniform_discount_in_ttft_estimate` asserts that `engine_rps` strictly grows when `kv_hit_rate` jumps from 0.0 to 0.8 under permissive SLAs (the bug would have left it suppressed by the still-full `avg_isl` term). 127 / 127 planner unit tests pass; 216 / 216 mocker tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Per @jthomson04's review on PR #8314: the mocker's TrafficAccumulator was computing avg_kv_hit_rate as a block-weighted aggregate ratio ``Σoverlap_blocks / Σisl_blocks``, which gives larger requests more weight than smaller ones. The real router's Prometheus histogram observes one ``overlap/isl`` sample per request (see RequestTracker::kv_hit_rate in timing.rs:308-314), and the planner's PromQL ``sum(increase(_sum)) / sum(increase(_count))`` returns the arithmetic mean of those per-request samples. The two diverge when requests have variable ISL: Example: request A (4 ISL blocks, 3 overlap) + request B (12 ISL blocks, 0 overlap). Real router Prometheus: (0.75 + 0.0) / 2 = 0.375 Old mocker: 3 / 16 = 0.1875 Switch TrafficAccumulator to track a running sum of per-request ``overlap / isl`` ratios and a sample count; drain returns ``total_hit_rate / hit_rate_count``. Admissions with ``isl_blocks == 0`` are skipped (no meaningful ratio), matching ``RequestTracker::kv_hit_rate`` returning ``None`` in that case. Updates: - ``TrafficAccumulator``: replace ``total_overlap_blocks: u64``, ``total_isl_blocks: u64`` with ``total_hit_rate: f64``, ``hit_rate_count: usize``. ``on_admission`` now computes the per-request ratio up front. - ``TrafficStats`` and adapter-layer docstrings (planner_handle.rs, bindings/replay.rs) updated to describe the per-request-mean semantics. - Rename the Rust regression test ``traffic_accumulator_hit_rate_is_weighted_by_isl_blocks`` → ``traffic_accumulator_hit_rate_is_mean_of_per_request_ratios``, flip its expected value (0.375 instead of 0.1875), and add a new ``traffic_accumulator_skips_admissions_with_zero_isl_blocks`` test. Tests: 217 / 217 mocker tests pass (216 + 1 new); 127 / 127 planner tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Summary
The KV router publishes
dynamo_component_router_kv_hit_rate(predicted prefix-cache hit rate at routing time), but the planner ignored it — so scaling decisions over-counted prefill compute work on reuse-heavy workloads and scaled prefill up unnecessarily. This PR threads the signal through both scaling paths, through offline replay, and (in the latest commit) through the right tick cadence for each deployment mode.Discount math
estimate_next_ttftin the prefill and agg regressions takes an optionalkv_hit_rateand scales(queued + avg_isl) * (1 − clamp(hit_rate)). The regression's x-feature is untouched — only the simulation aggregate is discounted, so no double-counting against post-cachesum_prefill_tokens._kv_hit_rate_predictor) is added alongside num_req/isl/osl. Per requirement it is not warmed from the mooncake trace (no good offline proxy). Predicted hit rate discounts only the prefill portion of capacity sizing; decode KV residency uses raw ISL because cache hits don't shrink the decode footprint.find_best_engine_agg_rpsgains akv_hit_rateparam.TrafficAccumulatortracks overlap + ISL blocks via a newon_admission(overlap, isl_blocks)call from the prefill-router dispatch path.TrafficStatscarriesavg_kv_hit_rate = Σoverlap / Σisl_blocks, matching the real router's per-request histogram semantics. Python binding andreplay_adapterthread the field intoTrafficObservation.Tick cadence (latest commit)
Two cadence gaps from the initial implementation are now fixed:
enable_throughput_scaling=False): the kv-hit-rate scrape now rides on each load tick over the load interval, via a one-query_collect_kv_hit_rate_observation. Previously the scrape was tied to throughput ticks, so load-only mode silently saw_last_kv_hit_rate=Noneforever and applied no discount._advance_throughputnow promotes the_kv_hit_rate_predictor's smoothed forecast to_last_kv_hit_rate. All load ticks between throughput ticks consume the predicted value rather than the raw last-window observation.TrafficAccumulatorat the right cadence per mode.Safety
_clamp_kv_hit_ratecaps at 0.95 (a stale 1.0 reading would zero out queued work and mask backlog), and treats None/NaN as 0.0 (no discount — preserves prior behavior exactly). Diagnostics expose the observedkv_hit_rateand the predicted value per tick.Test plan
test_load_based_scaling,test_state_machine,test_prometheus: regression discount, clamping, None/NaN fallback, warmup exclusion, diagnostics propagation, end-to-end throughput-scaling demand reduction, prefill/agg parity, load-only direct-pass, mixed-mode predictor promotion, scheduler flag in both modes, predictor-feed gating.TrafficAccumulatorincluding the weighted-by-isl-blocks property.cargo fmtclean.Out of scope
sum_decode_kv_tokensis already directly measured).🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests