fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes#8280
Conversation
… replay Wire DiagnosticsRecorder into ReplayPlannerAdapter so offline planner-in-the-loop replay generates the same interactive Plotly HTML report as the live planner. Extend the Rust TrafficAccumulator to track per-request TTFT and mean ITL from the TraceCollector, plumbing observed latency data through drain_traffic() to the Python adapter and into the HTML report's "Observed TTFT/ITL vs SLA" charts. Bootstrap regression models from profiling data (NPZ) when throughput scaling is enabled, enabling SLA-mode replay with pre-deployment performance profiles. Hide per-engine FPM traces from the Plotly legend to prevent overlap with chart panels during dynamic scaling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
- Plot throughput-based scaling lower bound as a dashed red line on the Replica Counts chart when it's above 1. - Combine queued + inflight decode KV tokens into a single "total KV" trace per engine in the Decode Engine Load chart. - Add a horizontal KV capacity line to the decode engine load chart from engine capabilities (max_kv_tokens). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Generate synthetic FPMs by sweeping the mocker's polynomial perf model instead of loading external NPZ profiling data. This ensures the planner's capacity estimates are consistent with the simulated engine timing, avoiding overscaling from mismatched perf models. Add --benchmark-granularity CLI flag (default: 8, matching the profiler's default interpolation_granularity) to control the number of sweep points. Also expose throughput_lower_bound in TickDiagnostics and plot it as a dashed line on the Replica Counts chart. Combine queued + inflight decode KV tokens into a single trace and add a KV capacity horizontal line to the Decode Engine Load chart. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The previous formula ``int(max_num_batched_tokens / avg_ctx) * 2`` capped
the batch-size sweep at 2 for prefill-heavy workloads (e.g. ISL=8844,
max_num_batched_tokens=8192), causing engine_rps to be dramatically
underestimated and the throughput lower bound to be too high.
Replace with a binary search in [1, min(kv_cap, seq_cap)] that finds the
largest batch size satisfying the prefill/decode balance condition:
isl / (max_num_batched_tokens - x) <= osl
Above this point, prefill cannot keep up with decode and TTFT grows
unbounded. Below it, additional concurrency is limited by KV cache or
max_num_seqs.
For our test workload (ISL=8844, OSL=173, MBT=8192, KV=262144), max_bs
goes from 2 to 29 and engine_rps from 1.15 to ~3.5, reducing the
throughput lower bound from 5-9 replicas to 2-3 replicas.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Prefill: - Document the linear-scaling assumption behind ``engine_rps = 1/wt``: under ``wt = k·sum_prefill_tokens``, batching multiple prefills gives the same engine_rps as one-at-a-time. - Handle ``ISL > max_num_batched_tokens`` by chunking wall_time across forward passes rather than extrapolating the regression beyond its training domain (a single pass can never process more than MBT tokens). Decode: - Replace the observation-driven ``_max_observed_kv`` cap with engine capability caps (``max_kv_tokens``, ``max_num_seqs``) so the sweep reflects real engine capacity rather than runtime load history. The observed-KV cap is kept as a fallback when capabilities are not provided. Replay: - Cap the prefill benchmark sweep at ``max_num_batched_tokens`` instead of ``max_kv_tokens`` so AIC is never queried with per-pass token counts that a real engine could not process in a single iteration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The DecodeRegressionModel fits on (num_decode_requests, sum_decode_kv_tokens). Total KV dominates wall time via attention compute; batch size has a weaker secondary effect. Under multicollinearity between the two features and noisy FPM observations, the num_decode_requests coefficient can flip slightly negative without violating physical monotonicity (total KV still drives wall time up with load). Previously the fit was rejected whenever any coefficient was negative, which caused the decode ITL estimate to be None on most load ticks, leaving the diagnostics report's "Estimated ITL vs SLA" chart sparse. Add a ``_clippable_feature_indices`` hook in the base regression model; subclasses can mark specific features as "clip to 0 if negative" rather than rejecting the whole fit. DecodeRegressionModel marks feature 0 (num_decode_requests) as clippable. sum_decode_kv_tokens must remain non-negative and will still trigger rejection if it turns negative. Also combine prefill-only and decode-only benchmark FPMs for agg-mode bootstrap so the agg regression sees variance on both features rather than only sum_decode_kv_tokens. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…P bound Planner perf model: - Rename ``_clippable_feature_indices`` to ``_relaxable_feature_indices``. - Accept slightly-negative relaxable coefficients as-is instead of clipping them to 0 -- the dominant feature keeps predictions monotone and clipping was unnecessary complexity. Diagnostics report: - Add a separate dashed line for ``throughput_lower_bound_prefill`` in the Replica Counts subplot so disagg replays show both prefill and decode lower bounds. Prefill is dark blue, decode is red, matching their respective replica-count traces. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
On a combined tick (when load and throughput cadences align, e.g. every 60s with default intervals), the previous order ran load scaling first using the stale ``_throughput_lower_bound_*``, then updated the bounds from throughput scaling afterward. Load could therefore decide to scale below a floor that was about to be raised in the same tick, producing a brief window where actual replicas violated the new lower bound. Swap the order: run throughput scaling first so its updated lower bound is visible to load scaling's floor check in the same tick. Keep the precedence rule (load decision wins when it produces one) so the fast reactive loop still drives immediate scaling, while the throughput floor is always up-to-date before load reads it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The disagg ``_advance_load_disagg`` and the SLA-mode agg ``_advance_load_agg`` paths both returned ``None`` for "no change" before applying the throughput-scaling lower bound. When load scaling saw current replicas as sufficient, the floor check never ran, so replicas could linger below a newly-raised throughput floor until load scaling independently decided to scale up for other reasons. Restructure both paths so the final replica count is computed as ``max(load_decision_or_current, min_endpoint, throughput_floor)`` and the no-change early return is evaluated against the post-floor value. This way a load tick following (or coincident with) a throughput tick that raises the floor will scale up immediately, rather than waiting several ticks for the load signal to independently cross the threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Add per-component load and throughput decision reasons to TickDiagnostics (``load_decision_reason_prefill``, ``load_decision_reason_decode``, and same for throughput). Disagg load scaling sets each side's reason independently so the report's Load / Throughput decision timelines can plot two tracks (prefill at y=2, decode at y=1) with "prefill" / "decode" y-axis labels. Agg mode falls back to the existing single-track rendering. The aggregate ``load_decision_reason`` is still populated (for back-compat and Prometheus metrics) and is chosen by priority: scale_up > scale_down_capped_by_throughput > scale_down > no_change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Engine RPS, Predicted RPS, Predicted ISL, and Predicted OSL are only populated on throughput-scaling ticks (every 60s by default), so the Plotly default of disconnecting lines at None values produces scattered markers with no connecting line. Enable ``connectgaps=True`` on these four traces so the 60s data points form a continuous line across gaps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
- Split AIC bootstrap error handling: distinct try/except for session creation vs benchmark FPM generation, and validate FPM lists are non-empty before calling load_benchmark_fpms. - Add explicit boundary comment + assertion for prefill chunking at ``isl == max_num_batched_tokens``. - Expand the agg max_bs balance-formula comment to show the full admission-rate >= egress-rate derivation. - Clarify state machine's ``_next_throughput_s`` advancement on ticks without traffic data. - Preserve legitimate zero TTFT/ITL values: gate on ``num_req > 0`` instead of ``avg_ttft_ms > 0`` so cache-hit or synthetic-zero latencies are still recorded in the diagnostics report. - Rename "Scaling events" to "Replica transitions" in the HTML report summary and count actual count changes rather than decision ticks. - Document the report's agg vs disagg layout switch for the decision timeline subplots. - Explain in the base regression model docstring why only DecodeRegressionModel uses ``_relaxable_feature_indices``. - Add a maintenance note on ``TrafficStats`` so future field additions also update the PyO3 binding and Python adapter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
WalkthroughThis PR enhances scaling decision diagnostics by segregating load and throughput reasons per component (prefill/decode), extends performance regression models with resource constraints (KV tokens, sequence counts, batch sizes), and enriches traffic tracking with latency metrics (TTFT, inter-token latency) across the Rust backend and Python planner. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
jthomson04
left a comment
There was a problem hiding this comment.
Looks good. One nit (can be for a follow up): Would it be possible to get the x-axes of all the charts to scale together? If I want to narrow down on a specific time period, I have to separately zoom into that area on each graph.
Correctness: - load_scaling: distinguish throughput-floor lifts from min_endpoint / global-budget lifts when labelling ``scale_down_capped_by_throughput`` so the decision timeline reflects the actual cause. - load_scaling: mirror aggregate reason onto per-component fields on ``no_fpm_data`` / ``worker_count_mismatch`` early returns so disagg timelines stay consistent with the top-level reason. - base perf model: add 1e-6 noise tolerance for relaxable negative coefficients; clamp anything beyond the tolerance back to 0 so the model never predicts lower wall time for higher feature values. - throughput_scaling: when one disagg component's regression isn't fitted yet, label the other side ``partner_not_ready`` instead of leaving its per-component reason ``None``. - Rust TrafficAccumulator: track ``ttft_count`` separately from ``num_req`` so ``avg_ttft_ms`` reflects only requests that actually produced a TTFT sample (mirroring the existing ``itl_count`` pattern). Quality: - prefill perf model: replace ``assert`` with a comment stating the ``remainder ∈ (0, max_num_batched_tokens]`` invariant (asserts are stripped under ``python -O``). - replay main: narrow the broad ``except Exception`` around AIC session creation and benchmark generation to the concrete exception types those paths can raise, preserving graceful degradation without swallowing unrelated bugs. - replay main: decouple ``benchmark_granularity`` from the decode batch-size ceiling; the sweep now steps up to ``max_num_seqs`` with ``granularity`` points per axis, giving the regression variance on ``num_decode_requests`` at realistic concurrency levels. - load_scaling (agg): remove unreachable ``else`` branch after the ``desired == num_workers`` early return. - diagnostics recorder: add ``partner_not_ready`` to the throughput decision colour map (pink) so the new reason renders correctly. Documentation: - Rust bridge ``drain_traffic`` docstring: enumerate all six keys with units so the Python-facing API doc matches the returned payload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Run cargo fmt to satisfy the rust-tests CI formatting check on the three mocker replay files touched by the latency-tracking changes. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Mypy CI failed because LoadScalingMixin / ThroughputScalingMixin declared only the aggregate ``_diag_*_reason`` scratch fields while the state machine initialized four new per-component variants with ``Optional[str] = None``. Declare the per-component fields in the mixins too so the assignments in state_machine.__init__ type-check. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
sure, added a TODO |
Note the reviewer request to share the x-axis across all subplots so zooming on one chart narrows all of them to the same time range. Deferred to a follow-up since the current report is shipping. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Mypy flagged passing ``Optional[str]`` (aic_system, aic_model_path) to ``create_session`` which expects ``str``. Tighten the pre-check so the bootstrap branch only runs when all three required AIC fields are non-None, mirroring the existing ``aic_backend is None`` guard. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Devin flagged two issues on #8280: 1. In disagg mode the benchmark generator used prefill_engine_args for both prefill and decode sweeps because ``ref_args`` resolved to ``prefill_engine_args`` when ``extra_engine_args`` was None. Decode engines typically have much higher ``max_num_seqs`` and larger KV caches than prefill, so decoding regressions were trained on data that didn't cover the actual decode engine's operating range. Split ``_generate_aic_benchmark_fpms`` into separate prefill and decode helpers and pick the appropriate engine args per call site: agg uses ``extra_engine_args`` for both, disagg uses ``prefill_engine_args`` / ``decode_engine_args`` respectively. 2. ``ForwardPassMetrics`` / ``ScheduledRequestMetrics`` were imported inside the function body, violating the project's Python guideline that first-party imports belong at the top of the file. Move them to module-level imports. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
… scaling logics fixes (#8280) Signed-off-by: hongkuanz <hongkuanz@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Summary
Adds interactive HTML diagnostics reports to offline planner-in-the-loop replay, bootstraps the throughput regression from the mocker's perf model (via AIC), and fixes several planner correctness issues surfaced by the new reports.
Before this PR, `dynamo.replay --planner-config ...` produced only a text summary of trace metrics. After this PR, every run emits a Plotly HTML report showing replica counts, observed vs predicted traffic, TTFT/ITL vs SLA, per-engine KV load, engine capacity, and color-coded decision timelines — giving a per-tick view of how the planner reacted to the simulated workload.
What's in the PR
Replay diagnostics
HTML report — wire `DiagnosticsRecorder` into `ReplayPlannerAdapter` so offline replay produces the same interactive chart set as the live planner, plus:
TTFT / ITL plumbed through the traffic accumulator (Rust) — the mocker's `TraceCollector` already tracks arrival / first-token / per-token timestamps; now `TrafficAccumulator::on_request` reads `request_latencies()` and exposes `avg_ttft_ms` / `avg_itl_ms` via `drain_traffic()`. The Python adapter feeds these into `Metrics` so the report's "Observed TTFT / ITL vs SLA" charts have data.
Synthetic benchmark from AIC — instead of requiring the user to supply NPZ profiling data, replay now sweeps AIC's `predict_prefill` / `predict_decode` at a configurable `--benchmark-granularity` (default 8, matching the profiler) and bootstraps the planner's regression models with batch-size-aware data consistent with the mocker's timing. Prefill sweep caps at `max_num_batched_tokens`; agg-mode bootstrap combines prefill + decode points so both regression features have variance.
Planner correctness fixes (surfaced by replay diagnostics)
`find_best_engine_agg_rps` max_bs cap — the previous formula `int(max_num_batched_tokens / avg_ctx) * 2` collapsed to 2 whenever `ISL > MBT`, so `engine_rps` was wildly underestimated for prefill-heavy workloads. Replaced with a binary search in `[1, min(kv_cap, seq_cap)]` that enforces the prefill / decode steady-state balance `isl / (max_num_batched_tokens - x) <= osl`. Took engine_rps from 1.15 to ~3.5 on a representative workload.
`find_best_engine_prefill_rps` chunking — for `ISL > max_num_batched_tokens`, the wall-time prediction was being extrapolated beyond the regression's training domain. Now chunks `ISL` into `ceil(ISL / MBT)` forward passes and sums predictions per chunk.
`find_best_engine_decode_rps` capability caps — was using `_max_observed_kv / context_length` as the sweep upper bound, which stays low until the engine runs hot. Now uses `max_kv_tokens / context_length` and `max_num_seqs` from engine capabilities, with the observed-KV fallback only when capabilities aren't provided.
State-machine tick ordering — on a combined load + throughput tick, load scaling ran first with stale `throughput_lower_bound*`, then throughput updated the floor afterward, letting the load decision violate the brand-new floor. Swapped the order so throughput scaling updates the floor before load scaling reads it (precedence unchanged: load decision still wins over throughput decision when both produce one).
Throughput floor enforced on no-change — `_advance_load_disagg` and the SLA-mode agg path both returned `None` for "no change" before applying the floor, so replicas could linger below a raised floor for many ticks. Now computes the final count as `max(load_or_current, min_endpoint, throughput_floor)` first, then checks no-change against the post-floor value.
Relaxable negative coefficients — `DecodeRegressionModel` fits on `(num_decode_requests, sum_decode_kv_tokens)`; total KV dominates wall time, so `num_decode_requests` can flip slightly negative under multicollinearity + noisy FPMs. Added a `_relaxable_feature_indices` hook in the base class; decode marks index 0 as relaxable so minor negatives don't reject the whole fit. `sum_decode_kv_tokens` (and all agg / prefill features) still hard-reject on negatives.
Per-component decision reasons — `TickDiagnostics` now carries `{load,throughput}decision_reason{prefill,decode}` in disagg mode. Aggregate reason is derived by priority (`scale_up > scale_down_capped_by_throughput > scale_down > no_change`) for back-compat.
How to run
Trace preparation
Any Mooncake-format trace works. Example:
```bash
mkdir -p traces/mooncake_fast25
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl
-o traces/mooncake_fast25/toolagent_trace.jsonl
head -2361 traces/mooncake_fast25/toolagent_trace.jsonl > traces/mooncake_fast25/toolagent_trace_10pct.jsonl
```
Build Python bindings
```bash
.venv/bin/maturin develop --uv -m lib/bindings/python/Cargo.toml
export PYTHONPATH=lib/bindings/python/src:components/src
```
Install AIC (for throughput-based scaling)
```bash
uv pip install --python .venv/bin/python aiconfigurator
```
Agg mode with SLA-based scaling
```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace_10pct.jsonl
--planner-config '{
"mode": "agg",
"optimization_target": "sla",
"ttft": 2500,
"itl": 50,
"enable_throughput_scaling": true,
"enable_load_scaling": true,
"pre_deployment_sweeping_mode": "rapid",
"throughput_adjustment_interval": 60,
"load_adjustment_interval": 5,
"report_filename": "replay_agg.html"
}'
--extra-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--num-workers 2
--arrival-speedup-ratio 1.0
```
Disagg mode with SLA-based scaling
```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace_10pct.jsonl
--planner-config '{
"mode": "disagg",
"optimization_target": "sla",
"ttft": 2500,
"itl": 50,
"enable_throughput_scaling": true,
"enable_load_scaling": true,
"pre_deployment_sweeping_mode": "rapid",
"throughput_adjustment_interval": 60,
"load_adjustment_interval": 5,
"report_filename": "replay_disagg.html"
}'
--prefill-engine-args '{"worker_type": "prefill", "aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--decode-engine-args '{"worker_type": "decode", "aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--num-prefill-workers 1
--num-decode-workers 1
--arrival-speedup-ratio 1.0
```
Results
Example run on 10% of the FAST25 toolagent trace (2,361 requests, ISL
8.8k / OSL173).Agg (90 planner ticks, SLA TTFT=2500ms / ITL=50ms):
Disagg (83 planner ticks, same SLA):
Disagg's ITL is tighter because decode is isolated from prefill chunking; agg takes a slightly lower avg TTFT because the same engine can start serving prefill without cross-worker KV transfer.
Full interactive reports
Both HTML reports (~4.7 MB each, with observed TTFT / ITL vs SLA, predicted vs observed traffic, per-engine KV queue + inflight with capacity line, engine RPS, load / throughput decision timelines):
(Hosted on a gist — htmlpreview renders the Plotly charts in-browser. Use "View raw" on this gist if the preview link stalls.)
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements