fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes by tedzhouhk · Pull Request #8280 · ai-dynamo/dynamo

tedzhouhk · 2026-04-16T20:37:32Z

Summary

Adds interactive HTML diagnostics reports to offline planner-in-the-loop replay, bootstraps the throughput regression from the mocker's perf model (via AIC), and fixes several planner correctness issues surfaced by the new reports.

Before this PR, `dynamo.replay --planner-config ...` produced only a text summary of trace metrics. After this PR, every run emits a Plotly HTML report showing replica counts, observed vs predicted traffic, TTFT/ITL vs SLA, per-engine KV load, engine capacity, and color-coded decision timelines — giving a per-tick view of how the planner reacted to the simulated workload.

What's in the PR

Replay diagnostics

HTML report — wire `DiagnosticsRecorder` into `ReplayPlannerAdapter` so offline replay produces the same interactive chart set as the live planner, plus:
- `report_filename` config for a stable path (refresh-in-browser)
- Throughput lower-bound dashed line on the Replica Counts chart (separate prefill / decode lines in disagg)
- Combined queued + inflight decode KV with a horizontal `max_kv_tokens` capacity line
- Per-component Load / Throughput decision timelines in disagg (prefill on y=2, decode on y=1)
- Hidden per-engine legend entries to stop the legend covering charts during scaling
- `connectgaps=True` on throughput-tick-only traces (engine RPS, predicted RPS/ISL/OSL)
TTFT / ITL plumbed through the traffic accumulator (Rust) — the mocker's `TraceCollector` already tracks arrival / first-token / per-token timestamps; now `TrafficAccumulator::on_request` reads `request_latencies()` and exposes `avg_ttft_ms` / `avg_itl_ms` via `drain_traffic()`. The Python adapter feeds these into `Metrics` so the report's "Observed TTFT / ITL vs SLA" charts have data.
Synthetic benchmark from AIC — instead of requiring the user to supply NPZ profiling data, replay now sweeps AIC's `predict_prefill` / `predict_decode` at a configurable `--benchmark-granularity` (default 8, matching the profiler) and bootstraps the planner's regression models with batch-size-aware data consistent with the mocker's timing. Prefill sweep caps at `max_num_batched_tokens`; agg-mode bootstrap combines prefill + decode points so both regression features have variance.

Planner correctness fixes (surfaced by replay diagnostics)

`find_best_engine_agg_rps` max_bs cap — the previous formula `int(max_num_batched_tokens / avg_ctx) * 2` collapsed to 2 whenever `ISL > MBT`, so `engine_rps` was wildly underestimated for prefill-heavy workloads. Replaced with a binary search in `[1, min(kv_cap, seq_cap)]` that enforces the prefill / decode steady-state balance `isl / (max_num_batched_tokens - x) <= osl`. Took engine_rps from 1.15 to ~3.5 on a representative workload.
`find_best_engine_prefill_rps` chunking — for `ISL > max_num_batched_tokens`, the wall-time prediction was being extrapolated beyond the regression's training domain. Now chunks `ISL` into `ceil(ISL / MBT)` forward passes and sums predictions per chunk.
`find_best_engine_decode_rps` capability caps — was using `_max_observed_kv / context_length` as the sweep upper bound, which stays low until the engine runs hot. Now uses `max_kv_tokens / context_length` and `max_num_seqs` from engine capabilities, with the observed-KV fallback only when capabilities aren't provided.
State-machine tick ordering — on a combined load + throughput tick, load scaling ran first with stale `throughput_lower_bound*`, then throughput updated the floor afterward, letting the load decision violate the brand-new floor. Swapped the order so throughput scaling updates the floor before load scaling reads it (precedence unchanged: load decision still wins over throughput decision when both produce one).
Throughput floor enforced on no-change — `_advance_load_disagg` and the SLA-mode agg path both returned `None` for "no change" before applying the floor, so replicas could linger below a raised floor for many ticks. Now computes the final count as `max(load_or_current, min_endpoint, throughput_floor)` first, then checks no-change against the post-floor value.
Relaxable negative coefficients — `DecodeRegressionModel` fits on `(num_decode_requests, sum_decode_kv_tokens)`; total KV dominates wall time, so `num_decode_requests` can flip slightly negative under multicollinearity + noisy FPMs. Added a `_relaxable_feature_indices` hook in the base class; decode marks index 0 as relaxable so minor negatives don't reject the whole fit. `sum_decode_kv_tokens` (and all agg / prefill features) still hard-reject on negatives.
Per-component decision reasons — `TickDiagnostics` now carries `{load,throughput}decision_reason{prefill,decode}` in disagg mode. Aggregate reason is derived by priority (`scale_up > scale_down_capped_by_throughput > scale_down > no_change`) for back-compat.

How to run

Trace preparation

Any Mooncake-format trace works. Example:

```bash
mkdir -p traces/mooncake_fast25
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl
-o traces/mooncake_fast25/toolagent_trace.jsonl
head -2361 traces/mooncake_fast25/toolagent_trace.jsonl > traces/mooncake_fast25/toolagent_trace_10pct.jsonl
```

Build Python bindings

```bash
.venv/bin/maturin develop --uv -m lib/bindings/python/Cargo.toml
export PYTHONPATH=lib/bindings/python/src:components/src
```

Install AIC (for throughput-based scaling)

```bash
uv pip install --python .venv/bin/python aiconfigurator
```

Agg mode with SLA-based scaling

```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace_10pct.jsonl
--planner-config '{
"mode": "agg",
"optimization_target": "sla",
"ttft": 2500,
"itl": 50,
"enable_throughput_scaling": true,
"enable_load_scaling": true,
"pre_deployment_sweeping_mode": "rapid",
"throughput_adjustment_interval": 60,
"load_adjustment_interval": 5,
"report_filename": "replay_agg.html"
}'
--extra-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--num-workers 2
--arrival-speedup-ratio 1.0
```

Disagg mode with SLA-based scaling

```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace_10pct.jsonl
--planner-config '{
"mode": "disagg",
"optimization_target": "sla",
"ttft": 2500,
"itl": 50,
"enable_throughput_scaling": true,
"enable_load_scaling": true,
"pre_deployment_sweeping_mode": "rapid",
"throughput_adjustment_interval": 60,
"load_adjustment_interval": 5,
"report_filename": "replay_disagg.html"
}'
--prefill-engine-args '{"worker_type": "prefill", "aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--decode-engine-args '{"worker_type": "decode", "aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--num-prefill-workers 1
--num-decode-workers 1
--arrival-speedup-ratio 1.0
```

Results

Example run on 10% of the FAST25 toolagent trace (2,361 requests, ISL~~8.8k / OSL~~173).

Agg (90 planner ticks, SLA TTFT=2500ms / ITL=50ms):

Metric	avg	p90	p99
TTFT	861 ms	2,063 ms	4,765 ms
ITL	24 ms	29 ms	195 ms
E2E	5,165 ms	14,586 ms	31,729 ms

Disagg (83 planner ticks, same SLA):

Metric	avg	p90	p99
TTFT	952 ms	2,263 ms	4,612 ms
ITL	19 ms	26 ms	32 ms
E2E	4,443 ms	11,759 ms	19,147 ms

Disagg's ITL is tighter because decode is isolated from prefill chunking; agg takes a slightly lower avg TTFT because the same engine can start serving prefill without cross-worker KV transfer.

Full interactive reports

Both HTML reports (~4.7 MB each, with observed TTFT / ITL vs SLA, predicted vs observed traffic, per-engine KV queue + inflight with capacity line, engine RPS, load / throughput decision timelines):

(Hosted on a gist — htmlpreview renders the Plotly charts in-browser. Use "View raw" on this gist if the preview link stalls.)

Test plan

All 86 planner unit tests pass (`pytest components/src/dynamo/planner/tests/unit/`)
Agg SLA replay produces scaling events respecting throughput floor
Disagg SLA replay produces separate prefill / decode decision timelines
`connectgaps` renders sparse throughput traces as continuous lines
Pre-commit clean on all touched files

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Enhanced diagnostics for load and throughput scaling with per-component decision tracking (prefill vs. decode).
- Extended performance models to respect KV token capacity and sequence concurrency limits.
- Improved traffic metrics capturing first-token latency and inter-token latency data.
Improvements
- Refined load scaling evaluation to properly apply post-bounds constraints.
- Better throughput lower bound visualization in monitoring dashboards.
- More accurate batch size calculations respecting token limits.

… replay Wire DiagnosticsRecorder into ReplayPlannerAdapter so offline planner-in-the-loop replay generates the same interactive Plotly HTML report as the live planner. Extend the Rust TrafficAccumulator to track per-request TTFT and mean ITL from the TraceCollector, plumbing observed latency data through drain_traffic() to the Python adapter and into the HTML report's "Observed TTFT/ITL vs SLA" charts. Bootstrap regression models from profiling data (NPZ) when throughput scaling is enabled, enabling SLA-mode replay with pre-deployment performance profiles. Hide per-engine FPM traces from the Plotly legend to prevent overlap with chart panels during dynamic scaling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

- Plot throughput-based scaling lower bound as a dashed red line on the Replica Counts chart when it's above 1. - Combine queued + inflight decode KV tokens into a single "total KV" trace per engine in the Decode Engine Load chart. - Add a horizontal KV capacity line to the decode engine load chart from engine capabilities (max_kv_tokens). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Generate synthetic FPMs by sweeping the mocker's polynomial perf model instead of loading external NPZ profiling data. This ensures the planner's capacity estimates are consistent with the simulated engine timing, avoiding overscaling from mismatched perf models. Add --benchmark-granularity CLI flag (default: 8, matching the profiler's default interpolation_granularity) to control the number of sweep points. Also expose throughput_lower_bound in TickDiagnostics and plot it as a dashed line on the Replica Counts chart. Combine queued + inflight decode KV tokens into a single trace and add a KV capacity horizontal line to the Decode Engine Load chart. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

The previous formula ``int(max_num_batched_tokens / avg_ctx) * 2`` capped the batch-size sweep at 2 for prefill-heavy workloads (e.g. ISL=8844, max_num_batched_tokens=8192), causing engine_rps to be dramatically underestimated and the throughput lower bound to be too high. Replace with a binary search in [1, min(kv_cap, seq_cap)] that finds the largest batch size satisfying the prefill/decode balance condition: isl / (max_num_batched_tokens - x) <= osl Above this point, prefill cannot keep up with decode and TTFT grows unbounded. Below it, additional concurrency is limited by KV cache or max_num_seqs. For our test workload (ISL=8844, OSL=173, MBT=8192, KV=262144), max_bs goes from 2 to 29 and engine_rps from 1.15 to ~3.5, reducing the throughput lower bound from 5-9 replicas to 2-3 replicas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Prefill: - Document the linear-scaling assumption behind ``engine_rps = 1/wt``: under ``wt = k·sum_prefill_tokens``, batching multiple prefills gives the same engine_rps as one-at-a-time. - Handle ``ISL > max_num_batched_tokens`` by chunking wall_time across forward passes rather than extrapolating the regression beyond its training domain (a single pass can never process more than MBT tokens). Decode: - Replace the observation-driven ``_max_observed_kv`` cap with engine capability caps (``max_kv_tokens``, ``max_num_seqs``) so the sweep reflects real engine capacity rather than runtime load history. The observed-KV cap is kept as a fallback when capabilities are not provided. Replay: - Cap the prefill benchmark sweep at ``max_num_batched_tokens`` instead of ``max_kv_tokens`` so AIC is never queried with per-pass token counts that a real engine could not process in a single iteration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

The DecodeRegressionModel fits on (num_decode_requests, sum_decode_kv_tokens). Total KV dominates wall time via attention compute; batch size has a weaker secondary effect. Under multicollinearity between the two features and noisy FPM observations, the num_decode_requests coefficient can flip slightly negative without violating physical monotonicity (total KV still drives wall time up with load). Previously the fit was rejected whenever any coefficient was negative, which caused the decode ITL estimate to be None on most load ticks, leaving the diagnostics report's "Estimated ITL vs SLA" chart sparse. Add a ``_clippable_feature_indices`` hook in the base regression model; subclasses can mark specific features as "clip to 0 if negative" rather than rejecting the whole fit. DecodeRegressionModel marks feature 0 (num_decode_requests) as clippable. sum_decode_kv_tokens must remain non-negative and will still trigger rejection if it turns negative. Also combine prefill-only and decode-only benchmark FPMs for agg-mode bootstrap so the agg regression sees variance on both features rather than only sum_decode_kv_tokens. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

…P bound Planner perf model: - Rename ``_clippable_feature_indices`` to ``_relaxable_feature_indices``. - Accept slightly-negative relaxable coefficients as-is instead of clipping them to 0 -- the dominant feature keeps predictions monotone and clipping was unnecessary complexity. Diagnostics report: - Add a separate dashed line for ``throughput_lower_bound_prefill`` in the Replica Counts subplot so disagg replays show both prefill and decode lower bounds. Prefill is dark blue, decode is red, matching their respective replica-count traces. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

On a combined tick (when load and throughput cadences align, e.g. every 60s with default intervals), the previous order ran load scaling first using the stale ``_throughput_lower_bound_*``, then updated the bounds from throughput scaling afterward. Load could therefore decide to scale below a floor that was about to be raised in the same tick, producing a brief window where actual replicas violated the new lower bound. Swap the order: run throughput scaling first so its updated lower bound is visible to load scaling's floor check in the same tick. Keep the precedence rule (load decision wins when it produces one) so the fast reactive loop still drives immediate scaling, while the throughput floor is always up-to-date before load reads it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

The disagg ``_advance_load_disagg`` and the SLA-mode agg ``_advance_load_agg`` paths both returned ``None`` for "no change" before applying the throughput-scaling lower bound. When load scaling saw current replicas as sufficient, the floor check never ran, so replicas could linger below a newly-raised throughput floor until load scaling independently decided to scale up for other reasons. Restructure both paths so the final replica count is computed as ``max(load_decision_or_current, min_endpoint, throughput_floor)`` and the no-change early return is evaluated against the post-floor value. This way a load tick following (or coincident with) a throughput tick that raises the floor will scale up immediately, rather than waiting several ticks for the load signal to independently cross the threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Add per-component load and throughput decision reasons to TickDiagnostics (``load_decision_reason_prefill``, ``load_decision_reason_decode``, and same for throughput). Disagg load scaling sets each side's reason independently so the report's Load / Throughput decision timelines can plot two tracks (prefill at y=2, decode at y=1) with "prefill" / "decode" y-axis labels. Agg mode falls back to the existing single-track rendering. The aggregate ``load_decision_reason`` is still populated (for back-compat and Prometheus metrics) and is chosen by priority: scale_up > scale_down_capped_by_throughput > scale_down > no_change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Engine RPS, Predicted RPS, Predicted ISL, and Predicted OSL are only populated on throughput-scaling ticks (every 60s by default), so the Plotly default of disconnecting lines at None values produces scattered markers with no connecting line. Enable ``connectgaps=True`` on these four traces so the 60s data points form a continuous line across gaps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

- Split AIC bootstrap error handling: distinct try/except for session creation vs benchmark FPM generation, and validate FPM lists are non-empty before calling load_benchmark_fpms. - Add explicit boundary comment + assertion for prefill chunking at ``isl == max_num_batched_tokens``. - Expand the agg max_bs balance-formula comment to show the full admission-rate >= egress-rate derivation. - Clarify state machine's ``_next_throughput_s`` advancement on ticks without traffic data. - Preserve legitimate zero TTFT/ITL values: gate on ``num_req > 0`` instead of ``avg_ttft_ms > 0`` so cache-hit or synthetic-zero latencies are still recorded in the diagnostics report. - Rename "Scaling events" to "Replica transitions" in the HTML report summary and count actual count changes rather than decision ticks. - Document the report's agg vs disagg layout switch for the decision timeline subplots. - Explain in the base regression model docstring why only DecodeRegressionModel uses ``_relaxable_feature_indices``. - Add a maintenance note on ``TrafficStats`` so future field additions also update the PyO3 binding and Python adapter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

coderabbitai · 2026-04-16T20:43:35Z

Walkthrough

This PR enhances scaling decision diagnostics by segregating load and throughput reasons per component (prefill/decode), extends performance regression models with resource constraints (KV tokens, sequence counts, batch sizes), and enriches traffic tracking with latency metrics (TTFT, inter-token latency) across the Rust backend and Python planner.

Changes

Cohort / File(s)	Summary
Load Scaling Diagnostics `components/src/dynamo/planner/core/load_scaling.py`	Removed early-return for unchanged load decisions; refactored to apply bounds/budgets before concluding no scaling. Added per-component diagnostic computation (distinguishing `scale_up`, `scale_down`, `scale_down_capped_by_throughput`, `no_change`) and aggregate prioritization logic. Updated aggregated `desired` handling to allow throughput enforcement even on unchanged load decisions.
Performance Model Constraints `components/src/dynamo/planner/core/perf_model/agg.py`, `decode.py`, `prefill.py`	Extended regression model APIs with optional resource constraints: `find_best_engine_agg_rps` and `find_best_engine_decode_rps` now accept `max_kv_tokens` and `max_num_seqs`; `find_best_engine_prefill_rps` accepts `max_num_batched_tokens`. Prefill adds conditional chunking logic when input size exceeds batch limit. Decode and agg compute batch size upper bound as minimum of KV-capacity, sequence concurrency, and balance constraints via binary search.
Regression Model Foundation `components/src/dynamo/planner/core/perf_model/base.py`	Introduced `_relaxable_feature_indices` to allow selected regression coefficients to be negative without invalidating the fit; updated validation to reject only non-relaxable negative coefficients.
State Machine & Scaling Resolution `components/src/dynamo/planner/core/state_machine.py`, `throughput_scaling.py`	Added per-component diagnostic fields (`_diag_load_reason_prefill`, `_diag_load_reason_decode`, `_diag_throughput_reason_prefill`, `_diag_throughput_reason_decode`) and reordered tick resolution: throughput scaling executes first, load scaling follows, and load decision takes precedence when present. Extended regression queries to pass capability limits (KV tokens, sequence counts, batch sizes) to perf models.
Diagnostics Data & Reporting `components/src/dynamo/planner/core/types.py`, `monitoring/diagnostics_recorder.py`	Extended `TickDiagnostics` with throughput lower bounds and per-component decision reasons for prefill/decode. Updated `DiagnosticsRecorder` to plumb `max_kv_tokens` and enriched Plotly charts: conditional lower-bound overlay lines, disaggregated decision timelines with per-component markers, combined KV token visualizations, and replica transition counting instead of decision presence.
Traffic Latency Tracking (Python) `components/src/dynamo/planner/offline/replay_adapter.py`	Refined traffic tracking to conditionally record latency fields (TTFT, inter-token latency) only when requests are present; moved `_last_traffic` updates from diagnostics recording to traffic-metric-building phase.
Traffic Statistics Struct & Drain APIs (Rust) `lib/mocker/src/replay/offline/components/types.rs`, `agg.rs`, `disagg.rs`, `planner_handle.rs`, `lib/bindings/python/rust/llm/replay.rs`	Created new `TrafficStats` struct (adds `avg_ttft_ms`, `avg_itl_ms` to existing duration/count/throughput fields). Updated `TrafficAccumulator::on_request` to accept optional latency tuple and accumulate TTFT/ITL metrics. Changed drain APIs across `agg`, `disagg`, and `planner_handle` to return `TrafficStats` instead of tuple. Extended Python binding to expose new latency fields in JSON output.
Latency Collection & Benchmarking `lib/mocker/src/replay/collector.rs`, `components/src/dynamo/planner/offline/replay_adapter.py`, `lib/bindings/python/src/dynamo/replay/main.py`	Added `request_latencies()` method to compute per-request TTFT and inter-token latency from trace data. Integrated AIC-based synthetic benchmark generation into replay planner when model is not "easy", sweeping input parameters with engine KV budget constraints and merging points for aggregated mode.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check	✅ Passed	The title clearly identifies the PR's main focus: planner-in-the-loop replay diagnostics and planner scaling logic fixes, which aligns with the primary changes across multiple files.
Description check	✅ Passed	The PR description comprehensively covers the changes, objectives, use cases, and test results, but does not follow the template structure with explicit sections for Overview, Details, Where to start, and Related Issues.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

jthomson04

Looks good. One nit (can be for a follow up): Would it be possible to get the x-axes of all the charts to scale together? If I want to narrow down on a specific time period, I have to separately zoom into that area on each graph.

Correctness: - load_scaling: distinguish throughput-floor lifts from min_endpoint / global-budget lifts when labelling ``scale_down_capped_by_throughput`` so the decision timeline reflects the actual cause. - load_scaling: mirror aggregate reason onto per-component fields on ``no_fpm_data`` / ``worker_count_mismatch`` early returns so disagg timelines stay consistent with the top-level reason. - base perf model: add 1e-6 noise tolerance for relaxable negative coefficients; clamp anything beyond the tolerance back to 0 so the model never predicts lower wall time for higher feature values. - throughput_scaling: when one disagg component's regression isn't fitted yet, label the other side ``partner_not_ready`` instead of leaving its per-component reason ``None``. - Rust TrafficAccumulator: track ``ttft_count`` separately from ``num_req`` so ``avg_ttft_ms`` reflects only requests that actually produced a TTFT sample (mirroring the existing ``itl_count`` pattern). Quality: - prefill perf model: replace ``assert`` with a comment stating the ``remainder ∈ (0, max_num_batched_tokens]`` invariant (asserts are stripped under ``python -O``). - replay main: narrow the broad ``except Exception`` around AIC session creation and benchmark generation to the concrete exception types those paths can raise, preserving graceful degradation without swallowing unrelated bugs. - replay main: decouple ``benchmark_granularity`` from the decode batch-size ceiling; the sweep now steps up to ``max_num_seqs`` with ``granularity`` points per axis, giving the regression variance on ``num_decode_requests`` at realistic concurrency levels. - load_scaling (agg): remove unreachable ``else`` branch after the ``desired == num_workers`` early return. - diagnostics recorder: add ``partner_not_ready`` to the throughput decision colour map (pink) so the new reason renders correctly. Documentation: - Rust bridge ``drain_traffic`` docstring: enumerate all six keys with units so the Python-facing API doc matches the returned payload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Run cargo fmt to satisfy the rust-tests CI formatting check on the three mocker replay files touched by the latency-tracking changes. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Mypy CI failed because LoadScalingMixin / ThroughputScalingMixin declared only the aggregate ``_diag_*_reason`` scratch fields while the state machine initialized four new per-component variants with ``Optional[str] = None``. Declare the per-component fields in the mixins too so the assignments in state_machine.__init__ type-check. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

tedzhouhk · 2026-04-16T23:14:17Z

Looks good. One nit (can be for a follow up): Would it be possible to get the x-axes of all the charts to scale together? If I want to narrow down on a specific time period, I have to separately zoom into that area on each graph.

sure, added a TODO

Note the reviewer request to share the x-axis across all subplots so zooming on one chart narrows all of them to the same time range. Deferred to a follow-up since the current report is shipping. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Mypy flagged passing ``Optional[str]`` (aic_system, aic_model_path) to ``create_session`` which expects ``str``. Tighten the pre-check so the bootstrap branch only runs when all three required AIC fields are non-None, mirroring the existing ``aic_backend is None`` guard. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Devin flagged two issues on #8280: 1. In disagg mode the benchmark generator used prefill_engine_args for both prefill and decode sweeps because ``ref_args`` resolved to ``prefill_engine_args`` when ``extra_engine_args`` was None. Decode engines typically have much higher ``max_num_seqs`` and larger KV caches than prefill, so decoding regressions were trained on data that didn't cover the actual decode engine's operating range. Split ``_generate_aic_benchmark_fpms`` into separate prefill and decode helpers and pick the appropriate engine args per call site: agg uses ``extra_engine_args`` for both, disagg uses ``prefill_engine_args`` / ``decode_engine_args`` respectively. 2. ``ForwardPassMetrics`` / ``ScheduledRequestMetrics`` were imported inside the function body, violating the project's Python guideline that first-party imports belong at the top of the file. Move them to module-level imports. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

… scaling logics fixes (#8280) Signed-off-by: hongkuanz <hongkuanz@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>

tedzhouhk and others added 12 commits April 15, 2026 18:30

tedzhouhk requested review from a team as code owners April 16, 2026 20:37

pull-request-size Bot added the size/XL label Apr 16, 2026

github-actions Bot added feat planner labels Apr 16, 2026

tedzhouhk changed the title ~~feat(planner): planner-in-the-loop replay diagnostics + capacity estimation fixes~~ fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fix Apr 16, 2026

tedzhouhk changed the title ~~fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fix~~ fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes Apr 16, 2026

github-actions Bot added fix and removed feat labels Apr 16, 2026

This comment was marked as resolved.

Sign in to view

jthomson04 approved these changes Apr 16, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to GITLAB April 16, 2026 22:41 Inactive

style: cargo fmt on mocker replay files

41ef3bd

Run cargo fmt to satisfy the rust-tests CI formatting check on the three mocker replay files touched by the latency-tracking changes. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

copy-pr-bot Bot temporarily deployed to GITLAB April 16, 2026 23:08 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 16, 2026 23:11 Inactive

pull-request-size Bot added size/XXL and removed size/XL labels Apr 16, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 16, 2026 23:15 Inactive

tedzhouhk enabled auto-merge (squash) April 16, 2026 23:15

copy-pr-bot Bot temporarily deployed to GITLAB April 16, 2026 23:37 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 01:41 Inactive

Merge branch 'main' into hzhou/replay-planner-diagnostics

9aa5e59

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 16:20 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 16:21 Inactive

This comment was marked as resolved.

Sign in to view

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 18:24 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 18:31 Inactive

tedzhouhk merged commit 0c7068d into main Apr 17, 2026
90 of 92 checks passed

tedzhouhk deleted the hzhou/replay-planner-diagnostics branch April 17, 2026 19:39

dmitry-tokarev-nv mentioned this pull request Apr 20, 2026

fix(planner/tests): scope test_advisory_mode stubs so they don't leak #8418

Merged

3 tasks

tedzhouhk mentioned this pull request May 12, 2026

DEP: Refactor DGDR profiler to use Dynamo Replay #9410

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes#8280

fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes#8280
tedzhouhk merged 19 commits into
mainfrom
hzhou/replay-planner-diagnostics

tedzhouhk commented Apr 16, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

jthomson04 left a comment

Uh oh!

tedzhouhk commented Apr 16, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tedzhouhk commented Apr 16, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in the PR

Replay diagnostics

Planner correctness fixes (surfaced by replay diagnostics)

How to run

Trace preparation

Build Python bindings

Install AIC (for throughput-based scaling)

Agg mode with SLA-based scaling

Disagg mode with SLA-based scaling

Results

Full interactive reports

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

This comment was marked as resolved.

Uh oh!

jthomson04 left a comment

Choose a reason for hiding this comment

Uh oh!

tedzhouhk commented Apr 16, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tedzhouhk commented Apr 16, 2026 •

edited by devin-ai-integration Bot

Loading

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading