Skip to content

fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes#8280

Merged
tedzhouhk merged 19 commits into
mainfrom
hzhou/replay-planner-diagnostics
Apr 17, 2026
Merged

fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes#8280
tedzhouhk merged 19 commits into
mainfrom
hzhou/replay-planner-diagnostics

Conversation

@tedzhouhk
Copy link
Copy Markdown
Contributor

@tedzhouhk tedzhouhk commented Apr 16, 2026

Summary

Adds interactive HTML diagnostics reports to offline planner-in-the-loop replay, bootstraps the throughput regression from the mocker's perf model (via AIC), and fixes several planner correctness issues surfaced by the new reports.

Before this PR, `dynamo.replay --planner-config ...` produced only a text summary of trace metrics. After this PR, every run emits a Plotly HTML report showing replica counts, observed vs predicted traffic, TTFT/ITL vs SLA, per-engine KV load, engine capacity, and color-coded decision timelines — giving a per-tick view of how the planner reacted to the simulated workload.

What's in the PR

Replay diagnostics

  • HTML report — wire `DiagnosticsRecorder` into `ReplayPlannerAdapter` so offline replay produces the same interactive chart set as the live planner, plus:

    • `report_filename` config for a stable path (refresh-in-browser)
    • Throughput lower-bound dashed line on the Replica Counts chart (separate prefill / decode lines in disagg)
    • Combined queued + inflight decode KV with a horizontal `max_kv_tokens` capacity line
    • Per-component Load / Throughput decision timelines in disagg (prefill on y=2, decode on y=1)
    • Hidden per-engine legend entries to stop the legend covering charts during scaling
    • `connectgaps=True` on throughput-tick-only traces (engine RPS, predicted RPS/ISL/OSL)
  • TTFT / ITL plumbed through the traffic accumulator (Rust) — the mocker's `TraceCollector` already tracks arrival / first-token / per-token timestamps; now `TrafficAccumulator::on_request` reads `request_latencies()` and exposes `avg_ttft_ms` / `avg_itl_ms` via `drain_traffic()`. The Python adapter feeds these into `Metrics` so the report's "Observed TTFT / ITL vs SLA" charts have data.

  • Synthetic benchmark from AIC — instead of requiring the user to supply NPZ profiling data, replay now sweeps AIC's `predict_prefill` / `predict_decode` at a configurable `--benchmark-granularity` (default 8, matching the profiler) and bootstraps the planner's regression models with batch-size-aware data consistent with the mocker's timing. Prefill sweep caps at `max_num_batched_tokens`; agg-mode bootstrap combines prefill + decode points so both regression features have variance.

Planner correctness fixes (surfaced by replay diagnostics)

  • `find_best_engine_agg_rps` max_bs cap — the previous formula `int(max_num_batched_tokens / avg_ctx) * 2` collapsed to 2 whenever `ISL > MBT`, so `engine_rps` was wildly underestimated for prefill-heavy workloads. Replaced with a binary search in `[1, min(kv_cap, seq_cap)]` that enforces the prefill / decode steady-state balance `isl / (max_num_batched_tokens - x) <= osl`. Took engine_rps from 1.15 to ~3.5 on a representative workload.

  • `find_best_engine_prefill_rps` chunking — for `ISL > max_num_batched_tokens`, the wall-time prediction was being extrapolated beyond the regression's training domain. Now chunks `ISL` into `ceil(ISL / MBT)` forward passes and sums predictions per chunk.

  • `find_best_engine_decode_rps` capability caps — was using `_max_observed_kv / context_length` as the sweep upper bound, which stays low until the engine runs hot. Now uses `max_kv_tokens / context_length` and `max_num_seqs` from engine capabilities, with the observed-KV fallback only when capabilities aren't provided.

  • State-machine tick ordering — on a combined load + throughput tick, load scaling ran first with stale `throughput_lower_bound*`, then throughput updated the floor afterward, letting the load decision violate the brand-new floor. Swapped the order so throughput scaling updates the floor before load scaling reads it (precedence unchanged: load decision still wins over throughput decision when both produce one).

  • Throughput floor enforced on no-change — `_advance_load_disagg` and the SLA-mode agg path both returned `None` for "no change" before applying the floor, so replicas could linger below a raised floor for many ticks. Now computes the final count as `max(load_or_current, min_endpoint, throughput_floor)` first, then checks no-change against the post-floor value.

  • Relaxable negative coefficients — `DecodeRegressionModel` fits on `(num_decode_requests, sum_decode_kv_tokens)`; total KV dominates wall time, so `num_decode_requests` can flip slightly negative under multicollinearity + noisy FPMs. Added a `_relaxable_feature_indices` hook in the base class; decode marks index 0 as relaxable so minor negatives don't reject the whole fit. `sum_decode_kv_tokens` (and all agg / prefill features) still hard-reject on negatives.

  • Per-component decision reasons — `TickDiagnostics` now carries `{load,throughput}decision_reason{prefill,decode}` in disagg mode. Aggregate reason is derived by priority (`scale_up > scale_down_capped_by_throughput > scale_down > no_change`) for back-compat.

How to run

Trace preparation

Any Mooncake-format trace works. Example:

```bash
mkdir -p traces/mooncake_fast25
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl
-o traces/mooncake_fast25/toolagent_trace.jsonl
head -2361 traces/mooncake_fast25/toolagent_trace.jsonl > traces/mooncake_fast25/toolagent_trace_10pct.jsonl
```

Build Python bindings

```bash
.venv/bin/maturin develop --uv -m lib/bindings/python/Cargo.toml
export PYTHONPATH=lib/bindings/python/src:components/src
```

Install AIC (for throughput-based scaling)

```bash
uv pip install --python .venv/bin/python aiconfigurator
```

Agg mode with SLA-based scaling

```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace_10pct.jsonl
--planner-config '{
"mode": "agg",
"optimization_target": "sla",
"ttft": 2500,
"itl": 50,
"enable_throughput_scaling": true,
"enable_load_scaling": true,
"pre_deployment_sweeping_mode": "rapid",
"throughput_adjustment_interval": 60,
"load_adjustment_interval": 5,
"report_filename": "replay_agg.html"
}'
--extra-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--num-workers 2
--arrival-speedup-ratio 1.0
```

Disagg mode with SLA-based scaling

```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace_10pct.jsonl
--planner-config '{
"mode": "disagg",
"optimization_target": "sla",
"ttft": 2500,
"itl": 50,
"enable_throughput_scaling": true,
"enable_load_scaling": true,
"pre_deployment_sweeping_mode": "rapid",
"throughput_adjustment_interval": 60,
"load_adjustment_interval": 5,
"report_filename": "replay_disagg.html"
}'
--prefill-engine-args '{"worker_type": "prefill", "aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--decode-engine-args '{"worker_type": "decode", "aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}'
--num-prefill-workers 1
--num-decode-workers 1
--arrival-speedup-ratio 1.0
```

Results

Example run on 10% of the FAST25 toolagent trace (2,361 requests, ISL8.8k / OSL173).

Agg (90 planner ticks, SLA TTFT=2500ms / ITL=50ms):

Metric avg p90 p99
TTFT 861 ms 2,063 ms 4,765 ms
ITL 24 ms 29 ms 195 ms
E2E 5,165 ms 14,586 ms 31,729 ms

Disagg (83 planner ticks, same SLA):

Metric avg p90 p99
TTFT 952 ms 2,263 ms 4,612 ms
ITL 19 ms 26 ms 32 ms
E2E 4,443 ms 11,759 ms 19,147 ms

Disagg's ITL is tighter because decode is isolated from prefill chunking; agg takes a slightly lower avg TTFT because the same engine can start serving prefill without cross-worker KV transfer.

Full interactive reports

Both HTML reports (~4.7 MB each, with observed TTFT / ITL vs SLA, predicted vs observed traffic, per-engine KV queue + inflight with capacity line, engine RPS, load / throughput decision timelines):

(Hosted on a gist — htmlpreview renders the Plotly charts in-browser. Use "View raw" on this gist if the preview link stalls.)

Test plan

  • All 86 planner unit tests pass (`pytest components/src/dynamo/planner/tests/unit/`)
  • Agg SLA replay produces scaling events respecting throughput floor
  • Disagg SLA replay produces separate prefill / decode decision timelines
  • `connectgaps` renders sparse throughput traces as continuous lines
  • Pre-commit clean on all touched files

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Enhanced diagnostics for load and throughput scaling with per-component decision tracking (prefill vs. decode).
    • Extended performance models to respect KV token capacity and sequence concurrency limits.
    • Improved traffic metrics capturing first-token latency and inter-token latency data.
  • Improvements

    • Refined load scaling evaluation to properly apply post-bounds constraints.
    • Better throughput lower bound visualization in monitoring dashboards.
    • More accurate batch size calculations respecting token limits.

Open with Devin

tedzhouhk and others added 12 commits April 15, 2026 18:30
… replay

Wire DiagnosticsRecorder into ReplayPlannerAdapter so offline
planner-in-the-loop replay generates the same interactive Plotly
HTML report as the live planner.

Extend the Rust TrafficAccumulator to track per-request TTFT and
mean ITL from the TraceCollector, plumbing observed latency data
through drain_traffic() to the Python adapter and into the HTML
report's "Observed TTFT/ITL vs SLA" charts.

Bootstrap regression models from profiling data (NPZ) when
throughput scaling is enabled, enabling SLA-mode replay with
pre-deployment performance profiles.

Hide per-engine FPM traces from the Plotly legend to prevent
overlap with chart panels during dynamic scaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
- Plot throughput-based scaling lower bound as a dashed red line on
  the Replica Counts chart when it's above 1.
- Combine queued + inflight decode KV tokens into a single "total KV"
  trace per engine in the Decode Engine Load chart.
- Add a horizontal KV capacity line to the decode engine load chart
  from engine capabilities (max_kv_tokens).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Generate synthetic FPMs by sweeping the mocker's polynomial perf
model instead of loading external NPZ profiling data. This ensures
the planner's capacity estimates are consistent with the simulated
engine timing, avoiding overscaling from mismatched perf models.

Add --benchmark-granularity CLI flag (default: 8, matching the
profiler's default interpolation_granularity) to control the number
of sweep points.

Also expose throughput_lower_bound in TickDiagnostics and plot it
as a dashed line on the Replica Counts chart. Combine queued +
inflight decode KV tokens into a single trace and add a KV
capacity horizontal line to the Decode Engine Load chart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The previous formula ``int(max_num_batched_tokens / avg_ctx) * 2`` capped
the batch-size sweep at 2 for prefill-heavy workloads (e.g. ISL=8844,
max_num_batched_tokens=8192), causing engine_rps to be dramatically
underestimated and the throughput lower bound to be too high.

Replace with a binary search in [1, min(kv_cap, seq_cap)] that finds the
largest batch size satisfying the prefill/decode balance condition:

    isl / (max_num_batched_tokens - x) <= osl

Above this point, prefill cannot keep up with decode and TTFT grows
unbounded. Below it, additional concurrency is limited by KV cache or
max_num_seqs.

For our test workload (ISL=8844, OSL=173, MBT=8192, KV=262144), max_bs
goes from 2 to 29 and engine_rps from 1.15 to ~3.5, reducing the
throughput lower bound from 5-9 replicas to 2-3 replicas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Prefill:
- Document the linear-scaling assumption behind ``engine_rps = 1/wt``:
  under ``wt = k·sum_prefill_tokens``, batching multiple prefills gives
  the same engine_rps as one-at-a-time.
- Handle ``ISL > max_num_batched_tokens`` by chunking wall_time across
  forward passes rather than extrapolating the regression beyond its
  training domain (a single pass can never process more than MBT tokens).

Decode:
- Replace the observation-driven ``_max_observed_kv`` cap with engine
  capability caps (``max_kv_tokens``, ``max_num_seqs``) so the sweep
  reflects real engine capacity rather than runtime load history.
  The observed-KV cap is kept as a fallback when capabilities are not
  provided.

Replay:
- Cap the prefill benchmark sweep at ``max_num_batched_tokens`` instead
  of ``max_kv_tokens`` so AIC is never queried with per-pass token
  counts that a real engine could not process in a single iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The DecodeRegressionModel fits on (num_decode_requests, sum_decode_kv_tokens).
Total KV dominates wall time via attention compute; batch size has a
weaker secondary effect.  Under multicollinearity between the two features
and noisy FPM observations, the num_decode_requests coefficient can flip
slightly negative without violating physical monotonicity (total KV still
drives wall time up with load).

Previously the fit was rejected whenever any coefficient was negative,
which caused the decode ITL estimate to be None on most load ticks,
leaving the diagnostics report's "Estimated ITL vs SLA" chart sparse.

Add a ``_clippable_feature_indices`` hook in the base regression model;
subclasses can mark specific features as "clip to 0 if negative" rather
than rejecting the whole fit.  DecodeRegressionModel marks feature 0
(num_decode_requests) as clippable.  sum_decode_kv_tokens must remain
non-negative and will still trigger rejection if it turns negative.

Also combine prefill-only and decode-only benchmark FPMs for agg-mode
bootstrap so the agg regression sees variance on both features rather
than only sum_decode_kv_tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…P bound

Planner perf model:
- Rename ``_clippable_feature_indices`` to ``_relaxable_feature_indices``.
- Accept slightly-negative relaxable coefficients as-is instead of
  clipping them to 0 -- the dominant feature keeps predictions monotone
  and clipping was unnecessary complexity.

Diagnostics report:
- Add a separate dashed line for ``throughput_lower_bound_prefill`` in
  the Replica Counts subplot so disagg replays show both prefill and
  decode lower bounds.  Prefill is dark blue, decode is red, matching
  their respective replica-count traces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
On a combined tick (when load and throughput cadences align, e.g. every
60s with default intervals), the previous order ran load scaling first
using the stale ``_throughput_lower_bound_*``, then updated the bounds
from throughput scaling afterward.  Load could therefore decide to scale
below a floor that was about to be raised in the same tick, producing a
brief window where actual replicas violated the new lower bound.

Swap the order: run throughput scaling first so its updated lower bound
is visible to load scaling's floor check in the same tick.  Keep the
precedence rule (load decision wins when it produces one) so the fast
reactive loop still drives immediate scaling, while the throughput floor
is always up-to-date before load reads it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The disagg ``_advance_load_disagg`` and the SLA-mode agg
``_advance_load_agg`` paths both returned ``None`` for "no change"
before applying the throughput-scaling lower bound.  When load scaling
saw current replicas as sufficient, the floor check never ran, so
replicas could linger below a newly-raised throughput floor until load
scaling independently decided to scale up for other reasons.

Restructure both paths so the final replica count is computed as
``max(load_decision_or_current, min_endpoint, throughput_floor)`` and
the no-change early return is evaluated against the post-floor value.
This way a load tick following (or coincident with) a throughput tick
that raises the floor will scale up immediately, rather than waiting
several ticks for the load signal to independently cross the threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Add per-component load and throughput decision reasons to
TickDiagnostics (``load_decision_reason_prefill``,
``load_decision_reason_decode``, and same for throughput).  Disagg
load scaling sets each side's reason independently so the report's
Load / Throughput decision timelines can plot two tracks (prefill at
y=2, decode at y=1) with "prefill" / "decode" y-axis labels.  Agg
mode falls back to the existing single-track rendering.

The aggregate ``load_decision_reason`` is still populated (for
back-compat and Prometheus metrics) and is chosen by priority:
scale_up > scale_down_capped_by_throughput > scale_down > no_change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Engine RPS, Predicted RPS, Predicted ISL, and Predicted OSL are only
populated on throughput-scaling ticks (every 60s by default), so the
Plotly default of disconnecting lines at None values produces scattered
markers with no connecting line.  Enable ``connectgaps=True`` on these
four traces so the 60s data points form a continuous line across gaps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
- Split AIC bootstrap error handling: distinct try/except for session
  creation vs benchmark FPM generation, and validate FPM lists are
  non-empty before calling load_benchmark_fpms.
- Add explicit boundary comment + assertion for prefill chunking at
  ``isl == max_num_batched_tokens``.
- Expand the agg max_bs balance-formula comment to show the full
  admission-rate >= egress-rate derivation.
- Clarify state machine's ``_next_throughput_s`` advancement on ticks
  without traffic data.
- Preserve legitimate zero TTFT/ITL values: gate on ``num_req > 0``
  instead of ``avg_ttft_ms > 0`` so cache-hit or synthetic-zero
  latencies are still recorded in the diagnostics report.
- Rename "Scaling events" to "Replica transitions" in the HTML report
  summary and count actual count changes rather than decision ticks.
- Document the report's agg vs disagg layout switch for the decision
  timeline subplots.
- Explain in the base regression model docstring why only
  DecodeRegressionModel uses ``_relaxable_feature_indices``.
- Add a maintenance note on ``TrafficStats`` so future field additions
  also update the PyO3 binding and Python adapter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk requested review from a team as code owners April 16, 2026 20:37
@tedzhouhk tedzhouhk changed the title feat(planner): planner-in-the-loop replay diagnostics + capacity estimation fixes fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fix Apr 16, 2026
@tedzhouhk tedzhouhk changed the title fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fix fix(replay/planner): planner-in-the-loop replay diagnostics + planner scaling logics fixes Apr 16, 2026
@github-actions github-actions Bot added fix and removed feat labels Apr 16, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Walkthrough

This PR enhances scaling decision diagnostics by segregating load and throughput reasons per component (prefill/decode), extends performance regression models with resource constraints (KV tokens, sequence counts, batch sizes), and enriches traffic tracking with latency metrics (TTFT, inter-token latency) across the Rust backend and Python planner.

Changes

Cohort / File(s) Summary
Load Scaling Diagnostics
components/src/dynamo/planner/core/load_scaling.py
Removed early-return for unchanged load decisions; refactored to apply bounds/budgets before concluding no scaling. Added per-component diagnostic computation (distinguishing scale_up, scale_down, scale_down_capped_by_throughput, no_change) and aggregate prioritization logic. Updated aggregated desired handling to allow throughput enforcement even on unchanged load decisions.
Performance Model Constraints
components/src/dynamo/planner/core/perf_model/agg.py, decode.py, prefill.py
Extended regression model APIs with optional resource constraints: find_best_engine_agg_rps and find_best_engine_decode_rps now accept max_kv_tokens and max_num_seqs; find_best_engine_prefill_rps accepts max_num_batched_tokens. Prefill adds conditional chunking logic when input size exceeds batch limit. Decode and agg compute batch size upper bound as minimum of KV-capacity, sequence concurrency, and balance constraints via binary search.
Regression Model Foundation
components/src/dynamo/planner/core/perf_model/base.py
Introduced _relaxable_feature_indices to allow selected regression coefficients to be negative without invalidating the fit; updated validation to reject only non-relaxable negative coefficients.
State Machine & Scaling Resolution
components/src/dynamo/planner/core/state_machine.py, throughput_scaling.py
Added per-component diagnostic fields (_diag_load_reason_prefill, _diag_load_reason_decode, _diag_throughput_reason_prefill, _diag_throughput_reason_decode) and reordered tick resolution: throughput scaling executes first, load scaling follows, and load decision takes precedence when present. Extended regression queries to pass capability limits (KV tokens, sequence counts, batch sizes) to perf models.
Diagnostics Data & Reporting
components/src/dynamo/planner/core/types.py, monitoring/diagnostics_recorder.py
Extended TickDiagnostics with throughput lower bounds and per-component decision reasons for prefill/decode. Updated DiagnosticsRecorder to plumb max_kv_tokens and enriched Plotly charts: conditional lower-bound overlay lines, disaggregated decision timelines with per-component markers, combined KV token visualizations, and replica transition counting instead of decision presence.
Traffic Latency Tracking (Python)
components/src/dynamo/planner/offline/replay_adapter.py
Refined traffic tracking to conditionally record latency fields (TTFT, inter-token latency) only when requests are present; moved _last_traffic updates from diagnostics recording to traffic-metric-building phase.
Traffic Statistics Struct & Drain APIs (Rust)
lib/mocker/src/replay/offline/components/types.rs, agg.rs, disagg.rs, planner_handle.rs, lib/bindings/python/rust/llm/replay.rs
Created new TrafficStats struct (adds avg_ttft_ms, avg_itl_ms to existing duration/count/throughput fields). Updated TrafficAccumulator::on_request to accept optional latency tuple and accumulate TTFT/ITL metrics. Changed drain APIs across agg, disagg, and planner_handle to return TrafficStats instead of tuple. Extended Python binding to expose new latency fields in JSON output.
Latency Collection & Benchmarking
lib/mocker/src/replay/collector.rs, components/src/dynamo/planner/offline/replay_adapter.py, lib/bindings/python/src/dynamo/replay/main.py
Added request_latencies() method to compute per-request TTFT and inter-token latency from trace data. Integrated AIC-based synthetic benchmark generation into replay planner when model is not "easy", sweeping input parameters with engine KV budget constraints and merging points for aggregated mode.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title clearly identifies the PR's main focus: planner-in-the-loop replay diagnostics and planner scaling logic fixes, which aligns with the primary changes across multiple files.
Description check ✅ Passed The PR description comprehensively covers the changes, objectives, use cases, and test results, but does not follow the template structure with explicit sections for Overview, Details, Where to start, and Related Issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@jthomson04 jthomson04 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One nit (can be for a follow up): Would it be possible to get the x-axes of all the charts to scale together? If I want to narrow down on a specific time period, I have to separately zoom into that area on each graph.

Correctness:
- load_scaling: distinguish throughput-floor lifts from min_endpoint /
  global-budget lifts when labelling ``scale_down_capped_by_throughput``
  so the decision timeline reflects the actual cause.
- load_scaling: mirror aggregate reason onto per-component fields on
  ``no_fpm_data`` / ``worker_count_mismatch`` early returns so disagg
  timelines stay consistent with the top-level reason.
- base perf model: add 1e-6 noise tolerance for relaxable negative
  coefficients; clamp anything beyond the tolerance back to 0 so the
  model never predicts lower wall time for higher feature values.
- throughput_scaling: when one disagg component's regression isn't
  fitted yet, label the other side ``partner_not_ready`` instead of
  leaving its per-component reason ``None``.
- Rust TrafficAccumulator: track ``ttft_count`` separately from
  ``num_req`` so ``avg_ttft_ms`` reflects only requests that actually
  produced a TTFT sample (mirroring the existing ``itl_count`` pattern).

Quality:
- prefill perf model: replace ``assert`` with a comment stating the
  ``remainder ∈ (0, max_num_batched_tokens]`` invariant (asserts are
  stripped under ``python -O``).
- replay main: narrow the broad ``except Exception`` around AIC session
  creation and benchmark generation to the concrete exception types
  those paths can raise, preserving graceful degradation without
  swallowing unrelated bugs.
- replay main: decouple ``benchmark_granularity`` from the decode
  batch-size ceiling; the sweep now steps up to ``max_num_seqs`` with
  ``granularity`` points per axis, giving the regression variance on
  ``num_decode_requests`` at realistic concurrency levels.
- load_scaling (agg): remove unreachable ``else`` branch after the
  ``desired == num_workers`` early return.
- diagnostics recorder: add ``partner_not_ready`` to the throughput
  decision colour map (pink) so the new reason renders correctly.

Documentation:
- Rust bridge ``drain_traffic`` docstring: enumerate all six keys with
  units so the Python-facing API doc matches the returned payload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Run cargo fmt to satisfy the rust-tests CI formatting check on the
three mocker replay files touched by the latency-tracking changes.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Mypy CI failed because LoadScalingMixin / ThroughputScalingMixin
declared only the aggregate ``_diag_*_reason`` scratch fields while
the state machine initialized four new per-component variants with
``Optional[str] = None``.  Declare the per-component fields in the
mixins too so the assignments in state_machine.__init__ type-check.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk
Copy link
Copy Markdown
Contributor Author

Looks good. One nit (can be for a follow up): Would it be possible to get the x-axes of all the charts to scale together? If I want to narrow down on a specific time period, I have to separately zoom into that area on each graph.

sure, added a TODO

Note the reviewer request to share the x-axis across all subplots
so zooming on one chart narrows all of them to the same time range.
Deferred to a follow-up since the current report is shipping.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk enabled auto-merge (squash) April 16, 2026 23:15
Mypy flagged passing ``Optional[str]`` (aic_system, aic_model_path)
to ``create_session`` which expects ``str``.  Tighten the pre-check
so the bootstrap branch only runs when all three required AIC fields
are non-None, mirroring the existing ``aic_backend is None`` guard.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
devin-ai-integration[bot]

This comment was marked as resolved.

Devin flagged two issues on #8280:

1. In disagg mode the benchmark generator used prefill_engine_args for
   both prefill and decode sweeps because ``ref_args`` resolved to
   ``prefill_engine_args`` when ``extra_engine_args`` was None.  Decode
   engines typically have much higher ``max_num_seqs`` and larger KV
   caches than prefill, so decoding regressions were trained on data
   that didn't cover the actual decode engine's operating range.

   Split ``_generate_aic_benchmark_fpms`` into separate prefill and
   decode helpers and pick the appropriate engine args per call site:
   agg uses ``extra_engine_args`` for both, disagg uses
   ``prefill_engine_args`` / ``decode_engine_args`` respectively.

2. ``ForwardPassMetrics`` / ``ScheduledRequestMetrics`` were imported
   inside the function body, violating the project's Python guideline
   that first-party imports belong at the top of the file.  Move them
   to module-level imports.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk merged commit 0c7068d into main Apr 17, 2026
90 of 92 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/replay-planner-diagnostics branch April 17, 2026 19:39
indrajit96 pushed a commit that referenced this pull request Apr 20, 2026
… scaling logics fixes (#8280)

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants