feat(planner): AIC closed-loop optimizer #9685

Draft
kaim-eng wants to merge 1 commit into pr2/power-budget from pr3/aic-optimizer

kaim-eng commented May 18, 2026

Part of the PR #9369 split plan.
This is the fourth of six PRs in the split, labeled PR 3 in the plan (AIC Closed-Loop Optimizer + Integration Tests). Held in Draft per plan §4.5.

Predecessor: #9684 — Power budget enforcement
Successor: #9686 (Draft) — Stress testbed

Scope

This is the algorithm-heavy PR of the split. It adds AICPowerOptimizer, the correction-coefficient EMA engine that observes live TTFT/ITL/power, detects drift, and re-invokes the AIC sweep, and wires it into the planner run loop. It also includes the frontend admission-control fanout, the AIC-specific integration tests, and the live-cluster metric-path tests, and it completes the 6 shared files split with PR 1b.

~4,400 lines · 12 files.

  • monitoring/aic_power_optimizer.py (715 LOC) — c_ttft / c_itl / c_power_p/d/agg correction coefficients, EMA gates, drift detection, hysteresis
  • tests/integration/test_aic_power_optimizer.py (943 LOC) — executable spec for the EMA mechanics
  • tests/integration/test_aic_power_e2e_sim.py (710 LOC) — end-to-end closed-loop sim
  • tests/integration/test_metric_paths_live.py (558 LOC) — live-cluster Prometheus path checks
  • core/base.py Phase-3 hunks — _aic_optimizer, _apply_aic_config, _fanout_busy_threshold, _min_prefill_max_num_batched_tokens, AIC tick-loop block
  • config/{defaults,planner_config}.py — remaining 11 AIC fields + admission block
  • monitoring/{planner_metrics,traffic_metrics}.py — AIC gauges/counters + remaining DCGM queries (get_avg_per_gpu_power_by_component, get_avg_request_duration)
  • core/types.py — 5 new TrafficObservation fields (TTFT/ITL/throughput/scheduled tokens)
  • profiler/utils/aic_dataframe.py — defensive fix (14 lines)
  • tests/unit/test_prometheus.py — remaining ~550 lines of Phase-3 test classes
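
For reviewers unfamiliar with the pattern, here is a minimal sketch of a gated EMA correction coefficient of the kind listed above. It is not the code in monitoring/aic_power_optimizer.py: the class name, the `alpha` default, the gate condition, and the observed/predicted ratio form are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EmaCoefficient:
    """Hypothetical correction coefficient: EMA of observed / predicted."""

    alpha: float = 0.2           # smoothing factor (assumed value)
    value: float = 1.0           # 1.0 means the model matches observation
    min_predicted: float = 1e-9  # gate: ignore degenerate predictions

    def update(self, observed: float, predicted: float) -> float:
        # Gate: skip samples that cannot produce a meaningful ratio,
        # leaving the coefficient unchanged.
        if observed <= 0.0 or predicted <= self.min_predicted:
            return self.value
        ratio = observed / predicted
        # Standard exponential moving average update.
        self.value = self.alpha * ratio + (1.0 - self.alpha) * self.value
        return self.value


c_ttft = EmaCoefficient(alpha=0.5)
c_ttft.update(observed=3.0, predicted=2.0)  # ratio 1.5, value becomes 1.25
c_ttft.update(observed=0.0, predicted=2.0)  # gated out, value stays 1.25
```

A coefficient above 1.0 in this sketch would mean the live metric is running worse than the model predicted, so a later sweep could scale its predictions by that factor.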

Reviewer onboarding

Recommended reading order before diving into the diff (plan §2.4 + §3.2):

  1. docs/design-docs/powerplanner-design.md §5.3 — correction coefficients + EMA gates
  2. §5.6 — drift detection, hysteresis, _estimated_throughput reference
  3. §5.1 — aic_to_planner_cap bridge between AIC and autoscaling
  4. §5.7 + §6.7 — frontend /busy_threshold admission coupling
  5. §8 (failure modes 1–14) — fail-open principle and the power_w clamp (row 14)

Then read monitoring/aic_power_optimizer.py against §5.3/§5.6/§5.1 and core/base.py Phase-3 additions against §6.4/§6.7. The integration tests are the executable specification.
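
As orientation for §5.6, drift detection with hysteresis can be pictured roughly as below. This is a sketch under stated assumptions, not the optimizer's actual logic: the class name, both thresholds, and the relative-deviation metric are hypothetical, and the design doc remains the authority on what the reference value is.

```python
from dataclasses import dataclass


@dataclass
class DriftDetector:
    """Hypothetical two-threshold drift detector."""

    enter_threshold: float = 0.20  # deviation that triggers a re-sweep (assumed)
    exit_threshold: float = 0.10   # deviation below which drift clears (assumed)
    drifting: bool = False

    def observe(self, measured: float, reference: float) -> bool:
        """Return True only on the tick where drift is first detected."""
        if reference <= 0.0:
            return False  # no usable reference; report no drift
        deviation = abs(measured - reference) / reference
        if not self.drifting and deviation > self.enter_threshold:
            self.drifting = True
            return True
        if self.drifting and deviation < self.exit_threshold:
            self.drifting = False
        return False


det = DriftDetector()
samples = (105.0, 125.0, 115.0, 130.0, 95.0, 125.0)
events = [det.observe(m, reference=100.0) for m in samples]
# events == [False, True, False, False, False, True]
```

The gap between the two thresholds is what keeps a metric hovering near the boundary from triggering a sweep on every tick.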

Tests at this tip (projection per plan §4.4 — will re-measure on dev pod before Ready)

  • All PR 1a + 1b + 2 tests still pass
  • test_aic_power_optimizer.py — 34 passed
  • test_aic_power_e2e_sim.py — 15 passed
  • test_metric_paths_live.py (live cluster) — 22 passed, 3 skipped (DCGM-per-pod not emitted in some clusters; open question #14)
  • test_prometheus.py remaining additions — ~35 passed
  • Plan projection: ~610 passed, 4 skipped (planner) + 43 (power_agent) = ~653 total

Merge strategy

Rebase-and-merge (no squash).

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adds the correction-coefficient EMA engine that observes live
TTFT/ITL/power, detects drift against the planner's expected operating
point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p,
n_d) plus implied frontend admission thresholds.

Components:
- AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction
  coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with
  EMA smoothing, drift detection with hysteresis, optimize() sweep
  driver, defensive power_w clamp at nameplate TDP (guards against
  AIC's non-physical interpolator extrapolations at sparse data-grid
  corners), and the aic_to_planner_cap bridge.
- core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING
  import; AIC startup sweep in _async_init(); _apply_aic_config()
  (caps + replica targets + drift-ref pin + admission gauges + POST
  fanout); _fanout_busy_threshold() with B6 absolute-prefill defense;
  _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block
  in run() with disagg vs agg branches and conditional re-sweep;
  traffic observation enrichment with AIC-driven fields.
- 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg,
  total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens.
- 13 AIC config fields + 3 admission config fields on PlannerConfig
  with validator (enable_aic_optimizer requires enable_power_awareness
  AND aic_interpolation).
- get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient
  and supporting documentation on DirectRouterMetricsClient (callable
  surface unchanged).
- 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics.

Test suites (~2,000 lines new tests):
- 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering
  EMA mechanics, drift detection, sweep failure modes, and the
  power_w clamp.
- 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests
  with the fake-AIC + fake-Prometheus harness.
- 558-line test_metric_paths_live.py: live-cluster validation of
  DCGM per-component queries.
- 505 additional lines on test_prometheus.py: TestMetricsIsValid,
  TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource,
  TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component
  half of TestPowerAwareDcgmQueries.
- profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC
  table lookups.
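
The power_w clamp exercised by the tests above can be pictured with a short sketch. The function name and signature are hypothetical, and the lower bound of zero is an added assumption; the commit text only states that power_w is clamped at nameplate TDP.

```python
def clamp_power_w(power_w: float, tdp_w: float) -> float:
    """Clamp a predicted per-GPU power draw to the range [0, tdp_w]."""
    if tdp_w <= 0.0:
        raise ValueError("tdp_w must be positive")
    # Upper bound: nameplate TDP. Lower bound of 0 is an assumption.
    return min(max(power_w, 0.0), tdp_w)


# An extrapolated prediction above TDP is pulled back to TDP.
capped = clamp_power_w(power_w=950.0, tdp_w=700.0)   # 700.0
# An in-range prediction passes through unchanged.
kept = clamp_power_w(power_w=420.0, tdp_w=700.0)     # 420.0
```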

The PR 1b hand-authored intermediate of the 6 shared files plus this PR's
verbatim checkout reproduces kaim/power-planner byte-for-byte (validated
via git diff).

Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng force-pushed the pr3/aic-optimizer branch from 2761ac2 to 3edb72e on May 19, 2026 15:55