feat(planner): AIC closed-loop optimizer #9685
Draft
kaim-eng wants to merge 1 commit into
This was referenced May 18, 2026
Adds the correction-coefficient EMA engine that observes live TTFT/ITL/power, detects drift against the planner's expected operating point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p, n_d) plus implied frontend admission thresholds.

Components:
- AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with EMA smoothing, drift detection with hysteresis, optimize() sweep driver, defensive power_w clamp at nameplate TDP (guards against AIC's non-physical interpolator extrapolations at sparse data-grid corners), and the aic_to_planner_cap bridge.
- core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING import; AIC startup sweep in _async_init(); _apply_aic_config() (caps + replica targets + drift-ref pin + admission gauges + POST fanout); _fanout_busy_threshold() with B6 absolute-prefill defense; _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block in run() with disagg vs agg branches and conditional re-sweep; traffic observation enrichment with AIC-driven fields.
- 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg, total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens.
- 13 AIC config fields + 3 admission config fields on PlannerConfig with validator (enable_aic_optimizer requires enable_power_awareness AND aic_interpolation).
- get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient and supporting documentation on DirectRouterMetricsClient (callable surface unchanged).
- 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics.

Test suites (~2,000 lines new tests):
- 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering EMA mechanics, drift detection, sweep failure modes, and the power_w clamp.
- 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests with the fake-AIC + fake-Prometheus harness.
- 558-line test_metric_paths_live.py: live-cluster validation of DCGM per-component queries.
- 505 additional lines on test_prometheus.py: TestMetricsIsValid, TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource, TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component half of TestPowerAwareDcgmQueries.
- profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC table lookups.

The PR 1b hand-authored intermediate of the 6 shared files plus this PR's verbatim checkout reproduces kaim/power-planner byte-for-byte (validated via git diff).

Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
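For reviewers new to the mechanics named in the commit message, the following is a minimal sketch of an EMA-smoothed correction coefficient paired with hysteresis-based drift detection. It is not the code in monitoring/aic_power_optimizer.py: the class names, the smoothing factor, and the enter/exit thresholds are all hypothetical, and the real optimizer may define the coefficient and the drift reference differently.

```python
from dataclasses import dataclass


@dataclass
class CorrectionCoefficient:
    """EMA of observed/predicted ratios (hypothetical helper, not the PR's class)."""

    alpha: float = 0.2  # assumed smoothing factor
    value: float = 1.0  # 1.0 means prediction and observation agree

    def update(self, observed: float, predicted: float) -> float:
        # Skip non-positive samples so a missing metric cannot corrupt the EMA.
        if observed <= 0 or predicted <= 0:
            return self.value
        ratio = observed / predicted
        self.value = self.alpha * ratio + (1.0 - self.alpha) * self.value
        return self.value


@dataclass
class DriftDetector:
    """Hysteresis: trip above `enter`, clear only once below `exit_`."""

    enter: float = 0.15  # assumed relative deviation that starts a drift episode
    exit_: float = 0.05  # assumed relative deviation that ends it
    drifting: bool = False

    def check(self, coefficient: float, reference: float) -> bool:
        deviation = abs(coefficient - reference) / reference
        if not self.drifting and deviation > self.enter:
            self.drifting = True
        elif self.drifting and deviation < self.exit_:
            self.drifting = False
        return self.drifting


# Usage: observed TTFT is twice the predicted TTFT for two ticks.
c_ttft = CorrectionCoefficient(alpha=0.5)
c_ttft.update(observed=200.0, predicted=100.0)
c_ttft.update(observed=200.0, predicted=100.0)
detector = DriftDetector()
needs_resweep = detector.check(c_ttft.value, reference=1.0)
```

The gap between the two thresholds is what keeps a coefficient hovering near the trip point from triggering a re-sweep on every tick.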
Part of the PR #9369 split plan.
This is PR 4 of 6 (PR 3 — AIC Closed-Loop Optimizer + Integration Tests). Held in Draft per plan §4.5.
Predecessor: #9684 — Power budget enforcement
Successor: #9686 (Draft) — Stress testbed
Scope
The algorithm-rich PR. Adds AICPowerOptimizer, the correction-coefficient EMA engine that observes live TTFT/ITL/power, detects drift, and re-invokes the AIC sweep. Wires it into the planner run loop. Includes frontend admission-control fanout, AIC-specific integration tests, and the live-cluster metric-paths tests. Completes the 6 shared files split with PR 1b.
~4,400 lines · 12 files.
- monitoring/aic_power_optimizer.py (715 LOC): c_ttft / c_itl / c_power_p/d/agg correction coefficients, EMA gates, drift detection, hysteresis
- tests/integration/test_aic_power_optimizer.py (943 LOC): executable spec for the EMA mechanics
- tests/integration/test_aic_power_e2e_sim.py (710 LOC): end-to-end closed-loop sim
- tests/integration/test_metric_paths_live.py (558 LOC): live-cluster Prometheus path checks
- core/base.py Phase-3 hunks: _aic_optimizer, _apply_aic_config, _fanout_busy_threshold, _min_prefill_max_num_batched_tokens, AIC tick-loop block
- config/{defaults,planner_config}.py: remaining 11 AIC fields + admission block
- monitoring/{planner_metrics,traffic_metrics}.py: AIC gauges/counters + remaining DCGM queries (get_avg_per_gpu_power_by_component, get_avg_request_duration)
- core/types.py: 5 new TrafficObservation fields (TTFT/ITL/throughput/scheduled tokens)
- profiler/utils/aic_dataframe.py: defensive fix (14 lines)
- tests/unit/test_prometheus.py: remaining ~550 lines of Phase-3 test classes
Reviewer onboarding
Recommended reading order before diving into the diff (plan §2.4 + §3.2):
- docs/design-docs/powerplanner-design.md §5.3: correction coefficients + EMA gates
- _estimated_throughput reference
- aic_to_planner_cap bridge between AIC and autoscaling
- /busy_threshold admission coupling
- power_w clamp (row 14)
Then read monitoring/aic_power_optimizer.py against §5.3/§5.6/§5.1 and core/base.py Phase-3 additions against §6.4/§6.7. The integration tests are the executable specification.
Tests at this tip (projection per plan §4.4; will re-measure on dev pod before Ready)
- test_aic_power_optimizer.py: 34 passed
- test_aic_power_e2e_sim.py: 15 passed
- test_metric_paths_live.py (live cluster): 22 passed, 3 skipped (DCGM-per-pod not emitted in some clusters; open question #14)
- test_prometheus.py remaining additions: ~35 passed
Merge strategy
Rebase-and-merge (no squash).
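Reviewer aid: the commit message names two defensive rules, the PlannerConfig validator (enable_aic_optimizer requires enable_power_awareness AND aic_interpolation) and the power_w clamp at nameplate TDP. A rough sketch of both follows. PlannerFlags and clamp_power_w are hypothetical stand-ins written with plain dataclasses; the actual PlannerConfig validator and clamp in this PR may be implemented differently.

```python
from dataclasses import dataclass


@dataclass
class PlannerFlags:
    """Hypothetical stand-in for the three PlannerConfig switches involved."""

    enable_aic_optimizer: bool = False
    enable_power_awareness: bool = False
    aic_interpolation: bool = False

    def __post_init__(self) -> None:
        # The optimizer is only meaningful when both prerequisites are on.
        if self.enable_aic_optimizer and not (
            self.enable_power_awareness and self.aic_interpolation
        ):
            raise ValueError(
                "enable_aic_optimizer requires enable_power_awareness "
                "and aic_interpolation"
            )


def clamp_power_w(predicted_w: float, tdp_w: float) -> float:
    """Cap a predicted per-GPU power at the nameplate TDP (upper bound only)."""
    return min(predicted_w, tdp_w)


# Usage: a sparse-grid extrapolation of 950 W on an assumed 700 W part is capped.
capped = clamp_power_w(950.0, tdp_w=700.0)
```

Failing at construction time means a misconfigured planner stops at startup rather than partway through a sweep.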