feat: OAI compatible endpoints for TRTLLM #14
Merged
rmccorm4
reviewed
Mar 4, 2025
rmccorm4
reviewed
Mar 4, 2025
GuanLuo
approved these changes
Mar 5, 2025
kylehh
pushed a commit
to kylehh/dynamo
that referenced
this pull request
Apr 11, 2025
Co-authored-by: Tanmay Verma <tanmayv@nvidia.com>
Co-authored-by: Tanmay Verma <tanmayv@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 12, 2026
Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data.

Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1: infrastructure
- Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
  Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
- Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods.
- K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
- Operator chart fix: ClusterRole now grants `pods` permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2: budget enforcement
- _apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled; validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3: AIC optimizer
- monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures.
- core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration).

Phase 4: pipeclean (using existing single-TDP data)
- tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls.
- tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
- tools/compute_power_budget.py: rack-capacity → safe-budget helper.
- examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------

- Operator chart 1.2.0 ClusterRole had no `pods` rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded.
- Stale label selectors in connectors/kubernetes.py: `dynamo-graph-deployment` → `dynamo-graph-deployment-name`, `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
- DCGM queries used `pod=~` (matched the DCGM exporter pod) and the wrong namespace; corrected to `exported_pod=~` + `exported_namespace=<k8s-namespace>` and pod regex now matches the operator's `<dgd>-<replica-idx>-<service-key>-...` format (monitoring/traffic_metrics.py).
- Frontend-source metric methods returned NaN on quiet clusters (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN guard, frontend didn't; added matching `math.isnan` filter.
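The NaN guard from the last bug fix can be illustrated with a minimal sketch. It assumes the Prometheus result has already been parsed into Python floats; `safe_average`, `drop_nan`, and the `default` parameter are hypothetical names used only for illustration and are not taken from monitoring/traffic_metrics.py.

```python
import math


def safe_average(sum_increase: float, count_increase: float, default: float = 0.0) -> float:
    """Return sum/count, falling back to `default` when the ratio is undefined.

    On a quiet cluster both increases are 0, so
    increase(_sum)/increase(_count) is 0/0 and Prometheus reports NaN.
    """
    # Hypothetical helper: guard the inputs before dividing.
    if count_increase == 0 or math.isnan(sum_increase) or math.isnan(count_increase):
        return default
    value = sum_increase / count_increase
    # inf/inf also yields NaN, so guard the result as well.
    return default if math.isnan(value) else value


def drop_nan(samples: list[float]) -> list[float]:
    """Filter NaN samples out of an already-parsed query result."""
    return [s for s in samples if not math.isnan(s)]
```

Returning a default rather than propagating NaN keeps downstream comparisons meaningful, since every ordering comparison against NaN evaluates to False.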
Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace):

Suite                                                 Pass     Skip  Fail
----------------------------------------------------  -------  ----  ----
planner/tests/unit/                                   465      0     0
planner/tests/integration/test_aic_power_e2e_sim      15       0     0
planner/tests/integration/test_aic_power_optimizer    34       0     0
planner/tests/integration/test_metric_paths_live      22-23    3-2   0
planner/tests/integration/test_actuation_knobs_live   10-11    1-0   0
power_agent/tests/                                    43       0     0
----------------------------------------------------  -------  ----  ----
Total (cold deploy / after sanity chat completion)    590-591  4-3   0

Documented skips:
- test_frontend_metric_series_exists: passes after any traffic.
- TestDirectRouterMetricsClientLive::*: LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question #14 in design doc).
- TestScalingRealMutation: opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md "From-Scratch Repro Script" section.

Compatibility
-------------

Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
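As a rough illustration of the Phase 2 budget enforcement described in this commit message, the sketch below clamps a proposed replica count to a watt budget. It is not the code in core/state_machine.py: the function name, its parameters, the per-replica wattage inputs (assumed here to be GPUs per engine times the per-GPU power limit), and the "trim the hungrier pool first" policy are all assumptions made for the example.

```python
def clamp_to_power_budget(
    prefill_replicas: int,
    decode_replicas: int,
    prefill_watts_per_replica: float,
    decode_watts_per_replica: float,
    total_budget_watts: float,
    min_replicas: int = 1,
) -> tuple[int, int]:
    """Reduce replica counts until projected power fits the budget.

    Hypothetical policy: remove one replica at a time from whichever pool
    currently draws more power, never going below `min_replicas`.
    """
    p, d = prefill_replicas, decode_replicas

    def projected(p_count: int, d_count: int) -> float:
        return p_count * prefill_watts_per_replica + d_count * decode_watts_per_replica

    while projected(p, d) > total_budget_watts and (p > min_replicas or d > min_replicas):
        decode_draw = d * decode_watts_per_replica
        prefill_draw = p * prefill_watts_per_replica
        if d > min_replicas and (p <= min_replicas or decode_draw >= prefill_draw):
            d -= 1
        else:
            p -= 1
    # If both pools are at the floor the budget may still be exceeded;
    # a real planner would have to surface that condition to the operator.
    return p, d


if __name__ == "__main__":
    # 4 prefill + 4 decode replicas at 700 W each against a 4200 W budget.
    print(clamp_to_power_budget(4, 4, 700.0, 700.0, 4200.0))
```

Trimming one replica at a time keeps the result as close as possible to the requested scale, at the cost of a loop whose length grows with the size of the overshoot.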
kaim-eng
added a commit
that referenced
this pull request
May 12, 2026
kaim-eng
added a commit
that referenced
this pull request
May 18, 2026
What does the PR do?