feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus#9683
Open
kaim-eng wants to merge 1 commit into
dynamo-ops reviewed May 18, 2026
    self._publish_inventory_and_gpu_hours(tick_input)
    effects = self.state_machine.on_tick(next_tick, tick_input)
    await self._apply_effects(effects)
    await self._apply_power_annotations()  # Phase 1: annotate pods with GPU power limit
This calls the pod-annotation side effect on every tick even when config.advisory is true, so advisory mode can still mutate Kubernetes pods. Fix: gate this call or _apply_power_annotations with self.config.advisory before any pod listing or patching.
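One way to apply the suggested gate, as a minimal sketch rather than the PR's actual code: only `config.advisory`, `enable_power_awareness`, and `_apply_power_annotations` come from the diff and commit message; `SketchConfig`, `FakeKube`, `SketchPlanner`, `run_tick`, the pod names, and the 300 W value are hypothetical stand-ins. The early return sits before any pod listing, so advisory mode never reaches the patch call.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    # Hypothetical config; only the two flag names come from the PR.
    advisory: bool = False
    enable_power_awareness: bool = True


@dataclass
class FakeKube:
    """Hypothetical stand-in that records calls instead of talking to a cluster."""
    listed: int = 0
    patched: list = field(default_factory=list)

    def list_pods(self):
        self.listed += 1
        return ["worker-0", "worker-1"]

    def patch_annotation(self, pod, watts):
        self.patched.append((pod, watts))


class SketchPlanner:
    def __init__(self, config, kube):
        self.config = config
        self.kube = kube

    async def _apply_power_annotations(self, watts=300):
        # Gate first: advisory mode must neither list nor patch pods.
        if self.config.advisory or not self.config.enable_power_awareness:
            return
        for pod in self.kube.list_pods():
            self.kube.patch_annotation(pod, watts)


def run_tick(advisory):
    """Run one annotation pass and return the recorded Kubernetes calls."""
    kube = FakeKube()
    planner = SketchPlanner(SketchConfig(advisory=advisory), kube)
    asyncio.run(planner._apply_power_annotations())
    return kube
```

Putting the check inside `_apply_power_annotations` rather than at the call site keeps the guarantee intact if another caller is added later.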
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py

Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
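The "required-when-enabled" validator named in the commit message could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the five field names come from the commit message and PR description, but the class name `PowerConfigSketch`, the `Optional[int]` types, the watt units, and the use of a plain dataclass (rather than whatever validation framework `PlannerConfig` actually uses) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PowerConfigSketch:
    # Field names follow the PR; types and units (watts) are assumed.
    enable_power_awareness: bool = False
    total_gpu_power_limit: Optional[int] = None
    prefill_engine_gpu_power_limit: Optional[int] = None
    decode_engine_gpu_power_limit: Optional[int] = None
    power_agent_safe_default_watts: Optional[int] = None

    def __post_init__(self):
        if not self.enable_power_awareness:
            return  # feature off: nothing is required
        # Only these two fields are described as required when enabled.
        required = ("total_gpu_power_limit", "power_agent_safe_default_watts")
        missing = [name for name in required if getattr(self, name) is None]
        if missing:
            raise ValueError(
                "enable_power_awareness=True requires: " + ", ".join(missing)
            )
```

With this shape, `PowerConfigSketch()` constructs cleanly because the flag defaults to `False`, while `PowerConfigSketch(enable_power_awareness=True)` raises a `ValueError` naming both missing fields.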
Part of the PR #9369 split plan.
This is PR 2 of 6 (PR 1b — Planner Power Infrastructure).
Predecessor: #9682 — Power Agent DaemonSet (touches disjoint files; either can land first per plan §4.5)
Successor: #9684 (Draft) — Power budget enforcement
Scope
Planner-side wiring from config to pod annotation; nothing uses it yet for scaling. Phase 1 only: the AIC closed-loop optimizer and admission coupling are deferred to PR 3.
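To make the config-to-annotation path concrete, here is a hedged sketch of the patch body such a reconciliation step might send. The annotation key `dynamo.nvidia.com/gpu-power-limit` and the constant name `POWER_ANNOTATION_KEY` come from this PR; the helper names `build_power_annotation_patch` and `needs_patch`, and the choice to serialize the limit as a plain integer string, are assumptions for illustration.

```python
from typing import Optional

POWER_ANNOTATION_KEY = "dynamo.nvidia.com/gpu-power-limit"


def build_power_annotation_patch(watts: int) -> dict:
    """Return a merge-patch body that sets the power-limit annotation."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    # Kubernetes annotation values are strings, so the number is serialized.
    # The plain-integer format is an assumption, not confirmed by the PR.
    return {"metadata": {"annotations": {POWER_ANNOTATION_KEY: str(watts)}}}


def needs_patch(current_annotations: Optional[dict], watts: int) -> bool:
    """True when the pod's annotation is absent or differs from the target."""
    current = (current_annotations or {}).get(POWER_ANNOTATION_KEY)
    return current != str(watts)
```

A body of this shape can be passed to the official Python client's `CoreV1Api.patch_namespaced_pod(name, namespace, body)`. Checking `needs_patch` first keeps a per-tick loop from issuing a write for pods that already carry the right value.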
~3,300 lines · 20 files. The 6 shared files with PR 3 are hand-authored here at the Phase-1 line subset (plan §2.2.2 has the full include/exclude table).
- PlannerConfig fields (enable_power_awareness defaults False; total_gpu_power_limit, prefill_engine_gpu_power_limit, decode_engine_gpu_power_limit, power_agent_safe_default_watts)
- _apply_power_annotations() + _publish_power_budget_metrics() in core/base.py: emit dynamo.nvidia.com/gpu-power-limit per worker pod + the three Phase-1 power-budget gauges
- KubernetesConnector / KubernetesAPI plumbing: patch_pod_annotation, list_pods_by_label, get_component_pods, list_frontend_pods, resolve_frontend_http_port
- get_total_dgd_power() Prometheus query + NaN-filter fix in _get_metric_increase
- Chart.yaml / README.md / components/operator/Chart.yaml / components/operator/templates/planner.yaml (pod RBAC rules)
- deploy/planner-pod-rbac-dev.yaml: dev RBAC patch for operator charts ≤ 1.2.0
Reviewer onboarding
- git show pr5/docs-devenv:docs/design-docs/powerplanner-design.md):
- docs/components/planner/dpp-dev-env.md §8 (one-shot dev-pod sweep), also in PR 5
Tests at this tip (dev pod against running llama-quickstart DGD, 2026-05-18)
- test_virtual_connector.py, per plan §4.4)
- RUN_DISRUPTIVE_TESTS=1
- test_actuation_knobs.py + test_kubernetes_connector.py + test_prometheus.py)
- test_actuation_knobs_live.py non-disruptive
- test_actuation_knobs_live.py with disruptive opt-in
- origin/main..pr1b/planner-infra
- Missing sets: 0 in CI-faithful pod run
The 2 non-disruptive skips are both expected and documented in plan §2.2.3 / §2.6.2:
- TestPostBusyThresholdLive::test_post_busy_threshold_returns_2xx → frontend image in the dev cluster lacks /busy_threshold (admission control is a Phase-3 endpoint, arrives in PR 3).
- TestScalingRealMutation::test_set_component_replicas_mutates_dgd_spec → opt-in via RUN_DISRUPTIVE_TESTS=1; passes when enabled.
Merge strategy
Rebase-and-merge (no squash). See plan §4.3.
Stacked-PR cascade
Per plan §4.5: this PR is Ready for Review alongside PR 1a (#9682). PRs 2–5 are held in Draft until both foundation PRs are approved. The two foundation PRs touch disjoint files; either can land first. PR 1b's diff against main is identical to its diff against pr1a/power-agent.