
feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus #9683

Open: kaim-eng wants to merge 1 commit into pr1a/power-agent from pr1b/planner-infra


@kaim-eng kaim-eng commented May 18, 2026

Part of the PR #9369 split plan.
This is PR 2 of 6 (PR 1b — Planner Power Infrastructure).

Predecessor: #9682 — Power Agent DaemonSet (touches disjoint files; either can land first per plan §4.5)
Successor: #9684 (Draft) — Power budget enforcement

Scope

Planner-side wiring from config to pod annotation; nothing consumes it for scaling yet. Phase 1 only: the AIC closed-loop optimizer and admission coupling are deferred to PR 3.

~3,300 lines · 20 files. The 6 files shared with PR 3 are hand-authored here, restricted to their Phase-1 lines (plan §2.2.2 has the full include/exclude table).

  • 5 power-aware config fields (enable_power_awareness defaults False; total_gpu_power_limit, prefill_engine_gpu_power_limit, decode_engine_gpu_power_limit, power_agent_safe_default_watts)
  • _apply_power_annotations() + _publish_power_budget_metrics() in core/base.py — emit dynamo.nvidia.com/gpu-power-limit per worker pod + the three Phase-1 power-budget gauges
  • KubernetesConnector / KubernetesAPI plumbing — patch_pod_annotation, list_pods_by_label, get_component_pods, list_frontend_pods, resolve_frontend_http_port
  • DCGM get_total_dgd_power() Prometheus query + NaN-filter fix in _get_metric_increase
  • Helm chart 1.2.0 → 1.2.1 bump across Chart.yaml / README.md / components/operator/Chart.yaml / components/operator/templates/planner.yaml (pod RBAC rules)
  • deploy/planner-pod-rbac-dev.yaml — dev RBAC patch for operator charts ≤ 1.2.0
  • 121 new unit-test cases + 11 live-integration cases (incl. real-mutation disruptive opt-in)
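
For reviewers who want a concrete picture of the config bullet, here is a minimal sketch of a required-when-enabled check over the five fields. It assumes a plain dataclass named `PowerConfigSketch` (hypothetical; the real `PlannerConfig` may use a different validation framework), assumes the limits are expressed in watts, and treats only `total_gpu_power_limit` and `power_agent_safe_default_watts` as required, as the commit message describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PowerConfigSketch:
    """Hypothetical stand-in for the five power-aware planner fields."""

    enable_power_awareness: bool = False  # off by default, per the PR
    total_gpu_power_limit: Optional[float] = None  # unit assumed: watts
    prefill_engine_gpu_power_limit: Optional[float] = None
    decode_engine_gpu_power_limit: Optional[float] = None
    power_agent_safe_default_watts: Optional[float] = None

    def __post_init__(self) -> None:
        if not self.enable_power_awareness:
            return  # nothing is required while the feature is off
        required = {
            "total_gpu_power_limit": self.total_gpu_power_limit,
            "power_agent_safe_default_watts": self.power_agent_safe_default_watts,
        }
        missing = [name for name, value in required.items() if value is None]
        if missing:
            raise ValueError(
                "enable_power_awareness=True requires: " + ", ".join(missing)
            )
```

With this shape, constructing the config with defaults succeeds, while enabling the flag without the two required fields fails at construction time rather than on the first planner tick.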

Reviewer onboarding

  • Design context (lands in PR 5, readable from this branch via git show pr5/docs-devenv:docs/design-docs/powerplanner-design.md):
    • §5.5 — DCGM power queries
    • §6.4 — pod annotation orchestrator
    • §7 — planner ↔ power-agent annotation contract
  • Dev environment: docs/components/planner/dpp-dev-env.md §8 (one-shot dev-pod sweep) — also in PR 5
  • Plan sections: §2.2 (this PR), §2.2.2 (Phase 1 hunks split table for the 6 shared files), §3.2 (review-effort heuristic), §5 (risk areas)

Tests at this tip (dev pod against running llama-quickstart DGD, 2026-05-18)

| Suite | Result |
| --- | --- |
| Full planner sweep (excl. test_virtual_connector.py, per plan §4.4) | 483 passed, 2 skipped in 12.46 s |
| Same sweep with RUN_DISRUPTIVE_TESTS=1 | 484 passed, 1 skipped; disruptive real-mutation passes |
| Phase-1 critical files verbose (test_actuation_knobs.py + test_kubernetes_connector.py + test_prometheus.py) | 121 / 121 PASSED |
| test_actuation_knobs_live.py non-disruptive | 9 passed, 2 skipped |
| test_actuation_knobs_live.py with disruptive opt-in | 10 passed, 1 skipped |
| Power Agent regression at PR 1b tip | 43 / 43 PASSED; no PR 1a regressions |
| Pre-commit on origin/main..pr1b/planner-infra | 14 hooks pass, 4 skipped; marker compliance Missing sets: 0 in CI-faithful pod run |

The 2 non-disruptive skips are both expected and documented in plan §2.2.3 / §2.6.2:

  • TestPostBusyThresholdLive::test_post_busy_threshold_returns_2xx → frontend image in the dev cluster lacks /busy_threshold (admission control is a Phase-3 endpoint, arrives in PR 3).
  • TestScalingRealMutation::test_set_component_replicas_mutates_dgd_spec → opt-in via RUN_DISRUPTIVE_TESTS=1; passes when enabled.
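
The opt-in behaviour of the second skip can be illustrated with a small sketch. This is not the suite's actual mechanism, which is not shown in this PR description; it uses the standard-library `unittest.skipUnless` decorator, and the helper `disruptive_enabled` and the class name are hypothetical. It also assumes the variable must equal exactly `1` to enable the test.

```python
import os
import unittest


def disruptive_enabled(env=None) -> bool:
    """Return True only when RUN_DISRUPTIVE_TESTS is exactly "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_DISRUPTIVE_TESTS") == "1"


class ScalingRealMutationSketch(unittest.TestCase):
    # The decorator is evaluated once, when the class body is executed.
    @unittest.skipUnless(disruptive_enabled(), "set RUN_DISRUPTIVE_TESTS=1 to run")
    def test_mutates_cluster_state(self):
        # Placeholder body: the real test mutates a live DGD spec.
        self.assertTrue(disruptive_enabled())
```

Under this scheme a default run reports the test as skipped, and exporting `RUN_DISRUPTIVE_TESTS=1` before the run turns the skip into a pass or a failure.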

Merge strategy

Rebase-and-merge (no squash). See plan §4.3.

Stacked-PR cascade

Per plan §4.5: this PR is Ready for Review alongside PR 1a (#9682). PRs 2–5 are held in Draft until both foundations approve. The two foundation PRs touch disjoint files; either can land first. PR 1b's diff against main is identical to its diff against pr1a/power-agent.

@kaim-eng kaim-eng requested review from a team as code owners May 18, 2026 15:54

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the feat, documentation, deployment::k8s, planner, and actions labels May 18, 2026
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch from 2222dbf to 1d7b243 Compare May 18, 2026 16:03
self._publish_inventory_and_gpu_hours(tick_input)
effects = self.state_machine.on_tick(next_tick, tick_input)
await self._apply_effects(effects)
await self._apply_power_annotations() # Phase 1: annotate pods with GPU power limit

This calls the pod-annotation side effect on every tick even when config.advisory is true, so advisory mode can still mutate Kubernetes pods. Fix: gate this call or _apply_power_annotations with self.config.advisory before any pod listing or patching.
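
One way the suggested gate could look, as a minimal sketch rather than the planner's actual code: `ConfigSketch`, `FakeKubeAPI`, `PlannerSketch`, and the `pods` mapping are hypothetical stand-ins, and the async signature of `patch_pod_annotation` is an assumption. Only the annotation key and the `advisory` / `enable_power_awareness` flag names come from this PR.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ConfigSketch:
    enable_power_awareness: bool = True
    advisory: bool = False


@dataclass
class FakeKubeAPI:
    """Records patch calls instead of talking to a cluster."""

    patched: list = field(default_factory=list)

    async def patch_pod_annotation(self, pod: str, key: str, value: str) -> None:
        self.patched.append((pod, key, value))


class PlannerSketch:
    POWER_ANNOTATION_KEY = "dynamo.nvidia.com/gpu-power-limit"

    def __init__(self, config, api, pods):
        self.config = config
        self.api = api
        self.pods = pods  # hypothetical: pod name -> power limit in watts

    async def _apply_power_annotations(self) -> None:
        # Return before any pod listing or patching when the feature is off
        # or the planner runs in advisory mode.
        if not self.config.enable_power_awareness or self.config.advisory:
            return
        for pod, watts in self.pods.items():
            await self.api.patch_pod_annotation(
                pod, self.POWER_ANNOTATION_KEY, str(watts)
            )
```

Placing the check inside `_apply_power_annotations` rather than at the call site keeps every future caller covered by the same guard.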

@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch from 1d7b243 to 8126b1a Compare May 19, 2026 12:57
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py
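
To illustrate the NaN-filter item above: the body of `_get_metric_increase` is not shown here, so the following is only a sketch of what such a filter might look like. It assumes the response follows the Prometheus HTTP API instant-vector format, where each sample value arrives as a string and may be `"NaN"`; the function name `non_nan_samples` is hypothetical.

```python
import math
from typing import Any, Dict, List


def non_nan_samples(response: Dict[str, Any]) -> List[float]:
    """Extract sample values from an instant-vector response, dropping NaN."""
    values = []
    for series in response.get("data", {}).get("result", []):
        # Each instant-vector entry carries value = [timestamp, "string"].
        raw = series.get("value", [None, "NaN"])[1]
        sample = float(raw)  # float() parses the string "NaN"
        if not math.isnan(sample):
            values.append(sample)
    return values
```

Without a filter like this, a single NaN sample would propagate through any sum or average computed from the result.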

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch from 8126b1a to f7f3226 Compare May 19, 2026 15:54

Labels

actions, deployment::k8s, documentation, feat, planner, size/XXL


2 participants