feat(power-agent): per-node power-cap enforcement DaemonSet by kaim-eng · Pull Request #9682 · ai-dynamo/dynamo

kaim-eng · 2026-05-18T15:53:54Z

Part of the PR #9369 split plan.
This is PR 1 of 6 (PR 1a — Power Agent).

Predecessor: none (base = main)
Successor: #9683 (depends on this PR via stacked branch pr1b/planner-infra)

Scope

Standalone Power Agent DaemonSet — per-node GPU power-cap enforcement.
~1,500 lines · 10 files · all net-new, zero modifications to existing files.
Zero import dynamo.planner.*. Reviewable in isolation from the planner.

components/power_agent/power_agent.py (529 LOC) — NVML clamp helper, cgroup parser, multi-pod policy, SIGTERM handler, UUID-gated orphan-cap restoration
components/power_agent/tests/ — 43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy, shutdown
deploy/power_agent/{daemonset,dev-pod,rbac}.yaml — Kubernetes manifests
.github/filters.yaml — planner CI filter extended to power_agent paths

Reviewer onboarding

Design context: docs/design-docs/powerplanner-design.md §7 — per-node power-cap enforcement (lands in PR 5; readable from this branch tip too via git show pr5/docs-devenv:docs/design-docs/powerplanner-design.md)
Annotation contract: dynamo.nvidia.com/gpu-power-limit is the only coupling to the planner
Plan sections: §2.1 (this PR), §2.1.2 (standalone-component property)

Tests at this tip

components/power_agent/tests/ — 43 passed in 0.73 s on dev pod (2026-05-18)
Pre-commit on origin/main..pr1a/power-agent: 14 hooks pass, 4 skipped (no matching files). pytest-marker-report reconfirmed in pod with Missing sets: 0. The Windows-host run shows a known pre-existing fcntl import error in tests/utils/port_utils.py:10 (POSIX-only stdlib in an untouched file) — plan §7 acknowledges this; CI runs on Linux are unaffected.

Merge strategy

Rebase-and-merge (no squash) — preserves 1 commit in main history per plan §4.3. Please configure this on the PR before merging.

Stacked-PR cascade

Per plan §4.5: this PR is Ready for Review alongside PR 1b (#9683). PRs 2–5 are held in Draft until both foundations approve. The two foundation PRs touch disjoint files; either can land first.

copy-pr-bot · 2026-05-18T15:53:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-18T15:59:24Z

Walkthrough

This PR introduces a complete Kubernetes DaemonSet agent that enforces per-GPU NVML power caps on worker nodes. The daemon reconciles every 15 seconds, mapping running GPU processes to pod UIDs via cgroup parsing, reading power-limit annotations, resolving multi-pod conflicts, and applying caps via NVML with Prometheus metrics and graceful shutdown support.

Changes

Power Agent: GPU NVML Power Cap Enforcement DaemonSet

Layer / File(s)	Summary
Module foundation and utilities `components/power_agent/power_agent.py` (lines 1–213)	Module header, optional lazy imports for NVML/Kubernetes/Prometheus with fallbacks, logging, constants, cgroup parsing regex for extracting pod UIDs from `/proc/{pid}/cgroup`, atomic JSON persistence for managed GPU state, UUID normalization for pynvml version compatibility, and Prometheus metrics infrastructure with noop fallback when unavailable.
NVML cap application and multi-pod policy `components/power_agent/power_agent.py` (lines 220–404), `components/power_agent/tests/test_apply_cap.py`, `components/power_agent/tests/test_multi_pod_policy.py`	NVML helper functions to clamp watts to device min/max constraints and apply power caps with logging and metric updates; multi-pod-per-GPU resolution policy that applies an agreed cap when all pods match, otherwise applies safe default on conflicts or missing/invalid annotations, incrementing error counters; unit tests validating clamping boundaries, NVML error tolerance, UUID handling (`bytes` vs `str`), and all policy scenarios (single pod, agreement, conflict, invalid annotations, mixed None/valid).
Shutdown and startup recovery `components/power_agent/power_agent.py` (lines 280–337), `components/power_agent/tests/test_shutdown.py`	SIGTERM/SIGINT handler restores managed GPU power limits to NVML defaults and signals main loop exit; startup orphan recovery loads persisted GPU UUIDs, checks idle status, restores default caps for idle GPUs, and updates persisted state; unit tests validating per-GPU restoration, graceful shutdown with no managed GPUs, and NVML error handling.
PowerAgent orchestration `components/power_agent/power_agent.py` (lines 411–577)	PowerAgent class initializes NVML/Kubernetes clients, loads orphan state, lists node pods, maps running PIDs to pod UIDs, builds UID-to-annotation mappings, resolves GPU cap via policy, applies caps via NVML, and persists managed state; blocking reconcile loop runs every 15 seconds; CLI main() entry point accepts `--safe-default-watts`, optional node/namespace scope, and Prometheus port, then starts daemon.
Cgroup parser validation `components/power_agent/tests/test_cgroup_parser.py`	Unit tests for cgroup UID extraction across cgroup v1/v2 with systemd and cgroupfs layouts, QoS classes (Guaranteed/Burstable/BestEffort), and non-Kubernetes inputs; validates OSError handling and first-match-wins behavior when multiple cgroup lines exist.
Kubernetes deployment `deploy/power_agent/rbac.yaml`, `deploy/power_agent/daemonset.yaml`, `deploy/power_agent/dev-pod.yaml`	RBAC ServiceAccount, ClusterRole with pod read-access, and ClusterRoleBinding; production DaemonSet runs on GPU-labeled nodes with hostPID, privileged container, nvidia runtime, environment variables for safe default (500 W) and node name, resource limits, and volume mounts for `/proc` (read-only) and persisted state at `/var/lib/dynamo-power-agent`; development Pod harness for manual testing and validation on live clusters.
Documentation and CI integration `components/power_agent/README.md`, `.github/filters.yaml`	README documents Power Agent's 15-second reconciliation, cgroup UID extraction, annotation reading, NVML power cap enforcement, troubleshooting steps (annotations, logs), Prometheus metrics (applied limit, conflict/safe-default/failure counters), and shutdown/startup semantics; CI filter update triggers planner jobs on power_agent component and deployment changes.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: introduction of a per-node power-cap enforcement DaemonSet for the Power Agent.
Description check	✅ Passed	The PR description is comprehensive and covers all required template sections: overview (PR 1 of 6, split from `#9369`), detailed scope, reviewer onboarding, test results, and merge strategy. Some optional details like specific file callouts are included.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

components/power_agent/power_agent.py (1)
537-539: ⚡ Quick win

Move argparse import to module scope.

Line 538 imports inside main(), which violates the repo’s Python import placement rule.
Proposed fix
+import argparse
 import json
 import logging
@@
 def main() -> None:
-    import argparse
-
     parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")
As per coding guidelines: "Keep all imports at the top of each file (flag any import inside functions/classes)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 537 - 539, The argparse
import is inside the main() function; move the import to module scope by adding
"import argparse" at the top of the file and removing the in-function import in
main(), ensuring any type or usage remains correct (adjust other imports if
needed) so linting and the repo rule about top-level imports are satisfied.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 119-124: The function _load_previously_managed_gpus should
defensively handle malformed or unreadable state files: when reading
_MANAGED_STATE_PATH, catch OSError in addition to
FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data =
json.load(f)), verify isinstance(data, dict) before calling
data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids
is not an iterable of strings) return an empty set; ensure any
TypeError/ValueError from unexpected types is handled and results in returning
set() rather than letting startup crash.
- Around line 294-297: Replace the broad except that swallows errors from the
pynvml.nvmlShutdown() call with a targeted handler: catch pynvml.NVMLError (or
the specific pynvml error class) as e, log the exception with the module/class
logger (e.g., logger.exception("NVML shutdown failed: %s", e) or
logging.exception(...)) and then re-raise the error to avoid silent failures;
modify the try/except around pynvml.nvmlShutdown() accordingly so only
NVML-specific errors are caught, logged, and propagated.
- Around line 494-501: The loop that builds pod_annotations appends one entry
per GPU process, causing pods with multiple processes to be counted multiple
times; change the logic in the procs loop (using _extract_pod_uid_from_cgroup
and uid_to_annotation) to deduplicate by pod UID before applying multi-pod
policy — e.g., track seen UIDs (or build a uid->annotation map) and only append
one (uid, annotation) pair per unique UID so each pod is counted once when
evaluating multi-pod warnings/metrics.

In `@deploy/power_agent/daemonset.yaml`:
- Around line 16-35: The DaemonSet metadata.namespace is hardcoded to "default",
which causes mismatches with the RBAC/service account namespace; update the
manifest to use the parameterized namespace variable (e.g.
${POWER_AGENT_NAMESPACE}) instead of "default" and ensure the ServiceAccount
referenced by spec.serviceAccountName ("dynamo-power-agent") is created/bound in
that same parameterized namespace so RBAC bindings resolve correctly; locate
metadata.name ("dynamo-power-agent"), metadata.namespace, and
spec.serviceAccountName in the template to make the change.
- Around line 55-58: Replace the mutable image tag
"nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference (a
specific version tag or an image digest), e.g.
"nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or
"nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are
reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate
for pinned images and update the inline comment to note the image is
intentionally pinned.

---

Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 537-539: The argparse import is inside the main() function; move
the import to module scope by adding "import argparse" at the top of the file
and removing the in-function import in main(), ensuring any type or usage
remains correct (adjust other imports if needed) so linting and the repo rule
about top-level imports are satisfied.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f9faddf8-30e0-4a7b-9055-b45244b2a979

📥 Commits

Reviewing files that changed from the base of the PR and between f20ff4e and 3cde449.

📒 Files selected for processing (11)

.github/filters.yaml
components/power_agent/README.md
components/power_agent/power_agent.py
components/power_agent/tests/__init__.py
components/power_agent/tests/test_apply_cap.py
components/power_agent/tests/test_cgroup_parser.py
components/power_agent/tests/test_multi_pod_policy.py
components/power_agent/tests/test_shutdown.py
deploy/power_agent/daemonset.yaml
deploy/power_agent/dev-pod.yaml
deploy/power_agent/rbac.yaml

coderabbitai · 2026-05-18T15:59:27Z

+def _load_previously_managed_gpus() -> set[str]:
+    try:
+        with open(_MANAGED_STATE_PATH) as f:
+            return set(json.load(f).get("managed_uuids", []))
+    except (FileNotFoundError, json.JSONDecodeError):
+        return set()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden managed-state parsing to avoid startup crashes on malformed file content.

Line 122 assumes the JSON root is a dict; if the file contains a list/string/corrupt schema, .get(...) raises and can abort startup. Also unreadable-file OSError cases (e.g., permissions) are currently uncaught.

Proposed fix

def _load_previously_managed_gpus() -> set[str]: try: with open(_MANAGED_STATE_PATH) as f: - return set(json.load(f).get("managed_uuids", [])) - except (FileNotFoundError, json.JSONDecodeError): + payload = json.load(f) + except (OSError, json.JSONDecodeError) as e: + logger.warning("Failed to read managed GPU state file: %s", e) return set() + + if not isinstance(payload, dict): + logger.warning("Managed GPU state file has invalid root type: %s", type(payload)) + return set() + managed = payload.get("managed_uuids", []) + if not isinstance(managed, list): + logger.warning("managed_uuids must be a list, got: %s", type(managed)) + return set() + return {u for u in managed if isinstance(u, str)}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/power_agent.py` around lines 119 - 124, The function _load_previously_managed_gpus should defensively handle malformed or unreadable state files: when reading _MANAGED_STATE_PATH, catch OSError in addition to FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data = json.load(f)), verify isinstance(data, dict) before calling data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids is not an iterable of strings) return an empty set; ensure any TypeError/ValueError from unexpected types is handled and results in returning set() rather than letting startup crash.

coderabbitai · 2026-05-18T15:59:27Z

+    try:
+        pynvml.nvmlShutdown()
+    except Exception:
+        pass


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently swallow NVML shutdown failures.

Line 296 catches all exceptions and drops them. This hides shutdown problems and complicates recovery/debugging.

Proposed fix

try: pynvml.nvmlShutdown() - except Exception: - pass + except Exception as e: + logger.warning("nvmlShutdown failed during SIGTERM handling: %s", e)

As per coding guidelines: "Fail fast: avoid broad exception swallowing; catch only specific exceptions you can handle, and if you catch Exception you must log + re-raise."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try:

pynvml.nvmlShutdown()

except Exception:

pass

try:

pynvml.nvmlShutdown()

except Exception as e:

logger.warning("nvmlShutdown failed during SIGTERM handling: %s", e)

raise

🧰 Tools

🪛 Ruff (0.15.12)

[error] 296-297: try-except-pass detected, consider logging the exception

(S110)

[warning] 296-296: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/power_agent.py` around lines 294 - 297, Replace the broad except that swallows errors from the pynvml.nvmlShutdown() call with a targeted handler: catch pynvml.NVMLError (or the specific pynvml error class) as e, log the exception with the module/class logger (e.g., logger.exception("NVML shutdown failed: %s", e) or logging.exception(...)) and then re-raise the error to avoid silent failures; modify the try/except around pynvml.nvmlShutdown() accordingly so only NVML-specific errors are caught, logged, and propagated.

coderabbitai · 2026-05-18T15:59:27Z

+        pod_annotations: list[tuple[str, Optional[str]]] = []
+        for proc in procs:
+            uid = _extract_pod_uid_from_cgroup(proc.pid)
+            if uid is None:
+                continue  # non-K8s process — skip
+            if uid in uid_to_annotation:
+                pod_annotations.append((uid, uid_to_annotation[uid]))
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Deduplicate by pod UID before applying multi-pod policy.

Line 500 appends one entry per GPU process. A single pod with multiple processes is then counted as “multi-pod”, skewing warnings/metrics and policy disposition.

Proposed fix

- pod_annotations: list[tuple[str, Optional[str]]] = [] + pod_annotations_by_uid: dict[str, Optional[str]] = {} for proc in procs: uid = _extract_pod_uid_from_cgroup(proc.pid) if uid is None: continue # non-K8s process — skip if uid in uid_to_annotation: - pod_annotations.append((uid, uid_to_annotation[uid])) + pod_annotations_by_uid[uid] = uid_to_annotation[uid] + + pod_annotations = list(pod_annotations_by_uid.items())

Based on learnings: "Review this logic to avoid assuming a strict 1:1 mapping ... or extend the implementation to correctly handle multiple pods per GPU ... and pods missing power annotations."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

pod_annotations: list[tuple[str, Optional[str]]] = []

for proc in procs:

uid = _extract_pod_uid_from_cgroup(proc.pid)

if uid is None:

continue # non-K8s process — skip

if uid in uid_to_annotation:

pod_annotations.append((uid, uid_to_annotation[uid]))

pod_annotations_by_uid: dict[str, Optional[str]] = {}

for proc in procs:

uid = _extract_pod_uid_from_cgroup(proc.pid)

if uid is None:

continue # non-K8s process — skip

if uid in uid_to_annotation:

pod_annotations_by_uid[uid] = uid_to_annotation[uid]

pod_annotations = list(pod_annotations_by_uid.items())

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/power_agent.py` around lines 494 - 501, The loop that builds pod_annotations appends one entry per GPU process, causing pods with multiple processes to be counted multiple times; change the logic in the procs loop (using _extract_pod_uid_from_cgroup and uid_to_annotation) to deduplicate by pod UID before applying multi-pod policy — e.g., track seen UIDs (or build a uid->annotation map) and only append one (uid, annotation) pair per unique UID so each pod is counted once when evaluating multi-pod warnings/metrics.

coderabbitai · 2026-05-18T15:59:27Z

+metadata:
+  name: dynamo-power-agent
+  namespace: default
+  labels:
+    app: dynamo-power-agent
+spec:
+  selector:
+    matchLabels:
+      app: dynamo-power-agent
+  updateStrategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 1
+  template:
+    metadata:
+      labels:
+        app: dynamo-power-agent
+    spec:
+      serviceAccountName: dynamo-power-agent
+      hostPID: true


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Parameterize the DaemonSet namespace to prevent RBAC/SA mismatches.

Line 18 hardcodes default, while RBAC is set up to support ${POWER_AGENT_NAMESPACE}. Deploying outside default can break ServiceAccount resolution/binding.

Suggested fix

metadata: name: dynamo-power-agent - namespace: default + namespace: ${POWER_AGENT_NAMESPACE} labels: app: dynamo-power-agent

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

metadata:

name: dynamo-power-agent

namespace: default

labels:

app: dynamo-power-agent

spec:

selector:

matchLabels:

app: dynamo-power-agent

updateStrategy:

type: RollingUpdate

rollingUpdate:

maxUnavailable: 1

template:

metadata:

labels:

app: dynamo-power-agent

spec:

serviceAccountName: dynamo-power-agent

hostPID: true

metadata:

name: dynamo-power-agent

namespace: ${POWER_AGENT_NAMESPACE}

labels:

app: dynamo-power-agent

spec:

selector:

matchLabels:

app: dynamo-power-agent

updateStrategy:

type: RollingUpdate

rollingUpdate:

maxUnavailable: 1

template:

metadata:

labels:

app: dynamo-power-agent

spec:

serviceAccountName: dynamo-power-agent

hostPID: true

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/power_agent/daemonset.yaml` around lines 16 - 35, The DaemonSet metadata.namespace is hardcoded to "default", which causes mismatches with the RBAC/service account namespace; update the manifest to use the parameterized namespace variable (e.g. ${POWER_AGENT_NAMESPACE}) instead of "default" and ensure the ServiceAccount referenced by spec.serviceAccountName ("dynamo-power-agent") is created/bound in that same parameterized namespace so RBAC bindings resolve correctly; locate metadata.name ("dynamo-power-agent"), metadata.namespace, and spec.serviceAccountName in the template to make the change.

coderabbitai · 2026-05-18T15:59:27Z

+        - name: power-agent
+          # Replace with your registry/image
+          image: nvcr.io/nvidia/dynamo/power-agent:latest
+          imagePullPolicy: IfNotPresent


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an immutable image reference instead of latest.

Line 57 uses a mutable tag, which makes rollouts non-reproducible and can introduce unreviewed runtime changes.

Suggested fix

- image: nvcr.io/nvidia/dynamo/power-agent:latest + image: nvcr.io/nvidia/dynamo/power-agent:<release-tag>`@sha256`:<digest>

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/power_agent/daemonset.yaml` around lines 55 - 58, Replace the mutable image tag "nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference (a specific version tag or an image digest), e.g. "nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or "nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate for pinned images and update the inline comment to note the image is intentionally pinned.

dynamo-ops · 2026-05-18T16:48:59Z

+subjects:
+  - kind: ServiceAccount
+    name: dynamo-power-agent
+    namespace: ${POWER_AGENT_NAMESPACE}


The manifest cannot be applied as-is because ${POWER_AGENT_NAMESPACE} is not a valid Kubernetes namespace value and kubectl apply -n ... does not substitute fields inside a ClusterRoleBinding. Fix: set a concrete namespace that matches the DaemonSet namespace or template this manifest through Helm/Kustomize/envsubst before shipping it.

grahamking · 2026-05-18T17:12:00Z

@@ -0,0 +1,577 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0


In all your files, you only need these first two SPDX lines. You can delete the rest of the copyright header.

Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods for the dynamo.nvidia.com/gpu-power-limit annotation, parses /proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup driver x CRI runtime variants), and calls nvmlDeviceSetPowerManagementLimit per physical GPU. Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds, UUID-gated orphan-cap restoration on startup, SIGTERM-handler default- TGP restore, and a multi-pod-per-GPU policy that falls back to power_agent_safe_default_watts on conflict. 43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy, shutdown. Standalone component - no coupling to planner internals. CI filter (.github/filters.yaml) extends the planner group to cover the new component + deploy paths so changes here trip the planner job set. Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single Helm chart at deploy/helm/charts/power-agent/, resolving the three CodeRabbit findings on PR #9682: * hardcoded metadata.namespace=default -> {{ .Release.Namespace }} * mutable image :latest -> required image.tag with fail-fast validator * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating The chart supports three deployment shapes selectable via values: production DaemonSet (default, cluster-wide RBAC), namespace-restricted production (Role+RoleBinding), and an in-cluster dev-iteration Pod mounting power_agent.py from a ConfigMap. Three template-time validators reject foot-guns at install time: empty image.tag, mutex violations between daemonset.enabled and dev.enabled, and dev mode without a pinned dev.nodeName. Dev mode also automatically forces namespace-scoped RBAC (least privilege), leveraging power_agent.py's --namespace flag. Design rationale, scope decisions, and review-feedback responses are captured in docs/design-docs/power-agent-helm-chart-plan.md, committed alongside the chart. components/power_agent/README.md flips its install recipe to ``helm install``, and the planner CI filter (.github/filters.yaml) is retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**. Two examples/deployments/powerplanner/*.yaml header references live on PR #9687 and will be updated during that PR's cascade rebase per plan section 5.3. Validated locally: helm lint -> 0 errors helm template (3 positive exercises) -> expected resources helm template (3 negative exercises) -> expected fail-fast errors components/power_agent/tests/ -> 43/43 passed .github/scripts/test-filters.js -> 20/20 passed pre-commit (cross-cutting hooks) -> all applicable passed Signed-off-by: Kai Ma <kaim@nvidia.com>

@grahamking

…S.txt copyright-checks (regex requiring `#`-prefixed SPDX lines) rejected the Go-template-comment header in _helpers.tpl and the missing header in NOTES.txt on commit 8e490a7. Switch both to the same 2-line `#`-prefixed form already used by the other 9 chart files, aligning with @grahamking's PR review preference (minimal SPDX, no Apache boilerplate). For NOTES.txt this adds two visible lines to the post-`helm install` banner, matching the project convention established by deploy/helm/charts/platform/templates/NOTES.txt. For _helpers.tpl the header sits outside any `{{- define -}}` block, so it is never invoked at render time and is invisible to consumers despite not being inside a Go template comment. Re-validated locally: helm lint -> 0 errors helm template (default mode) -> 4 expected kinds pre-commit on the two touched files -> all applicable hooks pass Signed-off-by: Kai Ma <kaim@nvidia.com>

Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682: * disagg-power-aware.yaml header recipe * README.md Prerequisites + verify section * MULTI_DGD.md file index Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``. Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions. Part of the PR #9369 cascade following PR #9682''s Helm chart landing. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng requested review from a team as code owners May 18, 2026 15:53

pull-request-size Bot added the size/XXL label May 18, 2026

github-actions Bot added feat documentation Improvements or additions to documentation actions labels May 18, 2026

kaim-eng mentioned this pull request May 18, 2026

feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus #9683

Open

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03

dynamo-ops reviewed May 18, 2026

View reviewed changes

grahamking reviewed May 18, 2026

View reviewed changes

kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54

github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(power-agent): per-node power-cap enforcement DaemonSet#9682

feat(power-agent): per-node power-cap enforcement DaemonSet#9682
kaim-eng wants to merge 3 commits into
mainfrom
pr1a/power-agent

kaim-eng commented May 18, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

dynamo-ops May 18, 2026

Uh oh!

grahamking May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kaim-eng commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scope

Reviewer onboarding

Tests at this tip

Merge strategy

Stacked-PR cascade

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

dynamo-ops May 18, 2026

Choose a reason for hiding this comment

Uh oh!

grahamking May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kaim-eng commented May 18, 2026 •

edited

Loading