Skip to content

feat(power-agent): per-node power-cap enforcement DaemonSet#9682

Open
kaim-eng wants to merge 3 commits into
mainfrom
pr1a/power-agent
Open

feat(power-agent): per-node power-cap enforcement DaemonSet#9682
kaim-eng wants to merge 3 commits into
mainfrom
pr1a/power-agent

Conversation

@kaim-eng
Copy link
Copy Markdown

@kaim-eng kaim-eng commented May 18, 2026

Part of the PR #9369 split plan.
This is PR 1 of 6 (PR 1a — Power Agent).

Predecessor: none (base = main)
Successor: #9683 (depends on this PR via stacked branch pr1b/planner-infra)

Scope

Standalone Power Agent DaemonSet — per-node GPU power-cap enforcement.
~1,500 lines · 10 files · all net-new, zero modifications to existing files.
Zero import dynamo.planner.*. Reviewable in isolation from the planner.

  • components/power_agent/power_agent.py (529 LOC) — NVML clamp helper, cgroup parser, multi-pod policy, SIGTERM handler, UUID-gated orphan-cap restoration
  • components/power_agent/tests/ — 43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy, shutdown
  • deploy/power_agent/{daemonset,dev-pod,rbac}.yaml — Kubernetes manifests
  • .github/filters.yaml — planner CI filter extended to power_agent paths

Reviewer onboarding

  • Design context: docs/design-docs/powerplanner-design.md §7 — per-node power-cap enforcement (lands in PR 5; readable from this branch tip too via git show pr5/docs-devenv:docs/design-docs/powerplanner-design.md)
  • Annotation contract: dynamo.nvidia.com/gpu-power-limit is the only coupling to the planner
  • Plan sections: §2.1 (this PR), §2.1.2 (standalone-component property)

Tests at this tip

  • components/power_agent/tests/43 passed in 0.73 s on dev pod (2026-05-18)
  • Pre-commit on origin/main..pr1a/power-agent: 14 hooks pass, 4 skipped (no matching files). pytest-marker-report reconfirmed in pod with Missing sets: 0. The Windows-host run shows a known pre-existing fcntl import error in tests/utils/port_utils.py:10 (POSIX-only stdlib in an untouched file) — plan §7 acknowledges this; CI runs on Linux are unaffected.

Merge strategy

Rebase-and-merge (no squash) — preserves 1 commit in main history per plan §4.3. Please configure this on the PR before merging.

Stacked-PR cascade

Per plan §4.5: this PR is Ready for Review alongside PR 1b (#9683). PRs 2–5 are held in Draft until both foundations approve. The two foundation PRs touch disjoint files; either can land first.

@kaim-eng kaim-eng requested review from a team as code owners May 18, 2026 15:53
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Walkthrough

This PR introduces a complete Kubernetes DaemonSet agent that enforces per-GPU NVML power caps on worker nodes. The daemon reconciles every 15 seconds, mapping running GPU processes to pod UIDs via cgroup parsing, reading power-limit annotations, resolving multi-pod conflicts, and applying caps via NVML with Prometheus metrics and graceful shutdown support.

Changes

Power Agent: GPU NVML Power Cap Enforcement DaemonSet

Layer / File(s) Summary
Module foundation and utilities
components/power_agent/power_agent.py (lines 1–213)
Module header, optional lazy imports for NVML/Kubernetes/Prometheus with fallbacks, logging, constants, cgroup parsing regex for extracting pod UIDs from /proc/{pid}/cgroup, atomic JSON persistence for managed GPU state, UUID normalization for pynvml version compatibility, and Prometheus metrics infrastructure with noop fallback when unavailable.
NVML cap application and multi-pod policy
components/power_agent/power_agent.py (lines 220–404), components/power_agent/tests/test_apply_cap.py, components/power_agent/tests/test_multi_pod_policy.py
NVML helper functions to clamp watts to device min/max constraints and apply power caps with logging and metric updates; multi-pod-per-GPU resolution policy that applies an agreed cap when all pods match, otherwise applies safe default on conflicts or missing/invalid annotations, incrementing error counters; unit tests validating clamping boundaries, NVML error tolerance, UUID handling (bytes vs str), and all policy scenarios (single pod, agreement, conflict, invalid annotations, mixed None/valid).
Shutdown and startup recovery
components/power_agent/power_agent.py (lines 280–337), components/power_agent/tests/test_shutdown.py
SIGTERM/SIGINT handler restores managed GPU power limits to NVML defaults and signals main loop exit; startup orphan recovery loads persisted GPU UUIDs, checks idle status, restores default caps for idle GPUs, and updates persisted state; unit tests validating per-GPU restoration, graceful shutdown with no managed GPUs, and NVML error handling.
PowerAgent orchestration
components/power_agent/power_agent.py (lines 411–577)
PowerAgent class initializes NVML/Kubernetes clients, loads orphan state, lists node pods, maps running PIDs to pod UIDs, builds UID-to-annotation mappings, resolves GPU cap via policy, applies caps via NVML, and persists managed state; blocking reconcile loop runs every 15 seconds; CLI main() entry point accepts --safe-default-watts, optional node/namespace scope, and Prometheus port, then starts daemon.
Cgroup parser validation
components/power_agent/tests/test_cgroup_parser.py
Unit tests for cgroup UID extraction across cgroup v1/v2 with systemd and cgroupfs layouts, QoS classes (Guaranteed/Burstable/BestEffort), and non-Kubernetes inputs; validates OSError handling and first-match-wins behavior when multiple cgroup lines exist.
Kubernetes deployment
deploy/power_agent/rbac.yaml, deploy/power_agent/daemonset.yaml, deploy/power_agent/dev-pod.yaml
RBAC ServiceAccount, ClusterRole with pod read-access, and ClusterRoleBinding; production DaemonSet runs on GPU-labeled nodes with hostPID, privileged container, nvidia runtime, environment variables for safe default (500 W) and node name, resource limits, and volume mounts for /proc (read-only) and persisted state at /var/lib/dynamo-power-agent; development Pod harness for manual testing and validation on live clusters.
Documentation and CI integration
components/power_agent/README.md, .github/filters.yaml
README documents Power Agent's 15-second reconciliation, cgroup UID extraction, annotation reading, NVML power cap enforcement, troubleshooting steps (annotations, logs), Prometheus metrics (applied limit, conflict/safe-default/failure counters), and shutdown/startup semantics; CI filter update triggers planner jobs on power_agent component and deployment changes.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: introduction of a per-node power-cap enforcement DaemonSet for the Power Agent.
Description check ✅ Passed The PR description is comprehensive and covers all required template sections: overview (PR 1 of 6, split from #9369), detailed scope, reviewer onboarding, test results, and merge strategy. Some optional details like specific file callouts are included.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
components/power_agent/power_agent.py (1)

537-539: ⚡ Quick win

Move argparse import to module scope.

Line 538 imports inside main(), which violates the repo’s Python import placement rule.

Proposed fix
+import argparse
 import json
 import logging
@@
 def main() -> None:
-    import argparse
-
     parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")
As per coding guidelines: "Keep all imports at the top of each file (flag any import inside functions/classes)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 537 - 539, The argparse
import is inside the main() function; move the import to module scope by adding
"import argparse" at the top of the file and removing the in-function import in
main(), ensuring any type or usage remains correct (adjust other imports if
needed) so linting and the repo rule about top-level imports are satisfied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 119-124: The function _load_previously_managed_gpus should
defensively handle malformed or unreadable state files: when reading
_MANAGED_STATE_PATH, catch OSError in addition to
FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data =
json.load(f)), verify isinstance(data, dict) before calling
data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids
is not an iterable of strings) return an empty set; ensure any
TypeError/ValueError from unexpected types is handled and results in returning
set() rather than letting startup crash.
- Around line 294-297: Replace the broad except that swallows errors from the
pynvml.nvmlShutdown() call with a targeted handler: catch pynvml.NVMLError (or
the specific pynvml error class) as e, log the exception with the module/class
logger (e.g., logger.exception("NVML shutdown failed: %s", e) or
logging.exception(...)) and then re-raise the error to avoid silent failures;
modify the try/except around pynvml.nvmlShutdown() accordingly so only
NVML-specific errors are caught, logged, and propagated.
- Around line 494-501: The loop that builds pod_annotations appends one entry
per GPU process, causing pods with multiple processes to be counted multiple
times; change the logic in the procs loop (using _extract_pod_uid_from_cgroup
and uid_to_annotation) to deduplicate by pod UID before applying multi-pod
policy — e.g., track seen UIDs (or build a uid->annotation map) and only append
one (uid, annotation) pair per unique UID so each pod is counted once when
evaluating multi-pod warnings/metrics.

In `@deploy/power_agent/daemonset.yaml`:
- Around line 16-35: The DaemonSet metadata.namespace is hardcoded to "default",
which causes mismatches with the RBAC/service account namespace; update the
manifest to use the parameterized namespace variable (e.g.
${POWER_AGENT_NAMESPACE}) instead of "default" and ensure the ServiceAccount
referenced by spec.serviceAccountName ("dynamo-power-agent") is created/bound in
that same parameterized namespace so RBAC bindings resolve correctly; locate
metadata.name ("dynamo-power-agent"), metadata.namespace, and
spec.serviceAccountName in the template to make the change.
- Around line 55-58: Replace the mutable image tag
"nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference (a
specific version tag or an image digest), e.g.
"nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or
"nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are
reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate
for pinned images and update the inline comment to note the image is
intentionally pinned.

---

Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 537-539: The argparse import is inside the main() function; move
the import to module scope by adding "import argparse" at the top of the file
and removing the in-function import in main(), ensuring any type or usage
remains correct (adjust other imports if needed) so linting and the repo rule
about top-level imports are satisfied.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f9faddf8-30e0-4a7b-9055-b45244b2a979

📥 Commits

Reviewing files that changed from the base of the PR and between f20ff4e and 3cde449.

📒 Files selected for processing (11)
  • .github/filters.yaml
  • components/power_agent/README.md
  • components/power_agent/power_agent.py
  • components/power_agent/tests/__init__.py
  • components/power_agent/tests/test_apply_cap.py
  • components/power_agent/tests/test_cgroup_parser.py
  • components/power_agent/tests/test_multi_pod_policy.py
  • components/power_agent/tests/test_shutdown.py
  • deploy/power_agent/daemonset.yaml
  • deploy/power_agent/dev-pod.yaml
  • deploy/power_agent/rbac.yaml

Comment on lines +119 to +124
def _load_previously_managed_gpus() -> set[str]:
try:
with open(_MANAGED_STATE_PATH) as f:
return set(json.load(f).get("managed_uuids", []))
except (FileNotFoundError, json.JSONDecodeError):
return set()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden managed-state parsing to avoid startup crashes on malformed file content.

Line 122 assumes the JSON root is a dict; if the file contains a list/string/corrupt schema, .get(...) raises and can abort startup. Also unreadable-file OSError cases (e.g., permissions) are currently uncaught.

Proposed fix
 def _load_previously_managed_gpus() -> set[str]:
     try:
         with open(_MANAGED_STATE_PATH) as f:
-            return set(json.load(f).get("managed_uuids", []))
-    except (FileNotFoundError, json.JSONDecodeError):
+            payload = json.load(f)
+    except (OSError, json.JSONDecodeError) as e:
+        logger.warning("Failed to read managed GPU state file: %s", e)
         return set()
+
+    if not isinstance(payload, dict):
+        logger.warning("Managed GPU state file has invalid root type: %s", type(payload))
+        return set()
+    managed = payload.get("managed_uuids", [])
+    if not isinstance(managed, list):
+        logger.warning("managed_uuids must be a list, got: %s", type(managed))
+        return set()
+    return {u for u in managed if isinstance(u, str)}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 119 - 124, The function
_load_previously_managed_gpus should defensively handle malformed or unreadable
state files: when reading _MANAGED_STATE_PATH, catch OSError in addition to
FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data =
json.load(f)), verify isinstance(data, dict) before calling
data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids
is not an iterable of strings) return an empty set; ensure any
TypeError/ValueError from unexpected types is handled and results in returning
set() rather than letting startup crash.

Comment on lines +294 to +297
try:
pynvml.nvmlShutdown()
except Exception:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently swallow NVML shutdown failures.

Line 296 catches all exceptions and drops them. This hides shutdown problems and complicates recovery/debugging.

Proposed fix
     try:
         pynvml.nvmlShutdown()
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning("nvmlShutdown failed during SIGTERM handling: %s", e)
As per coding guidelines: "Fail fast: avoid broad exception swallowing; catch only specific exceptions you can handle, and if you catch Exception you must log + re-raise."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
pynvml.nvmlShutdown()
except Exception:
pass
try:
pynvml.nvmlShutdown()
except Exception as e:
logger.warning("nvmlShutdown failed during SIGTERM handling: %s", e)
raise
🧰 Tools
🪛 Ruff (0.15.12)

[error] 296-297: try-except-pass detected, consider logging the exception

(S110)


[warning] 296-296: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 294 - 297, Replace the
broad except that swallows errors from the pynvml.nvmlShutdown() call with a
targeted handler: catch pynvml.NVMLError (or the specific pynvml error class) as
e, log the exception with the module/class logger (e.g., logger.exception("NVML
shutdown failed: %s", e) or logging.exception(...)) and then re-raise the error
to avoid silent failures; modify the try/except around pynvml.nvmlShutdown()
accordingly so only NVML-specific errors are caught, logged, and propagated.

Comment on lines +494 to +501
pod_annotations: list[tuple[str, Optional[str]]] = []
for proc in procs:
uid = _extract_pod_uid_from_cgroup(proc.pid)
if uid is None:
continue # non-K8s process — skip
if uid in uid_to_annotation:
pod_annotations.append((uid, uid_to_annotation[uid]))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Deduplicate by pod UID before applying multi-pod policy.

Line 500 appends one entry per GPU process. A single pod with multiple processes is then counted as “multi-pod”, skewing warnings/metrics and policy disposition.

Proposed fix
-        pod_annotations: list[tuple[str, Optional[str]]] = []
+        pod_annotations_by_uid: dict[str, Optional[str]] = {}
         for proc in procs:
             uid = _extract_pod_uid_from_cgroup(proc.pid)
             if uid is None:
                 continue  # non-K8s process — skip
             if uid in uid_to_annotation:
-                pod_annotations.append((uid, uid_to_annotation[uid]))
+                pod_annotations_by_uid[uid] = uid_to_annotation[uid]
+
+        pod_annotations = list(pod_annotations_by_uid.items())
Based on learnings: "Review this logic to avoid assuming a strict 1:1 mapping ... or extend the implementation to correctly handle multiple pods per GPU ... and pods missing power annotations."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pod_annotations: list[tuple[str, Optional[str]]] = []
for proc in procs:
uid = _extract_pod_uid_from_cgroup(proc.pid)
if uid is None:
continue # non-K8s process — skip
if uid in uid_to_annotation:
pod_annotations.append((uid, uid_to_annotation[uid]))
pod_annotations_by_uid: dict[str, Optional[str]] = {}
for proc in procs:
uid = _extract_pod_uid_from_cgroup(proc.pid)
if uid is None:
continue # non-K8s process — skip
if uid in uid_to_annotation:
pod_annotations_by_uid[uid] = uid_to_annotation[uid]
pod_annotations = list(pod_annotations_by_uid.items())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 494 - 501, The loop that
builds pod_annotations appends one entry per GPU process, causing pods with
multiple processes to be counted multiple times; change the logic in the procs
loop (using _extract_pod_uid_from_cgroup and uid_to_annotation) to deduplicate
by pod UID before applying multi-pod policy — e.g., track seen UIDs (or build a
uid->annotation map) and only append one (uid, annotation) pair per unique UID
so each pod is counted once when evaluating multi-pod warnings/metrics.

Comment thread deploy/power_agent/daemonset.yaml Outdated
Comment on lines +16 to +35
metadata:
name: dynamo-power-agent
namespace: default
labels:
app: dynamo-power-agent
spec:
selector:
matchLabels:
app: dynamo-power-agent
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
app: dynamo-power-agent
spec:
serviceAccountName: dynamo-power-agent
hostPID: true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Parameterize the DaemonSet namespace to prevent RBAC/SA mismatches.

Line 18 hardcodes default, while RBAC is set up to support ${POWER_AGENT_NAMESPACE}. Deploying outside default can break ServiceAccount resolution/binding.

Suggested fix
 metadata:
   name: dynamo-power-agent
-  namespace: default
+  namespace: ${POWER_AGENT_NAMESPACE}
   labels:
     app: dynamo-power-agent
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
metadata:
name: dynamo-power-agent
namespace: default
labels:
app: dynamo-power-agent
spec:
selector:
matchLabels:
app: dynamo-power-agent
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
app: dynamo-power-agent
spec:
serviceAccountName: dynamo-power-agent
hostPID: true
metadata:
name: dynamo-power-agent
namespace: ${POWER_AGENT_NAMESPACE}
labels:
app: dynamo-power-agent
spec:
selector:
matchLabels:
app: dynamo-power-agent
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
app: dynamo-power-agent
spec:
serviceAccountName: dynamo-power-agent
hostPID: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/power_agent/daemonset.yaml` around lines 16 - 35, The DaemonSet
metadata.namespace is hardcoded to "default", which causes mismatches with the
RBAC/service account namespace; update the manifest to use the parameterized
namespace variable (e.g. ${POWER_AGENT_NAMESPACE}) instead of "default" and
ensure the ServiceAccount referenced by spec.serviceAccountName
("dynamo-power-agent") is created/bound in that same parameterized namespace so
RBAC bindings resolve correctly; locate metadata.name ("dynamo-power-agent"),
metadata.namespace, and spec.serviceAccountName in the template to make the
change.

Comment thread deploy/power_agent/daemonset.yaml Outdated
Comment on lines +55 to +58
- name: power-agent
# Replace with your registry/image
image: nvcr.io/nvidia/dynamo/power-agent:latest
imagePullPolicy: IfNotPresent
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an immutable image reference instead of latest.

Line 57 uses a mutable tag, which makes rollouts non-reproducible and can introduce unreviewed runtime changes.

Suggested fix
-          image: nvcr.io/nvidia/dynamo/power-agent:latest
+          image: nvcr.io/nvidia/dynamo/power-agent:<release-tag>`@sha256`:<digest>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/power_agent/daemonset.yaml` around lines 55 - 58, Replace the mutable
image tag "nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference
(a specific version tag or an image digest), e.g.
"nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or
"nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are
reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate
for pinned images and update the inline comment to note the image is
intentionally pinned.

@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03
Comment thread deploy/power_agent/rbac.yaml Outdated
subjects:
- kind: ServiceAccount
name: dynamo-power-agent
namespace: ${POWER_AGENT_NAMESPACE}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manifest cannot be applied as-is because ${POWER_AGENT_NAMESPACE} is not a valid Kubernetes namespace value and kubectl apply -n ... does not substitute fields inside a ClusterRoleBinding. Fix: set a concrete namespace that matches the DaemonSet namespace or template this manifest through Helm/Kustomize/envsubst before shipping it.

@@ -0,0 +1,577 @@
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all your files, you only need these first two SPDX lines. You can delete the rest of the copyright header.

Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.

Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.

43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.

CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.

Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:

  * hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
  * mutable image :latest -> required image.tag with fail-fast validator
  * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.

Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.

components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.

Validated locally:
  helm lint                                -> 0 errors
  helm template (3 positive exercises)     -> expected resources
  helm template (3 negative exercises)     -> expected fail-fast errors
  components/power_agent/tests/            -> 43/43 passed
  .github/scripts/test-filters.js          -> 20/20 passed
  pre-commit (cross-cutting hooks)         -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
@github-actions github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label May 19, 2026
…S.txt

copyright-checks (regex requiring `#`-prefixed SPDX lines) rejected the
Go-template-comment header in _helpers.tpl and the missing header in
NOTES.txt on commit 8e490a7. Switch both to the same 2-line
`#`-prefixed form already used by the other 9 chart files, aligning
with @grahamking's PR review preference (minimal SPDX, no Apache
boilerplate).

For NOTES.txt this adds two visible lines to the post-`helm install`
banner, matching the project convention established by
deploy/helm/charts/platform/templates/NOTES.txt.

For _helpers.tpl the header sits outside any `{{- define -}}` block, so
it is never invoked at render time and is invisible to consumers
despite not being inside a Go template comment.

Re-validated locally:
  helm lint                              -> 0 errors
  helm template (default mode)           -> 4 expected kinds
  pre-commit on the two touched files    -> all applicable hooks pass

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682''s Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682''s Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants