init: helm charts by whoisj · Pull Request #5 · ai-dynamo/dynamo

whoisj · 2025-03-04T16:37:06Z

Adds a basic Helm chart for a generic dynemo container.
Lack specific support for API server vs. worker, and OpenAI vs. VLLM vs. TRTLLM, etc.
Future iteration may include specific charts for those topics.
Should be nominally compatible with upcoming Compound AI tooling while retaining flexibility required by sophisticated customers.

Adds a basic test framework for validating Helm Chart using helm template w/ regular expressions.

Adds a set of 3 basic tests:

basic positive test
simple invalid values test
simple GPU resource specification test

Reviewer Instructions

Please focus on the following files:

deploy/Kubernetes/common/chart/Chart.yaml
deploy/Kubernetes/common/chart/values.yaml
deploy/Kubernetes/common/chart/templates/_helpers.tpl
deploy/Kubernetes/common/chart/templates/component-deployment.yaml

Other code is less important, but reviews are welcome
... and (mostly) ignore / accept the pwsh code for now.

DLIS-8146

whoisj · 2025-03-04T17:02:09Z

@hutm, ideally you can offload the review effort to Julien Mancuso. I would add him, but I do not know his Github alias and the search feature isn't finding it for me. Thanks.

note: Julien has since been added to the repository and to the reviewers list for this PR.

github-actions · 2025-03-04T17:13:51Z

Test Results

2 files 2 suites 25s ⏱️
73 tests 73 ✅ 0 💤 0 ❌
91 runs 90 ✅ 1 💤 0 ❌

Results for commit 00b7104.

♻️ This comment has been updated with latest results.

ishandhanani

Looks good to me. I would wait for Julien to give the final review before we get this in. Thanks for cooking on this J!

julienmancuso

lgtm

Adds a basic Helm chart for a generic dynemo container. Lack specific support for API server vs. worker, and OpenAI vs. VLLM vs. TRTLLM, etc. Future iteration may include specific charts for those topics. Should be nominally compatible with upcoming Compound AI tooling while retaining flexibility required by sophisticated customers. Adds a basic test framework for validating Helm Chart using `helm template` w/ regular expressions. Adds a set of 3 basic tests: - basic positive test - simple invalid values test - simple GPU resource specification test

This chang excludes YAML files contained in a Helm chart's templates/ directory as they'll never pass a simple-minded YAML linter.

) (#5…" This reverts commit 2a315ec.

Add agg_vision.sh for single-GPU aggregated multimodal serving with Qwen2.5-VL. Fix compliance test #5 to include required "detail" field on input_image content per the OpenResponses spec. Tested: 5/6 compliance tests pass against Qwen2.5-VL-7B-Instruct (tool calling fails as expected -- VL model lacks structured tool use). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fixes all actionable items from the second review: Bug fixes: - #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED) - #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original) Test quality: - #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE) - #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure) Design / correctness: - #3: Fix _models_dir_env docstring ("exactly once" → "once per worker") - #4: Add comment noting TRANSFORMERS_CACHE deprecation - #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare - #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture - #11: Raise hub/ detection log to WARNING with guidance - #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>

Bug fixes: - #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache - #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout (bare-layout path was missing the writable-dir check present in HF_HOME test) Minor cleanups: - #4: Move `import pytest` from inside download_lora() to module top-level (lora_utils.py is test-only infra; pytest is always available) - #5: Replace pytestconfig.getoption("--models-dir") re-checks in predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR") (_models_dir_env runs first and sets the var; single source of truth) New coverage tests: - #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test verifying pytest_configure exits with returncode=2 on bad path - #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora() raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set Not addressed: #3 (keep gpu_0 per project guidelines and previous review retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>

CodeRabbit flagged the new tests as carrying internal Linear ticket references in their doc-comments — repo policy forbids Linear refs in source code. Replaced each occurrence with neutral wording: - DIS-1842 work-item #1 → cross-parser tool_choice parametrisation work-item (tracked separately) - DIS-1842 work-item #5 → cross-parser finish_reason mapping work-item (tracked separately) - Universal gap noted in DIS-1842 → Universal gap noted in the test taxonomy Linear refs remain in the PR description (allowed per repo convention). 424 parser tests + 16 tool_choice tests pass. Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

…ery path The previous commit switched the PyO3 batch binding to detect_and_parse_tool_call_with_recovery (Ayush #5), which legitimately changed Dynamo's behavior on missing-end-token / malformed inputs: the recovery path now extracts a tool call where the streaming-safe path used to drop it. * Updated 5 fixture expected values to match Dynamo's recovery output: glm47/PARSER.batch.5 minimax_m2/PARSER.batch.5 nemotron_deci/PARSER.batch.4 nemotron_deci/PARSER.batch.5 qwen3_coder/PARSER.batch.5 * Added vllm/sglang KNOWN_DIVERGENCES entries for the same cases since those parsers still drop where Dynamo now recovers (impl-defined recovery contract per PARSER_CASES.md). sglang/qwen3_coder/batch.5 intentionally omitted — sglang's qwen3_coder detector recovers identically to Dynamo and matches the new expected. Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping: makes the dev environment shareable across teammates by removing personal identifiers, parameterizing all dev-pod / probe references, and folding 11 review fixes into the three Phase 1-4 design documents. Dev-env hardening ----------------- * Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset: - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline usage instructions. - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime. - 5 `scripts/inspect_*.py` cluster probes parameterized via `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit) rather than a hard-coded namespace. - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...` one-liner showing how to discover the right node. * `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally not committed" convention (matches `examples/deployments/powerplanner/ .tmp-gp-minimal.yaml`) and to block the four common root-level personal-scratch files from sneaking back in via `git add .`. dpp-dev-env.md updates ---------------------- * All 10 path references rewritten to point at the new `deploy/planner/dev/` and `scripts/dev/` homes. * New §5 ("Deploy the Dev Pod") subsection documenting the `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an `envsubst` (Linux/WSL) path and a Windows edit-in-place path. * Quick Deploy Checklist filename corrected from the stale-DGDR `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a cross-reference to the One-Time Setup §4 warning. * DGD-ready wait command standardized to the programmatic `kubectl wait --for=jsonpath='{.status.state}'=successful` form in both §4 and the Checklist (removes the manual-`-w` divergence). Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md) ------------------------------------------------------------------------------- * powerplanner-design.md - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 - 590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still draft.` with last-validation date. - §3.1: added an explicit note that `aic_interpolation` and `mode` are pre-existing PlannerConfig fields (owned by `monitoring/aic_interpolation.py`) and intentionally absent from the Names Registry. - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so it correctly points at the *throughput regression* case rather than reading as a blanket "config revert". - §6.5 pseudocode: replaced module-level `self.device_count` with `pynvml.nvmlDeviceGetCount()` (the original was syntactically incorrect outside the daemon class). - §13 Open Question #11: corrected the duplicated `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching §5.3. * powerplanner-testbed-design.md - §7: added a "Numeric-suffix convention" note explaining the intentional D21/E21 ID collision (IDs unique by filename + tuple, not renumbered). - §11: corrected "Six guards" -> "Seven guards" to match the seven items actually listed. - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`. - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper + 30 parametrized) so the listing matches the actual run. Verification ------------ * Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented Windows baseline; gamma auto-skipped without the Rust mocker). * AIC no-cluster integration (test_aic_power_optimizer.py + test_aic_power_e2e_sim.py): 49 passed. * Unit tests: 456 passed; the 9 remaining failures are pre-existing Windows-only environment limitations (cp1252 codec, `os.killpg` POSIX-only, missing `filterpy`) confirmed via `git stash` to be untouched by this commit. * All touched .py files: `python3.10 -m py_compile` clean. * All touched .yaml files: `yaml.safe_load_all` clean. * `ReadLints` over all 12 touched files: no errors. * Final `rg "kaim|aks-a100a-36888584"` outside committed `tests/fault_tolerance/...` (pre-existing, separate component) and `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits. Co-authored-by: Cursor <cursoragent@cursor.com>

…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale). New section order: 1. Title + 3-config table 2. Pre-requisites (was #3) 3. Quick start (was #8 — promoted) 4. Directory layout (was #7 — now serves as map for the rest) 5. Hardware targets (was #2 — now pure reference; invocation examples moved into Quick start) 6. Storage (was #5) 7. aiperf install (was #6) 8. Naming & ownership (was #4) 9. Notes (unchanged) Also drop the stray "We use ebs by default" sentence — it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping: makes the dev environment shareable across teammates by removing personal identifiers, parameterizing all dev-pod / probe references, and folding 11 review fixes into the three Phase 1-4 design documents. Dev-env hardening ----------------- * Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset: - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline usage instructions. - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime. - 5 `scripts/inspect_*.py` cluster probes parameterized via `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit) rather than a hard-coded namespace. - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...` one-liner showing how to discover the right node. * `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally not committed" convention (matches `examples/deployments/powerplanner/ .tmp-gp-minimal.yaml`) and to block the four common root-level personal-scratch files from sneaking back in via `git add .`. dpp-dev-env.md updates ---------------------- * All 10 path references rewritten to point at the new `deploy/planner/dev/` and `scripts/dev/` homes. * New §5 ("Deploy the Dev Pod") subsection documenting the `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an `envsubst` (Linux/WSL) path and a Windows edit-in-place path. * Quick Deploy Checklist filename corrected from the stale-DGDR `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a cross-reference to the One-Time Setup §4 warning. * DGD-ready wait command standardized to the programmatic `kubectl wait --for=jsonpath='{.status.state}'=successful` form in both §4 and the Checklist (removes the manual-`-w` divergence). Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md) ------------------------------------------------------------------------------- * powerplanner-design.md - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 - 590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still draft.` with last-validation date. - §3.1: added an explicit note that `aic_interpolation` and `mode` are pre-existing PlannerConfig fields (owned by `monitoring/aic_interpolation.py`) and intentionally absent from the Names Registry. - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so it correctly points at the *throughput regression* case rather than reading as a blanket "config revert". - §6.5 pseudocode: replaced module-level `self.device_count` with `pynvml.nvmlDeviceGetCount()` (the original was syntactically incorrect outside the daemon class). - §13 Open Question #11: corrected the duplicated `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching §5.3. * powerplanner-testbed-design.md - §7: added a "Numeric-suffix convention" note explaining the intentional D21/E21 ID collision (IDs unique by filename + tuple, not renumbered). - §11: corrected "Six guards" -> "Seven guards" to match the seven items actually listed. - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`. - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper + 30 parametrized) so the listing matches the actual run. Verification ------------ * Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented Windows baseline; gamma auto-skipped without the Rust mocker). * AIC no-cluster integration (test_aic_power_optimizer.py + test_aic_power_e2e_sim.py): 49 passed. * Unit tests: 456 passed; the 9 remaining failures are pre-existing Windows-only environment limitations (cp1252 codec, `os.killpg` POSIX-only, missing `filterpy`) confirmed via `git stash` to be untouched by this commit. * All touched .py files: `python3.10 -m py_compile` clean. * All touched .yaml files: `yaml.safe_load_all` clean. * `ReadLints` over all 12 touched files: no errors. * Final `rg "kaim|aks-a100a-36888584"` outside committed `tests/fault_tolerance/...` (pre-existing, separate component) and `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits. Co-authored-by: Cursor <cursoragent@cursor.com>

Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping: makes the dev environment shareable across teammates by removing personal identifiers, parameterizing all dev-pod / probe references, and folding 11 review fixes into the three Phase 1-4 design documents. Dev-env hardening ----------------- * Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset: - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline usage instructions. - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime. - 5 `scripts/inspect_*.py` cluster probes parameterized via `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit) rather than a hard-coded namespace. - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...` one-liner showing how to discover the right node. * `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally not committed" convention (matches `examples/deployments/powerplanner/ .tmp-gp-minimal.yaml`) and to block the four common root-level personal-scratch files from sneaking back in via `git add .`. dpp-dev-env.md updates ---------------------- * All 10 path references rewritten to point at the new `deploy/planner/dev/` and `scripts/dev/` homes. * New §5 ("Deploy the Dev Pod") subsection documenting the `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an `envsubst` (Linux/WSL) path and a Windows edit-in-place path. * Quick Deploy Checklist filename corrected from the stale-DGDR `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a cross-reference to the One-Time Setup §4 warning. * DGD-ready wait command standardized to the programmatic `kubectl wait --for=jsonpath='{.status.state}'=successful` form in both §4 and the Checklist (removes the manual-`-w` divergence). Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md) ------------------------------------------------------------------------------- * powerplanner-design.md - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 - 590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still draft.` with last-validation date. - §3.1: added an explicit note that `aic_interpolation` and `mode` are pre-existing PlannerConfig fields (owned by `monitoring/aic_interpolation.py`) and intentionally absent from the Names Registry. - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so it correctly points at the *throughput regression* case rather than reading as a blanket "config revert". - §6.5 pseudocode: replaced module-level `self.device_count` with `pynvml.nvmlDeviceGetCount()` (the original was syntactically incorrect outside the daemon class). - §13 Open Question #11: corrected the duplicated `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching §5.3. * powerplanner-testbed-design.md - §7: added a "Numeric-suffix convention" note explaining the intentional D21/E21 ID collision (IDs unique by filename + tuple, not renumbered). - §11: corrected "Six guards" -> "Seven guards" to match the seven items actually listed. - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`. - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper + 30 parametrized) so the listing matches the actual run. Verification ------------ * Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented Windows baseline; gamma auto-skipped without the Rust mocker). * AIC no-cluster integration (test_aic_power_optimizer.py + test_aic_power_e2e_sim.py): 49 passed. * Unit tests: 456 passed; the 9 remaining failures are pre-existing Windows-only environment limitations (cp1252 codec, `os.killpg` POSIX-only, missing `filterpy`) confirmed via `git stash` to be untouched by this commit. * All touched .py files: `python3.10 -m py_compile` clean. * All touched .yaml files: `yaml.safe_load_all` clean. * `ReadLints` over all 12 touched files: no errors. * Final `rg "kaim|aks-a100a-36888584"` outside committed `tests/fault_tolerance/...` (pre-existing, separate component) and `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

…cycle Address senior-architect review of the metrics infrastructure: #1 (two parallel Prometheus paths): merge `register_prometheus` + `setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings` method on the Rust `LLMEngine` trait. The trait surface shrinks from 7 → 6 methods; both expfmt bridging and the structured component publisher go through one place. The Python ABC keeps its two methods — the PyO3 bridge adapts (engine-author API unchanged for Python). #3 + #8 (framework lifecycle gauges gated on engine opt-in): - `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side `LifecycleGauges` (prometheus crate gauges registered via the runtime's `MetricsRegistry`). Worker constructs them after `engine.start()` succeeds; observes during cleanup/drain. Emits regardless of engine opt-in. - Drop `set_cleanup_time` / `set_drain_time` from the `ComponentMetricsPublisher` trait — engine no longer responsible. - Drop the corresponding methods from `PyComponentMetricsPublisher`. - `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for parity with the legacy entry points (both legacy `main.py` and the unified bridge populate it). The unified bridge now constructs `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when the engine returns no per-rank sources. #5 (kv_cache_hit_rate Optional semantics): clearer docstring on both Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly documented (`None` = no data, `0.0` = legitimate zero-hit). Deferred with rationale (explained in the post-review architect's note): - #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing but no current bug; 5 fields for 5 signals isn't worth the refactor cost yet. Revisit at ~10+ signals. - #4 (per-source cadence): explicit defer in original review. - #7 (conformance in CI): workflow change, separate from trait surface. Conformance kit collapses `check_register_prometheus` + `check_component_metrics` into `check_setup_metrics`. All 68 backend-common unit tests pass. Mocker example updated. PyO3 bridge wheel rebuilt successfully. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping: makes the dev environment shareable across teammates by removing personal identifiers, parameterizing all dev-pod / probe references, and folding 11 review fixes into the three Phase 1-4 design documents. Dev-env hardening ----------------- * Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset: - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline usage instructions. - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime. - 5 `scripts/inspect_*.py` cluster probes parameterized via `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit) rather than a hard-coded namespace. - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...` one-liner showing how to discover the right node. * `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally not committed" convention (matches `examples/deployments/powerplanner/ .tmp-gp-minimal.yaml`) and to block the four common root-level personal-scratch files from sneaking back in via `git add .`. dpp-dev-env.md updates ---------------------- * All 10 path references rewritten to point at the new `deploy/planner/dev/` and `scripts/dev/` homes. * New §5 ("Deploy the Dev Pod") subsection documenting the `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an `envsubst` (Linux/WSL) path and a Windows edit-in-place path. * Quick Deploy Checklist filename corrected from the stale-DGDR `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a cross-reference to the One-Time Setup §4 warning. * DGD-ready wait command standardized to the programmatic `kubectl wait --for=jsonpath='{.status.state}'=successful` form in both §4 and the Checklist (removes the manual-`-w` divergence). Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md) ------------------------------------------------------------------------------- * powerplanner-design.md - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 - 590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still draft.` with last-validation date. - §3.1: added an explicit note that `aic_interpolation` and `mode` are pre-existing PlannerConfig fields (owned by `monitoring/aic_interpolation.py`) and intentionally absent from the Names Registry. - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so it correctly points at the *throughput regression* case rather than reading as a blanket "config revert". - §6.5 pseudocode: replaced module-level `self.device_count` with `pynvml.nvmlDeviceGetCount()` (the original was syntactically incorrect outside the daemon class). - §13 Open Question #11: corrected the duplicated `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching §5.3. * powerplanner-testbed-design.md - §7: added a "Numeric-suffix convention" note explaining the intentional D21/E21 ID collision (IDs unique by filename + tuple, not renumbered). - §11: corrected "Six guards" -> "Seven guards" to match the seven items actually listed. - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`. - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper + 30 parametrized) so the listing matches the actual run. Verification ------------ * Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented Windows baseline; gamma auto-skipped without the Rust mocker). * AIC no-cluster integration (test_aic_power_optimizer.py + test_aic_power_e2e_sim.py): 49 passed. * Unit tests: 456 passed; the 9 remaining failures are pre-existing Windows-only environment limitations (cp1252 codec, `os.killpg` POSIX-only, missing `filterpy`) confirmed via `git stash` to be untouched by this commit. * All touched .py files: `python3.10 -m py_compile` clean. * All touched .yaml files: `yaml.safe_load_all` clean. * `ReadLints` over all 12 touched files: no errors. * Final `rg "kaim|aks-a100a-36888584"` outside committed `tests/fault_tolerance/...` (pre-existing, separate component) and `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

whoisj requested a review from hutm as a code owner March 4, 2025 16:37

whoisj temporarily deployed to GITLAB March 4, 2025 16:37 — with GitHub Actions Inactive

whoisj force-pushed the whoisj/init-helm-chart branch from b1e4d86 to 605f0e9 Compare March 4, 2025 16:44

whoisj temporarily deployed to GITLAB March 4, 2025 16:44 — with GitHub Actions Inactive

whoisj temporarily deployed to GITLAB March 4, 2025 16:45 — with GitHub Actions Inactive

whoisj requested a review from julienmancuso March 4, 2025 18:09

julienmancuso reviewed Mar 4, 2025

View reviewed changes

Comment thread deploy/Kubernetes/common/chart/templates/_helpers.tpl Outdated

ishandhanani approved these changes Mar 4, 2025

View reviewed changes

julienmancuso reviewed Mar 4, 2025

View reviewed changes

Comment thread deploy/Kubernetes/common/chart/values.yaml Outdated

whoisj force-pushed the whoisj/init-helm-chart branch from 605f0e9 to 7f42f6c Compare March 4, 2025 18:46

whoisj temporarily deployed to GITLAB March 4, 2025 18:46 — with GitHub Actions Inactive

whoisj temporarily deployed to GITLAB March 4, 2025 18:48 — with GitHub Actions Inactive

whoisj requested a review from julienmancuso March 4, 2025 19:00

julienmancuso approved these changes Mar 4, 2025

View reviewed changes

whoisj temporarily deployed to GITLAB March 4, 2025 21:09 — with GitHub Actions Inactive

whoisj temporarily deployed to GITLAB March 4, 2025 21:10 — with GitHub Actions Inactive

hutm approved these changes Mar 4, 2025

View reviewed changes

whoisj added 2 commits March 4, 2025 16:11

ci: exclude Helm Chart YAML files from check

00b7104

This chang excludes YAML files contained in a Helm chart's templates/ directory as they'll never pass a simple-minded YAML linter.

whoisj force-pushed the whoisj/init-helm-chart branch from 7f42f6c to 00b7104 Compare March 4, 2025 21:11

whoisj requested a review from biswapanda as a code owner March 4, 2025 21:11

whoisj temporarily deployed to GITLAB March 4, 2025 21:11 — with GitHub Actions Inactive

whoisj temporarily deployed to GITLAB March 4, 2025 21:15 — with GitHub Actions Inactive

whoisj temporarily deployed to GITLAB March 4, 2025 21:30 — with GitHub Actions Inactive

whoisj temporarily deployed to GITLAB March 4, 2025 21:31 — with GitHub Actions Inactive

whoisj merged commit c909567 into main Mar 4, 2025

whoisj deleted the whoisj/init-helm-chart branch March 4, 2025 21:34

piotrm-nvidia mentioned this pull request Mar 26, 2025

ERROR _core: Failed to send response: SendError #401

Closed

kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025

init: helm charts (ai-dynamo#5)

59d916e

dagil-nvidia mentioned this pull request Oct 21, 2025

feat: Add automated dependency version tracking and extraction #3547

Closed

coderabbitai Bot mentioned this pull request Nov 5, 2025

feat: Update Dynamo k8s deployment example to use ModelExpress #4112

Closed

jthomson04 added a commit that referenced this pull request Jan 22, 2026

Revert "feat: (cherrypick) Support Dynamo KVBM with TRTLLM Disagg (#3527

56504bc

) (#5…" This reverts commit 2a315ec.

tanmayv25 mentioned this pull request Apr 15, 2026

DEP: Backend Interface -- LLMEngine ABC and Worker #8251

Open

PeaBrane mentioned this pull request Apr 20, 2026

feat(mocker): KVBM G2 offload for on/offline replay #8184

Merged

atchernych mentioned this pull request Apr 23, 2026

Proposal: Hybrid Priority Scheduling in Dynamo #8580

Open

This was referenced Apr 24, 2026

[INVESTIGATE] sglang + DSv4: kwarg workaround breaks live — needs diagnosis #8671

Closed

test(parsers): DIS-1842 — DSv4 + Kimi K2 unit-test coverage gaps #8946

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

init: helm charts#5

init: helm charts#5
whoisj merged 2 commits into
mainfrom
whoisj/init-helm-chart

whoisj commented Mar 4, 2025

Uh oh!

whoisj commented Mar 4, 2025 •

edited

Loading

Uh oh!

github-actions Bot commented Mar 4, 2025 •

edited

Loading

Uh oh!

Uh oh!

ishandhanani left a comment

Uh oh!

Uh oh!

julienmancuso left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

whoisj commented Mar 4, 2025

Uh oh!

whoisj commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

Uh oh!

ishandhanani left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

julienmancuso left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

whoisj commented Mar 4, 2025 •

edited

Loading

github-actions Bot commented Mar 4, 2025 •

edited

Loading