
chore(recipe): bump dynamo-platform from 0.9.x to 1.0.1 #459

Open
Jont828 wants to merge 6 commits into NVIDIA:main from Jont828:chore/bump-dynamo-platform-1.0.1


Jont828 (Contributor) commented Mar 24, 2026

Summary

  • Upgrade dynamo-platform from 0.9.x to the latest 1.0.1 release across the component registry and all 5 Dynamo inference overlay recipes
  • Rewrite recipes/components/dynamo-platform/values.yaml for the 1.0 Helm schema: use global.* subchart controls, set upgradeCRD: true, and remove the stale image pin and the kube-rbac-proxy workaround (both fixed upstream)
  • Leave the dynamo-crds version intentionally unchanged: no 1.0 CRD chart exists, and the platform chart now bundles the CRDs via upgradeCRD

Test plan

  • go test -race ./pkg/recipe/... -count=1 — passes
  • go test -race ./pkg/bundler/... -count=1 — passes
  • make test — all tests pass, coverage 72%+
  • make lint (golangci-lint + yamllint on changed files) — clean
  • KWOK e2e with a Dynamo overlay (make kwok-e2e RECIPE=h100-eks-ubuntu-inference-dynamo)
  • Deploy to AKS/EKS cluster and verify dynamo-platform 1.0.1 chart renders correctly with global.* keys

🤖 Generated with Claude Code

Upgrade Dynamo platform to the latest 1.0.1 release across the registry
and all inference overlay recipes. Key changes for the 1.0 schema:

- Registry: defaultVersion 0.9.1 → 1.0.1
- Overlays: all 5 dynamo overlays updated from 0.9.0 → 1.0.1
- Values: rewritten for the 1.0 Helm schema (global.* subchart controls,
  upgradeCRD: true; removed the stale image pins and the kube-rbac-proxy
  workaround, both fixed upstream)

Signed-off-by: Jont828 <jt572@cornell.edu>

copy-pr-bot (bot) commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dims (Collaborator) commented Mar 31, 2026

/ok to test 852739e

ayuskauskas requested a review from a team as a code owner on March 31, 2026 at 17:30

Jont828 (Contributor, Author) commented Mar 31, 2026

@dims I merged the latest changes from main. Can we get another CI run?

dims (Collaborator) commented Mar 31, 2026

/ok to test 852739e
/ok to test 7330295
/ok to test 6793278

dims (Collaborator) commented Apr 2, 2026

/ok to test 852739e
/ok to test 7330295
/ok to test 6793278
/ok to test 652031e

yuanchen8911 (Contributor) left a comment

(Superseded by cross-review below)

yuanchen8911 (Contributor) commented Apr 9, 2026

Cross-Review Summary for PR #459

Reviewers: Claude Code, Codex, CodeRabbit + Integration Analysis
Rounds: 1 + Codex follow-up
Consensus reached: Yes

Confirmed Issues

| # | Severity | Finding | Confirmed By |
|---|----------|---------|--------------|
| 1 | High | Missed overlay: gb200-oke-ubuntu-inference-dynamo.yaml not updated. There are 6 Dynamo overlay files, but the PR only updates 5: gb200-oke-ubuntu-inference-dynamo.yaml#L49-L53 still pins dynamo-platform at "0.9.0". The shared values.yaml has been rewritten for 1.0.1 (global.* subchart controls), so deploying this overlay would pair the 0.9.0 chart with 1.0.1 values. The 0.9.0 chart silently ignores the global.* keys, which causes incorrect behavior (etcd/NATS/grove subcharts unexpectedly enabled). | Claude Code + CodeRabbit + Integration |
| 2 | High | Grove deployment appears dropped; the migration may be incomplete. The new values.yaml#L31-L35 sets global.grove.install: false with a comment saying grove is "managed as a separate AICR component." However, no standalone grove component exists in registry.yaml, and no Dynamo overlay declares a grove dependency. The old values had grove: enabled: true, which deployed grove as a subchart. With the subchart now disabled and no external replacement visible in this repo, grove may no longer be deployed. The stale grove.nodeSelector/grove.tolerations paths in registry.yaml#L376-L379 are a symptom of the same gap. (Caveat: the 1.0.1 chart source has not been fetched; if 1.0.1 internalized grove, this finding may not apply.) | Codex follow-up + orchestrator verification |
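
As a rough illustration of how the drift in finding 1 could be caught mechanically, the Go sketch below compares each overlay's pinned dynamo-platform version against an expected version such as the registry's defaultVersion. The overlayPin type and findVersionDrift function are hypothetical (they are not part of the AICR codebase), and the sketch assumes the pins have already been parsed out of the overlay YAML files.

```go
package main

import (
	"fmt"
	"sort"
)

// overlayPin is a hypothetical, simplified view of one overlay's
// dynamo-platform entry: the overlay name and the chart version it pins.
type overlayPin struct {
	Overlay string
	Version string
}

// findVersionDrift returns the overlays, sorted by name, whose pinned
// version differs from want (for example, the registry defaultVersion).
func findVersionDrift(pins []overlayPin, want string) []string {
	var drifted []string
	for _, p := range pins {
		if p.Version != want {
			drifted = append(drifted, p.Overlay)
		}
	}
	sort.Strings(drifted)
	return drifted
}

func main() {
	// Illustrative input mirroring finding 1: one bumped overlay and
	// the OKE overlay that was left at 0.9.0.
	pins := []overlayPin{
		{Overlay: "h100-eks-ubuntu-inference-dynamo", Version: "1.0.1"},
		{Overlay: "gb200-oke-ubuntu-inference-dynamo", Version: "0.9.0"},
	}
	for _, name := range findVersionDrift(pins, "1.0.1") {
		fmt.Printf("%s: dynamo-platform pin does not match 1.0.1\n", name)
	}
}
```

Run against all six overlays, a check like this would have flagged the gb200-oke overlay before the shared values.yaml was switched to the 1.0 schema.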

Open Questions

  • Was gb200-oke-ubuntu-inference-dynamo.yaml intentionally excluded (OKE not yet validated with dynamo 1.0), or missed?
  • Does the 1.0.1 chart still require an external grove deployment, or has grove been internalized? If still required, a standalone grove component should be added to the registry and overlays.
  • Should dynamo-crds be removed from overlay dependencyRefs since upgradeCRD: true makes the separate CRD chart redundant?
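
The grove question can be phrased as a simple invariant: once global.grove.install is false, some other deployment path must still exist. A minimal Go sketch, assuming a hypothetical standalone component named "grove" (no such registry entry exists in this PR) and an already-parsed list of overlay dependency names; as the review notes, the check is moot if the 1.0.1 chart internalizes grove.

```go
package main

import "fmt"

// groveDeployed reports whether any path deploys grove: either the
// dynamo-platform subchart (global.grove.install: true) or a standalone
// component listed among the overlay's dependencies.
// "grove" as a component name is a hypothetical placeholder.
func groveDeployed(subchartInstall bool, overlayDeps []string) bool {
	if subchartInstall {
		return true
	}
	for _, dep := range overlayDeps {
		if dep == "grove" {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative input mirroring finding 2: the subchart is disabled
	// and the overlay depends on dynamo-crds but on no grove component.
	deps := []string{"dynamo-crds"}
	if !groveDeployed(false, deps) {
		fmt.Println("grove is not deployed by the subchart or by a standalone component")
	}
}
```

In a real check, both inputs would come from parsing values.yaml and the overlay files; that parsing is omitted here.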

Positive Observations

  • The values.yaml rewrite is clean and well commented: each global.* subchart control has a comment explaining its rationale
  • All 5 updated overlays are consistent (version "1.0.1", same source, same dependencyRefs)
  • Removed stale workarounds (image pin, kube-rbac-proxy) now fixed upstream
  • kai-scheduler dual setting is intentional and correct
  • PR description clearly explains the dynamo-crds decision

Cross-review by Claude Code + Codex + CodeRabbit
