Skip to content

feat: simplify CRD management#6466

Merged
julienmancuso merged 14 commits into
mainfrom
jsm/operator-managed-crds
Feb 24, 2026
Merged

feat: simplify CRD management#6466
julienmancuso merged 14 commits into
mainfrom
jsm/operator-managed-crds

Conversation

@julienmancuso
Copy link
Copy Markdown
Contributor

@julienmancuso julienmancuso commented Feb 21, 2026

Overview:

simplify CRD management

Problem
The Dynamo CRDs total ~2.9MB across 7 files, exceeding the 1MB etcd limit for Helm release Secrets. Currently CRDs live in a separate dynamo-crds chart (deploy/helm/charts/crds/), but this is hitting the limit.

CRDs cannot go into Helm templates/ directory because the rendered output is stored in the Helm release Secret, which is bounded by etcd's 1MB object size limit.

Chosen Approach: NIM Operator Pattern (Single Image)
Embed CRD files and a small Go apply tool directly in the existing operator Docker image. A Helm hook Job runs the same image with an alternate command. No separate image to build or publish.

CRDs are applied unconditionally via server-side apply with --force-conflicts. A dynamo.nvidia.com/operator-version annotation is stamped on each CRD for observability (visible via kubectl get crd -o yaml) but does not gate the apply.

Why This Approach
No separate image to build/publish/version -- only the existing operator image is needed
Proven pattern -- identical to NVIDIA NIM Operator's approach
Server-side apply with --force-conflicts -- handles migration from the old CRD chart (claims field ownership)
Pre-install AND pre-upgrade -- CRDs are applied before both fresh installs and upgrades
helm.sh/resource-policy: keep -- already on dynamo CRDs, prevents CRD deletion on chart uninstall

Summary by CodeRabbit

  • New Features

    • Custom Resource Definitions (CRDs) are now automatically installed during platform deployment, eliminating the need for separate installation steps.
  • Documentation

    • Updated installation guides to reflect the simplified deployment process with integrated CRD management.
    • Clarified that the platform chart now includes and manages all required CRDs automatically.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso requested a review from a team as a code owner February 21, 2026 02:03
@github-actions github-actions Bot added feat documentation Improvements or additions to documentation deployment::k8s Relates to dynamo deployment in kubernetes labels Feb 21, 2026
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 21, 2026

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 21, 2026

Walkthrough

This pull request consolidates CRD (Custom Resource Definition) management by removing the standalone CRDs Helm chart and introducing an automated pre-install/pre-upgrade hook Job that applies CRDs using a new crd-apply binary. CRDs are now installed directly by the platform chart.

Changes

Cohort / File(s) Summary
Deprecated CRDs Chart
deploy/helm/charts/crds/Chart.yaml, deploy/helm/charts/crds/README.md, deploy/helm/charts/crds/templates/nvidia.com_dynamoworkermetadatas.yaml
Removed the standalone CRDs Helm chart and its resources, including the Chart manifest, documentation, and DynamoWorkerMetadata CRD definition. CRD management is now integrated into the operator chart.
Operator Chart Enhancement
deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml, deploy/helm/charts/platform/components/operator/values.yaml
Added new pre-install/pre-upgrade hook Job template and enabled it via a new upgradeCRD flag. The Job applies CRDs using the crd-apply binary with server-side apply, including RBAC resources and Pod security controls.
Platform Chart Configuration
deploy/helm/charts/platform/values.yaml, deploy/helm/README.md
Added upgradeCRD boolean flag to the platform chart and updated the main README to reflect single-chart approach with automatic CRD management via hook Job.
Build Toolchain
deploy/operator/Dockerfile, deploy/operator/Makefile
Updated Dockerfile to build the new crd-apply binary and copy CRDs into the runtime image at /opt/dynamo-operator/crds/. Modified Makefile to distribute CRDs to the operator chart instead of the separate CRDs chart.
CRD Application Tool
deploy/operator/cmd/crd-apply/main.go
New Go program that reads CRD YAML files from a directory, applies them to the cluster using server-side Apply with the "dynamo-crd-apply" field manager, and optionally stamps a version annotation.
Installation Documentation
docs/pages/kubernetes/README.md, docs/pages/kubernetes/dynamo-operator.md, docs/pages/kubernetes/installation-guide.md, docs/pages/reference/support-matrix.md
Updated documentation to remove explicit CRD installation steps and reflect that CRDs are now installed automatically by the platform chart.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops with delight at consolidation divine,
CRDs bundled now, no more charts to align!
A hook and a Job, pre-install so neat,
Platform chart's babies, now one-chart complete!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: simplify CRD management' accurately and concisely summarizes the main change: moving from a separate CRD chart to operator-managed CRDs via a Helm hook Job.
Description check ✅ Passed Pull request description covers overview, problem statement, approach, and rationale but lacks explicit 'Related Issues' section.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/pages/kubernetes/installation-guide.md (2)

15-16: ⚠️ Potential issue | 🟡 Minor

Stale guidance — "skip CRD installation step" no longer applies.

The new hook applies CRDs unconditionally via server-side apply with --force-conflicts, so there is nothing to skip even on shared clusters where CRDs already exist.

📝 Suggested update
-**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
-- CRDs already installed cluster-wide - skip CRD installation step
+**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
+- CRDs are managed automatically by the platform chart (idempotent server-side apply)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/pages/kubernetes/installation-guide.md` around lines 15 - 16, Update the
"Shared/Multi-Tenant Cluster" section header text to remove the instruction to
"skip CRD installation step" and replace it with a note that CRDs are applied
unconditionally by the installer hook using server-side apply with
--force-conflicts; specifically edit the bullet that currently reads 'CRDs
already installed cluster-wide - skip CRD installation step' to state that CRDs
will be applied/updated automatically even if they already exist (mention
server-side apply and --force-conflicts as the mechanism).

321-325: ⚠️ Potential issue | 🟡 Minor

Stale troubleshooting entry — references a "step 2 (CRD installation)" that no longer exists.

With the new flow, Path A has no separate CRD installation step; the platform install (step 2) applies CRDs automatically and idempotently. The guidance to "Skip step 2" is now confusing and inaccurate.

📝 Suggested update
 **CRDs already exist**

 Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).

-Solution: Skip step 2 (CRD installation), proceed directly to platform installation.
-
-To check if CRDs exist:
+Note: CRDs are applied idempotently (server-side apply with `--force-conflicts`) by the pre-install/pre-upgrade hook.
+No action is required — the platform install will reconcile existing CRDs safely.
+
+To verify CRD status after install:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/pages/kubernetes/installation-guide.md` around lines 321 - 325, Update
the "CRDs already exist" troubleshooting entry to remove the outdated reference
to "Skip step 2 (CRD installation)" and instead state that the platform install
applies CRDs automatically and idempotently, so no separate CRD installation is
required; edit the paragraph under the "**CRDs already exist**" heading to
advise users to proceed with the platform installation (Path A) as-is and note
that the installer will detect and skip already-present CRDs.
🧹 Nitpick comments (2)
deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml (1)

66-123: Consider setting an explicit backoffLimit on the Job.

The Kubernetes default backoffLimit is 6. While this is generally reasonable, explicitly declaring it improves readability and ensures the behavior doesn't change if cluster-level defaults are overridden via an admission webhook.

🔧 Suggested addition
 spec:
+  backoffLimit: 3
   template:
     metadata:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml`
around lines 66 - 123, Add an explicit backoffLimit to the Job spec to avoid
relying on cluster defaults: in the Job manifest for the crd-apply job (metadata
name using include "dynamo-operator.fullname" and container "crd-apply"), add a
top-level spec.backoffLimit (e.g. 0–3 depending on desired retries) so the Job's
retry behavior is explicit and not dependent on admission webhooks or cluster
defaults.
deploy/operator/cmd/crd-apply/main.go (1)

72-75: Nit: Only .yaml files are matched; .yml files are silently skipped.

If any CRD file were named with a .yml extension it would be silently ignored. Since the files are generated by controller-gen (which uses .yaml), this is fine today, but consider matching both for resilience.

🔧 Optional fix
-		if entry.IsDir() || !strings.HasSuffix(entry.Name(), ".yaml") {
+		if entry.IsDir() || !(strings.HasSuffix(entry.Name(), ".yaml") || strings.HasSuffix(entry.Name(), ".yml")) {
 			continue
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/cmd/crd-apply/main.go` around lines 72 - 75, The loop that
filters CRD files only accepts filenames ending with ".yaml" (in the block that
iterates over entries in main.go), which silently skips ".yml" files; update the
filter in the loop that contains entry.IsDir() check to accept both ".yaml" and
".yml" (e.g., check strings.HasSuffix(entry.Name(), ".yaml") ||
strings.HasSuffix(entry.Name(), ".yml")) so CRD files with either extension are
processed by the code that loads/applies CRDs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/operator/Dockerfile`:
- Around line 57-58: The container fails to find crd-apply because the Helm hook
uses a bare command name ("crd-apply") while the Dockerfile installs crd-apply
to /usr/local/bin but the base image’s PATH may not include that directory; fix
by either (A) updating the Helm template (e.g., upgrade-crd.yaml) to invoke the
tool with its absolute path "/usr/local/bin/crd-apply" instead of "crd-apply",
or (B) modify the Dockerfile to ensure /usr/local/bin is in PATH (set ENV
PATH=/usr/local/bin:$PATH) so the bare name resolves at runtime; reference the
crd-apply binary and the Dockerfile/upgrade-crd.yaml invocation when making the
change.

In `@deploy/operator/Makefile`:
- Around line 114-116: The CI fails because the Makefile manifests target copies
generated CRD YAMLs into the operator Helm chart crds directory but those
generated files are not committed; run the manifests target locally (make
manifests) to generate the CRDs and commit the resulting YAMLs into the
repository at the operator Helm chart crds location so that the manifests target
no longer produces uncommitted changes during make check; update the repo by
adding and committing the generated CRD files so subsequent CI runs see a clean
git status.

---

Outside diff comments:
In `@docs/pages/kubernetes/installation-guide.md`:
- Around line 15-16: Update the "Shared/Multi-Tenant Cluster" section header
text to remove the instruction to "skip CRD installation step" and replace it
with a note that CRDs are applied unconditionally by the installer hook using
server-side apply with --force-conflicts; specifically edit the bullet that
currently reads 'CRDs already installed cluster-wide - skip CRD installation
step' to state that CRDs will be applied/updated automatically even if they
already exist (mention server-side apply and --force-conflicts as the
mechanism).
- Around line 321-325: Update the "CRDs already exist" troubleshooting entry to
remove the outdated reference to "Skip step 2 (CRD installation)" and instead
state that the platform install applies CRDs automatically and idempotently, so
no separate CRD installation is required; edit the paragraph under the "**CRDs
already exist**" heading to advise users to proceed with the platform
installation (Path A) as-is and note that the installer will detect and skip
already-present CRDs.

---

Nitpick comments:
In `@deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml`:
- Around line 66-123: Add an explicit backoffLimit to the Job spec to avoid
relying on cluster defaults: in the Job manifest for the crd-apply job (metadata
name using include "dynamo-operator.fullname" and container "crd-apply"), add a
top-level spec.backoffLimit (e.g. 0–3 depending on desired retries) so the Job's
retry behavior is explicit and not dependent on admission webhooks or cluster
defaults.

In `@deploy/operator/cmd/crd-apply/main.go`:
- Around line 72-75: The loop that filters CRD files only accepts filenames
ending with ".yaml" (in the block that iterates over entries in main.go), which
silently skips ".yml" files; update the filter in the loop that contains
entry.IsDir() check to accept both ".yaml" and ".yml" (e.g., check
strings.HasSuffix(entry.Name(), ".yaml") || strings.HasSuffix(entry.Name(),
".yml")) so CRD files with either extension are processed by the code that
loads/applies CRDs.

Comment thread deploy/operator/Dockerfile Outdated
Comment thread deploy/operator/Makefile
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Comment thread deploy/operator/config/crd/bases/nvidia.com_dynamoworkermetadatas.yaml Outdated
Comment thread deploy/operator/cmd/crd-apply/main.go Outdated
Comment thread deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml Outdated
Comment thread deploy/operator/cmd/crd-apply/main.go
Comment thread deploy/helm/README.md Outdated
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso requested a review from a team February 23, 2026 20:32
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso merged commit 5277fb9 into main Feb 24, 2026
79 of 83 checks passed
@julienmancuso julienmancuso deleted the jsm/operator-managed-crds branch February 24, 2026 02:18
@ranrubin ranrubin mentioned this pull request Feb 26, 2026
yao531441 pushed a commit to yao531441/dynamo that referenced this pull request May 13, 2026
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation Dynamo 1.0.0 feat size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants