feat: simplify CRD management#6466
Conversation
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
WalkthroughThis pull request consolidates CRD (Custom Resource Definition) management by removing the standalone CRDs Helm chart and introducing an automated pre-install/pre-upgrade hook Job that applies CRDs using a new crd-apply binary. CRDs are now installed directly by the platform chart. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/pages/kubernetes/installation-guide.md (2)
15-16:⚠️ Potential issue | 🟡 MinorStale guidance — "skip CRD installation step" no longer applies.
The new hook applies CRDs unconditionally via server-side apply with
--force-conflicts, so there is nothing to skip even on shared clusters where CRDs already exist.📝 Suggested update
-**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts): -- CRDs already installed cluster-wide - skip CRD installation step +**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts): +- CRDs are managed automatically by the platform chart (idempotent server-side apply)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/pages/kubernetes/installation-guide.md` around lines 15 - 16, Update the "Shared/Multi-Tenant Cluster" section header text to remove the instruction to "skip CRD installation step" and replace it with a note that CRDs are applied unconditionally by the installer hook using server-side apply with --force-conflicts; specifically edit the bullet that currently reads 'CRDs already installed cluster-wide - skip CRD installation step' to state that CRDs will be applied/updated automatically even if they already exist (mention server-side apply and --force-conflicts as the mechanism).
321-325:⚠️ Potential issue | 🟡 MinorStale troubleshooting entry — references a "step 2 (CRD installation)" that no longer exists.
With the new flow, Path A has no separate CRD installation step; the platform install (step 2) applies CRDs automatically and idempotently. The guidance to "Skip step 2" is now confusing and inaccurate.
📝 Suggested update
**CRDs already exist** Cause: Installing CRDs on a cluster where they're already present (common on shared clusters). -Solution: Skip step 2 (CRD installation), proceed directly to platform installation. - -To check if CRDs exist: +Note: CRDs are applied idempotently (server-side apply with `--force-conflicts`) by the pre-install/pre-upgrade hook. +No action is required — the platform install will reconcile existing CRDs safely. + +To verify CRD status after install:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/pages/kubernetes/installation-guide.md` around lines 321 - 325, Update the "CRDs already exist" troubleshooting entry to remove the outdated reference to "Skip step 2 (CRD installation)" and instead state that the platform install applies CRDs automatically and idempotently, so no separate CRD installation is required; edit the paragraph under the "**CRDs already exist**" heading to advise users to proceed with the platform installation (Path A) as-is and note that the installer will detect and skip already-present CRDs.
🧹 Nitpick comments (2)
deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml (1)
66-123: Consider setting an explicitbackoffLimiton the Job.The Kubernetes default
backoffLimitis 6. While this is generally reasonable, explicitly declaring it improves readability and ensures the behavior doesn't change if cluster-level defaults are overridden via an admission webhook.🔧 Suggested addition
spec: + backoffLimit: 3 template: metadata:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml` around lines 66 - 123, Add an explicit backoffLimit to the Job spec to avoid relying on cluster defaults: in the Job manifest for the crd-apply job (metadata name using include "dynamo-operator.fullname" and container "crd-apply"), add a top-level spec.backoffLimit (e.g. 0–3 depending on desired retries) so the Job's retry behavior is explicit and not dependent on admission webhooks or cluster defaults.deploy/operator/cmd/crd-apply/main.go (1)
72-75: Nit: Only.yamlfiles are matched;.ymlfiles are silently skipped.If any CRD file were named with a
.ymlextension it would be silently ignored. Since the files are generated bycontroller-gen(which uses.yaml), this is fine today, but consider matching both for resilience.🔧 Optional fix
- if entry.IsDir() || !strings.HasSuffix(entry.Name(), ".yaml") { + if entry.IsDir() || !(strings.HasSuffix(entry.Name(), ".yaml") || strings.HasSuffix(entry.Name(), ".yml")) { continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/cmd/crd-apply/main.go` around lines 72 - 75, The loop that filters CRD files only accepts filenames ending with ".yaml" (in the block that iterates over entries in main.go), which silently skips ".yml" files; update the filter in the loop that contains entry.IsDir() check to accept both ".yaml" and ".yml" (e.g., check strings.HasSuffix(entry.Name(), ".yaml") || strings.HasSuffix(entry.Name(), ".yml")) so CRD files with either extension are processed by the code that loads/applies CRDs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deploy/operator/Dockerfile`:
- Around line 57-58: The container fails to find crd-apply because the Helm hook
uses a bare command name ("crd-apply") while the Dockerfile installs crd-apply
to /usr/local/bin but the base image’s PATH may not include that directory; fix
by either (A) updating the Helm template (e.g., upgrade-crd.yaml) to invoke the
tool with its absolute path "/usr/local/bin/crd-apply" instead of "crd-apply",
or (B) modify the Dockerfile to ensure /usr/local/bin is in PATH (set ENV
PATH=/usr/local/bin:$PATH) so the bare name resolves at runtime; reference the
crd-apply binary and the Dockerfile/upgrade-crd.yaml invocation when making the
change.
In `@deploy/operator/Makefile`:
- Around line 114-116: The CI fails because the Makefile manifests target copies
generated CRD YAMLs into the operator Helm chart crds directory but those
generated files are not committed; run the manifests target locally (make
manifests) to generate the CRDs and commit the resulting YAMLs into the
repository at the operator Helm chart crds location so that the manifests target
no longer produces uncommitted changes during make check; update the repo by
adding and committing the generated CRD files so subsequent CI runs see a clean
git status.
---
Outside diff comments:
In `@docs/pages/kubernetes/installation-guide.md`:
- Around line 15-16: Update the "Shared/Multi-Tenant Cluster" section header
text to remove the instruction to "skip CRD installation step" and replace it
with a note that CRDs are applied unconditionally by the installer hook using
server-side apply with --force-conflicts; specifically edit the bullet that
currently reads 'CRDs already installed cluster-wide - skip CRD installation
step' to state that CRDs will be applied/updated automatically even if they
already exist (mention server-side apply and --force-conflicts as the
mechanism).
- Around line 321-325: Update the "CRDs already exist" troubleshooting entry to
remove the outdated reference to "Skip step 2 (CRD installation)" and instead
state that the platform install applies CRDs automatically and idempotently, so
no separate CRD installation is required; edit the paragraph under the "**CRDs
already exist**" heading to advise users to proceed with the platform
installation (Path A) as-is and note that the installer will detect and skip
already-present CRDs.
---
Nitpick comments:
In `@deploy/helm/charts/platform/components/operator/templates/upgrade-crd.yaml`:
- Around line 66-123: Add an explicit backoffLimit to the Job spec to avoid
relying on cluster defaults: in the Job manifest for the crd-apply job (metadata
name using include "dynamo-operator.fullname" and container "crd-apply"), add a
top-level spec.backoffLimit (e.g. 0–3 depending on desired retries) so the Job's
retry behavior is explicit and not dependent on admission webhooks or cluster
defaults.
In `@deploy/operator/cmd/crd-apply/main.go`:
- Around line 72-75: The loop that filters CRD files only accepts filenames
ending with ".yaml" (in the block that iterates over entries in main.go), which
silently skips ".yml" files; update the filter in the loop that contains
entry.IsDir() check to accept both ".yaml" and ".yml" (e.g., check
strings.HasSuffix(entry.Name(), ".yaml") || strings.HasSuffix(entry.Name(),
".yml")) so CRD files with either extension are processed by the code that
loads/applies CRDs.
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Overview:
simplify CRD management
Problem
The Dynamo CRDs total ~2.9MB across 7 files, exceeding the 1MB etcd limit for Helm release Secrets. Currently CRDs live in a separate dynamo-crds chart (deploy/helm/charts/crds/), but this is hitting the limit.
CRDs cannot go into Helm templates/ directory because the rendered output is stored in the Helm release Secret, which is bounded by etcd's 1MB object size limit.
Chosen Approach: NIM Operator Pattern (Single Image)
Embed CRD files and a small Go apply tool directly in the existing operator Docker image. A Helm hook Job runs the same image with an alternate command. No separate image to build or publish.
CRDs are applied unconditionally via server-side apply with --force-conflicts. A dynamo.nvidia.com/operator-version annotation is stamped on each CRD for observability (visible via kubectl get crd -o yaml) but does not gate the apply.
Why This Approach
No separate image to build/publish/version -- only the existing operator image is needed
Proven pattern -- identical to NVIDIA NIM Operator's approach
Server-side apply with --force-conflicts -- handles migration from the old CRD chart (claims field ownership)
Pre-install AND pre-upgrade -- CRDs are applied before both fresh installs and upgrades
helm.sh/resource-policy: keep -- already on dynamo CRDs, prevents CRD deletion on chart uninstall
Summary by CodeRabbit
New Features
Documentation