refactor: move MPI SSH key generation from Helm hook Job into operator reconciliation#6940
Conversation
…r reconciliation Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
WalkthroughThis pull request transitions SSH key generation and certificate management from Helm-based hooks to operator-controlled components. The operator now includes a built-in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/operator/internal/secret/ssh_key_manager.go (1)
136-172: Consider whether replicas should be updated when source secret changes.The current implementation skips replication if a secret already exists in the target namespace (line 138-139). If the source secret is ever rotated/regenerated, existing replicas would become stale.
This may be intentional (SSH keys are long-lived and shouldn't change), but worth confirming whether key rotation is a future requirement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/secret/ssh_key_manager.go` around lines 136 - 172, The replicateToNamespace function currently returns early if a target secret exists; change that to fetch the existing target (use m.client.Get into a corev1.Secret variable instead of returning on nil), load the source (source := &corev1.Secret...) as now, then compare relevant fields (Data, Type, Labels) between source and target and, if they differ, update the target secret's Data/Type/Labels and call m.client.Update (or perform a strategic merge Patch) to apply changes; keep the existing create path using m.client.Create and still handle apierrors.IsAlreadyExists/IsNotFound appropriately, handle update conflicts/errors and log when you perform a replica update (same logging as current creation log).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/kubernetes/webhooks.md`:
- Around line 196-203: Update the documentation steps to stop using the
hard-coded Secret name `<release>-webhook-server-cert` and instead reference the
configured webhook.certificateSecret.name (noting its default value), e.g., in
the descriptions for CertManager, CABundleInjector, and certificate rotation;
update the text around the CertManager, CABundleInjector, and webhook server
start so they explicitly say "reads/writes the Secret named by
webhook.certificateSecret.name (default: <release>-webhook-server-cert)" to make
it clear the name is configurable.
---
Nitpick comments:
In `@deploy/operator/internal/secret/ssh_key_manager.go`:
- Around line 136-172: The replicateToNamespace function currently returns early
if a target secret exists; change that to fetch the existing target (use
m.client.Get into a corev1.Secret variable instead of returning on nil), load
the source (source := &corev1.Secret...) as now, then compare relevant fields
(Data, Type, Labels) between source and target and, if they differ, update the
target secret's Data/Type/Labels and call m.client.Update (or perform a
strategic merge Patch) to apply changes; keep the existing create path using
m.client.Create and still handle apierrors.IsAlreadyExists/IsNotFound
appropriately, handle update conflicts/errors and log when you perform a replica
update (same logging as current creation log).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b3ca5392-d443-4a3b-b772-d639e2259785
⛔ Files ignored due to path filters (1)
deploy/operator/go.sumis excluded by!**/*.sum
📒 Files selected for processing (15)
deploy/helm/charts/platform/README.mddeploy/helm/charts/platform/README.md.gotmpldeploy/helm/charts/platform/components/operator/templates/mpi-run-ssh-keygen-job.yamldeploy/helm/charts/platform/components/operator/values.yamldeploy/helm/charts/platform/values.yamldeploy/operator/cmd/main.godeploy/operator/go.moddeploy/operator/internal/controller/dynamographdeployment_controller.godeploy/operator/internal/secret/secret_replicator.godeploy/operator/internal/secret/secret_replicator_test.godeploy/operator/internal/secret/ssh_key_manager.godeploy/operator/internal/secret/ssh_key_manager_test.godocs/kubernetes/deployment/multinode-deployment.mddocs/kubernetes/dynamo-operator.mddocs/kubernetes/webhooks.md
💤 Files with no reviewable changes (4)
- deploy/operator/internal/secret/secret_replicator.go
- deploy/operator/internal/secret/secret_replicator_test.go
- deploy/helm/charts/platform/components/operator/values.yaml
- deploy/helm/charts/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
…r reconciliation Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…r reconciliation Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…r reconciliation Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…r reconciliation (#6940) Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…r reconciliation (ai-dynamo#6940) Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Summary
Summary by CodeRabbit
New Features
Documentation
Chores