refactor: move MPI SSH key generation from Helm hook Job into operator reconciliation #6940

Merged
julienmancuso merged 10 commits into main from jsm/dep-786 on Mar 6, 2026

Conversation

@julienmancuso
Contributor

@julienmancuso julienmancuso commented Mar 5, 2026

Summary

  • Replace the mpi-run-ssh-keygen-job.yaml Helm hook Job with native Go code in the operator that lazily generates SSH keypairs during DynamoGraphDeployment reconciliation, only when a multi-node service is detected
  • Add SSHKeyManager with EnsureAndReplicate method that creates the SSH key secret in the operator namespace and replicates it to deployment namespaces, skipping generation if a valid secret already exists
  • Remove deprecated dynamo.mpiRun.sshKeygen Helm values and update docs to reflect that SSH keys are now auto-managed by the operator

Summary by CodeRabbit

  • New Features

    • Automatic SSH key generation for multi-node deployments
    • Built-in certificate generation and rotation via operator at startup
  • Documentation

    • Updated certificate management guidance for operator-driven generation
    • Clarified cert-manager integration as optional
  • Chores

    • Removed legacy certificate generation configuration

…r reconciliation

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso requested a review from a team as a code owner March 5, 2026 16:39
@coderabbitai
Contributor

coderabbitai Bot commented Mar 5, 2026

Walkthrough

This pull request transitions SSH key generation and certificate management from Helm-based hooks to operator-controlled components. The operator now includes a built-in SSHKeyManager for SSH key pair generation and secret replication in multi-node deployments, while certificate management shifts from Helm-hook-based provisioning to operator-driven cert-controller generation and rotation. Helm charts and documentation are updated accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Helm Chart & Values Documentation: deploy/helm/charts/platform/README.md, deploy/helm/charts/platform/README.md.gotmpl | Updated certificate generation descriptions from "auto-generated via Helm hooks" to "auto-generated by the operator's built-in cert-controller." Clarified that the operator generates and rotates certificates at startup, with no action required during upgrades. |
| Helm Chart Configuration: deploy/helm/charts/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml, deploy/helm/charts/platform/components/operator/values.yaml, deploy/helm/charts/platform/values.yaml | Removed entire SSH key generation Helm template (141 lines) and its RBAC/ServiceAccount resources. Eliminated sshKeygen configuration blocks from values. Removed webhook certificate validity and legacy certGenerator image configurations. |
| Operator SSH Key Management: deploy/operator/internal/secret/ssh_key_manager.go, deploy/operator/internal/secret/ssh_key_manager_test.go | Added new SSHKeyManager component (172 lines) that lazily generates 2048-bit RSA key pairs and replicates secrets to target namespaces. Includes KeyPairGenerator interface and concrete RSA implementation with 232 lines of comprehensive tests. |
| Operator Secret Replication Removal: deploy/operator/internal/secret/secret_replicator.go, deploy/operator/internal/secret/secret_replicator_test.go | Removed deprecated SecretReplicator type (93 lines) and its test suite (207 lines) that handled generic secret cross-namespace replication. |
| Operator Controller Wiring: deploy/operator/cmd/main.go, deploy/operator/internal/controller/dynamographdeployment_controller.go | Updated registerControllers signature to accept SSHKeyManager instead of SecretReplicator. Modified DynamoGraphDeploymentReconciler to call EnsureAndReplicate for multinode deployments with new error handling semantics. |
| Go Module Dependencies: deploy/operator/go.mod | Added golang.org/x/crypto v0.48.0 and updated indirect dependencies (golang.org/x/mod, golang.org/x/net, golang.org/x/sys, golang.org/x/term, golang.org/x/text, golang.org/x/tools). |
| Kubernetes Documentation: docs/kubernetes/deployment/multinode-deployment.md, docs/kubernetes/dynamo-operator.md, docs/kubernetes/webhooks.md | Clarified automatic SSH key generation by the operator for multi-node deployments. Updated certificate management features to emphasize "automatic certificate generation and rotation (default)" vs cert-manager integration. Expanded webhook documentation with detailed lifecycle and PKI integration guidance. |
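
The ensure-then-replicate flow summarized in the table can be sketched as follows. This is an assumption-laden illustration rather than the code from ssh_key_manager.go: the in-memory `secretStore` stands in for the Kubernetes API, the `Generate` signature and the data keys `"private"` and `"public"` are hypothetical, and the `fakeGenerator` only shows why a KeyPairGenerator interface makes the behavior testable.

```go
package main

import (
	"errors"
	"fmt"
)

// KeyPairGenerator abstracts key creation so tests can inject a fake.
// The method signature here is an assumption, not the PR's actual interface.
type KeyPairGenerator interface {
	Generate() (private, public []byte, err error)
}

// fakeGenerator counts calls and returns fixed bytes.
type fakeGenerator struct{ calls int }

func (f *fakeGenerator) Generate() ([]byte, []byte, error) {
	f.calls++
	return []byte("private"), []byte("public"), nil
}

// secretStore is a hypothetical in-memory stand-in for the Kubernetes API,
// keyed by "namespace/name".
type secretStore map[string]map[string][]byte

// SSHKeyManager here is a simplified model of the component named in the PR.
type SSHKeyManager struct {
	store      secretStore
	gen        KeyPairGenerator
	operatorNS string
	secretName string
}

func storeKey(ns, name string) string { return ns + "/" + name }

// valid reports whether a secret carries both halves of the key pair.
func valid(s map[string][]byte) bool {
	return len(s["private"]) > 0 && len(s["public"]) > 0
}

// EnsureAndReplicate creates the source secret in the operator namespace
// if it is missing or invalid, then copies it to targetNS unless a secret
// with the same name already exists there.
func (m *SSHKeyManager) EnsureAndReplicate(targetNS string) error {
	if targetNS == "" {
		return errors.New("target namespace is required")
	}
	srcKey := storeKey(m.operatorNS, m.secretName)
	src, ok := m.store[srcKey]
	if !ok || !valid(src) {
		priv, pub, err := m.gen.Generate()
		if err != nil {
			return fmt.Errorf("generate SSH key pair: %w", err)
		}
		src = map[string][]byte{"private": priv, "public": pub}
		m.store[srcKey] = src
	}

	dstKey := storeKey(targetNS, m.secretName)
	if _, exists := m.store[dstKey]; exists {
		return nil // also covers targetNS == operatorNS
	}
	replica := make(map[string][]byte, len(src))
	for k, v := range src {
		replica[k] = append([]byte(nil), v...)
	}
	m.store[dstKey] = replica
	return nil
}

func main() {
	gen := &fakeGenerator{}
	m := &SSHKeyManager{store: secretStore{}, gen: gen, operatorNS: "operator", secretName: "mpi-ssh"}
	for _, ns := range []string{"team-a", "team-b"} {
		if err := m.EnsureAndReplicate(ns); err != nil {
			panic(err)
		}
	}
	// prints "generator calls: 1, secrets stored: 3"
	fmt.Printf("generator calls: %d, secrets stored: %d\n", gen.calls, len(m.store))
}
```

In this model the generator runs once regardless of how many namespaces request the secret, which mirrors the "skipping generation if a valid secret already exists" behavior described in the PR summary.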

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 The hooks have retired from the night,
The operator now holds the key so tight,
SSH keys spun fresh in crypto's light,
Certificates dancing through startup's might,
No more helm incantations required—the operator gets it right! 🔐✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description covers the key changes and objectives but lacks structured sections matching the template (Overview, Details, Where to start, Related Issues). | Restructure the description to follow the template with Overview, Details, Where to start, and Related Issues sections for better clarity and consistency. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main change: moving MPI SSH key generation from Helm hooks into operator reconciliation, which aligns with the primary refactoring objective. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
deploy/operator/internal/secret/ssh_key_manager.go (1)

136-172: Consider whether replicas should be updated when source secret changes.

The current implementation skips replication if a secret already exists in the target namespace (lines 138-139). If the source secret is ever rotated or regenerated, existing replicas would become stale.

This may be intentional (SSH keys are long-lived and shouldn't change), but worth confirming whether key rotation is a future requirement.
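
To make the suggested alternative concrete, here is a minimal sketch of a compare-and-update replication step, assuming key rotation does become a requirement. It models secrets as plain Go values instead of calling the Kubernetes client; `Secret`, `syncReplica`, and the returned status strings are hypothetical names and not part of the PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// Secret is a simplified, hypothetical stand-in for corev1.Secret.
type Secret struct {
	Type string
	Data map[string][]byte
}

// dataEqual reports whether two secret payloads hold identical keys and bytes.
func dataEqual(a, b map[string][]byte) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		w, ok := b[k]
		if !ok || !bytes.Equal(v, w) {
			return false
		}
	}
	return true
}

// cloneSecret deep-copies a secret so replicas never share backing arrays
// with the source.
func cloneSecret(s Secret) Secret {
	data := make(map[string][]byte, len(s.Data))
	for k, v := range s.Data {
		data[k] = append([]byte(nil), v...)
	}
	return Secret{Type: s.Type, Data: data}
}

// syncReplica creates the replica when it is missing, overwrites it when it
// differs from the source, and otherwise leaves it alone. It returns
// "created", "updated", or "unchanged".
func syncReplica(source Secret, replicas map[string]Secret, ns string) string {
	target, exists := replicas[ns]
	switch {
	case !exists:
		replicas[ns] = cloneSecret(source)
		return "created"
	case target.Type != source.Type || !dataEqual(target.Data, source.Data):
		replicas[ns] = cloneSecret(source)
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	source := Secret{Type: "Opaque", Data: map[string][]byte{"private": []byte("v1")}}
	replicas := map[string]Secret{}

	fmt.Println(syncReplica(source, replicas, "team-a")) // prints "created"
	fmt.Println(syncReplica(source, replicas, "team-a")) // prints "unchanged"

	// Simulate a rotation of the source key.
	source.Data["private"] = []byte("v2")
	fmt.Println(syncReplica(source, replicas, "team-a")) // prints "updated"
}
```

Against a real cluster the "updated" branch would also need to handle update conflicts, as the review comment notes; this sketch leaves that out.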

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/secret/ssh_key_manager.go` around lines 136 - 172,
The replicateToNamespace function currently returns early if a target secret
exists; change that to fetch the existing target (use m.client.Get into a
corev1.Secret variable instead of returning on nil), load the source (source :=
&corev1.Secret...) as now, then compare relevant fields (Data, Type, Labels)
between source and target and, if they differ, update the target secret's
Data/Type/Labels and call m.client.Update (or perform a strategic merge Patch)
to apply changes; keep the existing create path using m.client.Create and still
handle apierrors.IsAlreadyExists/IsNotFound appropriately, handle update
conflicts/errors and log when you perform a replica update (same logging as
current creation log).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/kubernetes/webhooks.md`:
- Around line 196-203: Update the documentation steps to stop using the
hard-coded Secret name `<release>-webhook-server-cert` and instead reference the
configured webhook.certificateSecret.name (noting its default value), e.g., in
the descriptions for CertManager, CABundleInjector, and certificate rotation;
update the text around the CertManager, CABundleInjector, and webhook server
start so they explicitly say "reads/writes the Secret named by
webhook.certificateSecret.name (default: <release>-webhook-server-cert)" to make
it clear the name is configurable.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b3ca5392-d443-4a3b-b772-d639e2259785

📥 Commits

Reviewing files that changed from the base of the PR and between 7f1f307 and c80ada1.

⛔ Files ignored due to path filters (1)
  • deploy/operator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (15)
  • deploy/helm/charts/platform/README.md
  • deploy/helm/charts/platform/README.md.gotmpl
  • deploy/helm/charts/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
  • deploy/helm/charts/platform/components/operator/values.yaml
  • deploy/helm/charts/platform/values.yaml
  • deploy/operator/cmd/main.go
  • deploy/operator/go.mod
  • deploy/operator/internal/controller/dynamographdeployment_controller.go
  • deploy/operator/internal/secret/secret_replicator.go
  • deploy/operator/internal/secret/secret_replicator_test.go
  • deploy/operator/internal/secret/ssh_key_manager.go
  • deploy/operator/internal/secret/ssh_key_manager_test.go
  • docs/kubernetes/deployment/multinode-deployment.md
  • docs/kubernetes/dynamo-operator.md
  • docs/kubernetes/webhooks.md
💤 Files with no reviewable changes (4)
  • deploy/operator/internal/secret/secret_replicator.go
  • deploy/operator/internal/secret/secret_replicator_test.go
  • deploy/helm/charts/platform/components/operator/values.yaml
  • deploy/helm/charts/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml

Comment thread docs/kubernetes/webhooks.md Outdated
@github-actions github-actions Bot added the documentation (Improvements or additions to documentation), deployment::k8s (Relates to dynamo deployment in kubernetes), and refactor labels Mar 5, 2026
…r reconciliation

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@github-actions
Contributor

github-actions Bot commented Mar 5, 2026

Comment thread deploy/operator/internal/secret/ssh_key_manager.go Outdated
Comment thread deploy/operator/internal/secret/ssh_key_manager.go Outdated
…r reconciliation

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…r reconciliation

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso merged commit fc2d01d into main Mar 6, 2026
86 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-786 branch March 6, 2026 18:25
julienmancuso added a commit that referenced this pull request Mar 7, 2026
…r reconciliation (#6940)

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
yao531441 pushed a commit to yao531441/dynamo that referenced this pull request May 13, 2026
…r reconciliation (ai-dynamo#6940)

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Labels

  • deployment::k8s: Relates to dynamo deployment in kubernetes
  • documentation: Improvements or additions to documentation
  • refactor
  • size/XXL

2 participants