feat(operator): add GMSCheckpointSpec for loader/saver GMS clients by galletas1712 · Pull Request #9641 · ai-dynamo/dynamo

galletas1712 · 2026-05-15T21:36:03Z

Summary

Adds the GMS checkpoint client override API and wires it through the v1beta1 controller paths so custom GMS saver/loader client containers work for both auto-created checkpoints and restore-time services.

Users can now configure:

gpuMemoryService.checkpoint.saver for the gms-saver checkpoint client used by checkpoint Jobs.
gpuMemoryService.checkpoint.loader for the gms-loader checkpoint client used by restore-target pods.

If no override is provided, the operator builds the same default GMS saver/loader client containers as before.

Size / review notes

Most of the apparent diff size is generated CRD YAML: the new GMS client spec includes Kubernetes EnvVar and VolumeMount types, so controller-gen expands those schemas in both the operator CRDs and copied Helm CRDs.

The hand-written logic is concentrated in:

API shape + conversion: api/v1alpha1/common.go, api/v1beta1/common.go, api/v1alpha1/shared_spec_conversion.go
Checkpoint client layering: internal/checkpoint/gms.go
DGD/DCD controller propagation: internal/controller/gms_checkpoint_overrides.go plus the two reconcile call sites
Validation/tests

What changed

API / CRDs
- Added GPUMemoryServiceSpec.Checkpoint in both v1alpha1 and v1beta1.
- Added GMSCheckpointSpec with independent loader and saver specs.
- Added GMSClientSpec fields for image, full command, envs, envFromSecret, and volumeMounts.
- Added v1alpha1 <-> v1beta1 conversion and regenerated deepcopy + operator/Helm CRDs.
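
As a rough illustration of the conversion step, the sketch below copies a client spec from one API version to the other while deep-copying its slice fields, so that later mutation of the converted object cannot leak back into the source. The type names `AlphaClientSpec`, `BetaClientSpec`, and the helper `convertClientSpec` are hypothetical stand-ins, and `EnvVar`/`VolumeMount` are simplified placeholders for the Kubernetes core types (the real `EnvVar` also carries pointer fields that would need their own deep copy).

```go
package main

import "fmt"

// EnvVar and VolumeMount are simplified stand-ins for the Kubernetes core types.
type EnvVar struct{ Name, Value string }
type VolumeMount struct{ Name, MountPath string }

// AlphaClientSpec and BetaClientSpec are hypothetical mirrors of the client
// spec in v1alpha1 and v1beta1; the real types live in the operator's api packages.
type AlphaClientSpec struct {
	Image         string
	Command       []string
	Envs          []EnvVar
	EnvFromSecret string
	VolumeMounts  []VolumeMount
}

type BetaClientSpec struct {
	Image         string
	Command       []string
	Envs          []EnvVar
	EnvFromSecret string
	VolumeMounts  []VolumeMount
}

// convertClientSpec copies every field and allocates fresh backing arrays
// for the slices so the two versions never alias each other.
func convertClientSpec(in *AlphaClientSpec) *BetaClientSpec {
	if in == nil {
		return nil
	}
	return &BetaClientSpec{
		Image:         in.Image,
		Command:       append([]string(nil), in.Command...),
		Envs:          append([]EnvVar(nil), in.Envs...),
		EnvFromSecret: in.EnvFromSecret,
		VolumeMounts:  append([]VolumeMount(nil), in.VolumeMounts...),
	}
}

func main() {
	src := &AlphaClientSpec{Image: "my-gms-saver:latest", Command: []string{"/bin/gms-save"}}
	dst := convertClientSpec(src)
	dst.Command[0] = "/bin/changed" // must not affect src
	fmt.Println(src.Command[0], dst.Command[0])
}
```
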
GMS checkpoint client injection
- EnsureGMSRestoreSidecars layers checkpoint.loader onto the default gms-loader client container.
- EnsureGMSCheckpointJobSidecars layers checkpoint.saver onto the default gms-saver client container.
- The operator still constructs the default container first, then overlays user-provided fields.
- GMS_SOCKET_DIR remains operator-owned so the injected clients always use the operator-managed UDS mount.
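
The layering described above can be sketched roughly as follows. This is not the operator's code: `Container`, `ClientSpec`, and `overlayClient` are hypothetical simplifications, the default image and socket path are placeholders, and name-collision handling for other env vars is deliberately left out. It only shows the default-first, overlay-second order, the full replacement of the command, and the guard on `GMS_SOCKET_DIR`.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for the Kubernetes core types.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name    string
	Image   string
	Command []string
	Env     []EnvVar
}

// ClientSpec is a hypothetical, trimmed-down version of the client override spec.
type ClientSpec struct {
	Image   string
	Command []string
	Envs    []EnvVar
}

const socketDirEnv = "GMS_SOCKET_DIR"

// overlayClient layers user fields onto the operator-built default container.
// The container name is never touched, and user attempts to set the
// operator-owned socket dir env var are dropped.
func overlayClient(base Container, spec *ClientSpec) Container {
	if spec == nil {
		return base // no override: defaults are used unchanged
	}
	if spec.Image != "" {
		base.Image = spec.Image
	}
	if len(spec.Command) > 0 {
		base.Command = append([]string(nil), spec.Command...) // full argv replacement
	}
	env := append([]EnvVar(nil), base.Env...)
	for _, e := range spec.Envs {
		if e.Name == socketDirEnv {
			continue // operator-owned, ignore the user value
		}
		env = append(env, e)
	}
	base.Env = env
	return base
}

func main() {
	def := Container{
		Name:    "gms-saver",
		Image:   "operator-default:latest", // placeholder default image
		Command: []string{"python3", "-m", "gpu_memory_service.cli.snapshot.saver"},
		Env:     []EnvVar{{Name: socketDirEnv, Value: "/placeholder/socket"}},
	}
	out := overlayClient(def, &ClientSpec{
		Image:   "my-gms-saver:latest",
		Command: []string{"/bin/gms-save"},
		Envs:    []EnvVar{{Name: "GMS_TRANSFER_BACKEND", Value: "nixl-gds"}},
	})
	fmt.Printf("%+v\n", out)
}
```
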
DGD auto checkpoint flow
- When a DGD in auto checkpoint mode creates a DynamoCheckpoint, the generated CR now carries the service's GMS saver override.
- Loader overrides are intentionally stripped from the generated DynamoCheckpoint, because checkpoint Jobs only save; restore-time loader configuration belongs to the consuming service.
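
A minimal sketch of the stripping step, assuming simplified types: only `gmsSpecForAutoCheckpointSave` is named elsewhere in this PR, and the function below (`specForAutoSave`) is a hypothetical approximation of it, not its actual body. Dropping the `Checkpoint` block entirely when no saver is set is an assumption made here for tidiness.

```go
package main

import "fmt"

// Hypothetical, trimmed-down versions of the API types.
type ClientSpec struct{ Image string }
type CheckpointSpec struct{ Loader, Saver *ClientSpec }
type GMSSpec struct {
	Enabled    bool
	Checkpoint *CheckpointSpec
}

// specForAutoSave returns a copy of the service's GMS spec that is safe to
// stamp onto a generated DynamoCheckpoint: the saver is kept, the loader is dropped.
func specForAutoSave(svc *GMSSpec) *GMSSpec {
	if svc == nil {
		return nil
	}
	out := *svc // shallow copy; pointer fields are replaced below
	if svc.Checkpoint == nil || svc.Checkpoint.Saver == nil {
		out.Checkpoint = nil // assumption: nothing save-related to carry over
		return &out
	}
	saver := *svc.Checkpoint.Saver
	out.Checkpoint = &CheckpointSpec{Saver: &saver}
	return &out
}

func main() {
	svc := &GMSSpec{Enabled: true, Checkpoint: &CheckpointSpec{
		Loader: &ClientSpec{Image: "my-gms-loader:latest"},
		Saver:  &ClientSpec{Image: "my-gms-saver:latest"},
	}}
	out := specForAutoSave(svc)
	fmt.Println(out.Checkpoint.Loader == nil, out.Checkpoint.Saver.Image)
}
```
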
Restore flow
- When a service restores from a GMS-enabled checkpoint, the resolved checkpoint info is overlaid with the service's gpuMemoryService.checkpoint.loader override before pod generation.
- This is applied in both DGD/Grove and standalone DCD/non-Grove controller paths.
- A service-level loader override does not turn a non-GMS checkpoint into a GMS restore; the referenced/resolved DynamoCheckpoint must still be GMS-enabled.
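
The restore-time overlay, including the rule that a loader override never converts a non-GMS checkpoint into a GMS restore, might look roughly like this. `ResolvedCheckpoint` and `overlayServiceLoader` are hypothetical names with simplified types; the helper actually added in this PR is `overlayServiceGMSRestoreLoader`, whose signature is not shown here.

```go
package main

import "fmt"

// Hypothetical, trimmed-down versions of the API types.
type ClientSpec struct{ Image string }
type CheckpointSpec struct{ Loader, Saver *ClientSpec }
type GMSSpec struct {
	Enabled    bool
	Checkpoint *CheckpointSpec
}

// ResolvedCheckpoint stands in for the checkpoint info the controller
// resolves before generating pods.
type ResolvedCheckpoint struct {
	Name string
	GMS  *GMSSpec
}

// overlayServiceLoader applies the consuming service's loader override to the
// resolved checkpoint, but only when that checkpoint is already GMS-enabled.
func overlayServiceLoader(resolved ResolvedCheckpoint, svc *GMSSpec) ResolvedCheckpoint {
	if resolved.GMS == nil || !resolved.GMS.Enabled {
		return resolved // a loader override never enables GMS by itself
	}
	if svc == nil || svc.Checkpoint == nil || svc.Checkpoint.Loader == nil {
		return resolved
	}
	gms := *resolved.GMS
	ckpt := CheckpointSpec{}
	if gms.Checkpoint != nil {
		ckpt = *gms.Checkpoint // keep any saver recorded at creation time
	}
	loader := *svc.Checkpoint.Loader
	ckpt.Loader = &loader
	gms.Checkpoint = &ckpt
	resolved.GMS = &gms
	return resolved
}

func main() {
	svc := &GMSSpec{Enabled: true, Checkpoint: &CheckpointSpec{Loader: &ClientSpec{Image: "my-gms-loader:latest"}}}
	gmsCkpt := ResolvedCheckpoint{Name: "llama-gms-checkpoint", GMS: &GMSSpec{Enabled: true}}
	plain := ResolvedCheckpoint{Name: "plain-checkpoint"}

	fmt.Println(overlayServiceLoader(gmsCkpt, svc).GMS.Checkpoint.Loader.Image)
	fmt.Println(overlayServiceLoader(plain, svc).GMS == nil)
}
```
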
Admission validation
- Shared service validation rejects gpuMemoryService.checkpoint.{loader,saver} unless GMS is enabled.
- Standalone DynamoCheckpoint validation rejects checkpoint.loader, because a DynamoCheckpoint Job only saves.
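
The two admission rules compose along these lines. This is a sketch under assumptions: the types are simplified, "GMS is enabled" is modeled as a boolean `Enabled` field as in the v1alpha1 examples, the function names are illustrative, and the loader-rejection message is invented for the example (only the `requires gpuMemoryService.enabled=true` wording appears in the diff).

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, trimmed-down versions of the API types.
type ClientSpec struct{ Image string }
type CheckpointSpec struct{ Loader, Saver *ClientSpec }
type GMSSpec struct {
	Enabled    bool
	Checkpoint *CheckpointSpec
}

// hasClientOverride reports whether a loader or saver override is present.
// An empty checkpoint object counts as "no override".
func hasClientOverride(gms *GMSSpec) bool {
	return gms != nil && gms.Checkpoint != nil &&
		(gms.Checkpoint.Loader != nil || gms.Checkpoint.Saver != nil)
}

// validateShared is the rule shared by services and DynamoCheckpoint:
// overrides are only meaningful when GMS is enabled.
func validateShared(path string, gms *GMSSpec) error {
	if hasClientOverride(gms) && !gms.Enabled {
		return fmt.Errorf("%s.gpuMemoryService.checkpoint requires gpuMemoryService.enabled=true", path)
	}
	return nil
}

// validateDynamoCheckpoint adds the DynamoCheckpoint-only rule: a checkpoint
// Job only saves, so a loader override is always rejected.
func validateDynamoCheckpoint(gms *GMSSpec) error {
	if gms != nil && gms.Checkpoint != nil && gms.Checkpoint.Loader != nil {
		return errors.New("spec.gpuMemoryService.checkpoint.loader is not allowed on DynamoCheckpoint") // illustrative message
	}
	return validateShared("spec", gms)
}

func main() {
	loaderOnly := &GMSSpec{Enabled: true, Checkpoint: &CheckpointSpec{Loader: &ClientSpec{Image: "x"}}}
	fmt.Println(validateShared("spec", loaderOnly))   // accepted on a service
	fmt.Println(validateDynamoCheckpoint(loaderOnly)) // rejected on a DynamoCheckpoint
}
```
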
Tests
- Added/updated API conversion tests for the new v1beta1 fields.
- Added controller coverage for auto-created checkpoint saver propagation and restore-time service loader propagation.
- Added checkpoint client tests for default behavior, override layering, env/envFromSecret/volumeMount behavior, and operator-owned env handling.
- Added webhook validation coverage for the new rules.

Example schemas

1. DGD auto checkpoint + custom saver

User submits a DGD with checkpoint mode: Auto, GMS enabled, and a service-level saver override:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: llama-auto-save
spec:
  backendFramework: vllm
  components:
    - name: worker
      type: worker
      podTemplate:
        spec:
          containers:
            - name: main
              image: my-vllm-worker:latest
      experimental:
        checkpoint:
          mode: Auto
          identity:
            model: meta-llama/Llama-3.1-8B-Instruct
            backendFramework: vllm
            tensorParallelSize: 1
            pipelineParallelSize: 1
        gpuMemoryService:
          mode: IntraPod
          checkpoint:
            saver:
              image: my-gms-saver:latest
              command: ["/bin/gms-save"]
              envs:
                - name: GMS_TRANSFER_BACKEND
                  value: nixl-gds
              envFromSecret: gms-save-secret
              volumeMounts:
                - name: extra-save-config
                  mountPath: /etc/gms-save

The DGD controller auto-creates a DynamoCheckpoint carrying the saver config. The loader is not copied into this CR because the checkpoint Job only saves:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: checkpoint-<identity-hash>
spec:
  identity:
    model: meta-llama/Llama-3.1-8B-Instruct
    backendFramework: vllm
    tensorParallelSize: 1
    pipelineParallelSize: 1
  gpuMemoryService:
    enabled: true
    mode: intraPod
    checkpoint:
      saver:
        image: my-gms-saver:latest
        command: ["/bin/gms-save"]
        envs:
          - name: GMS_TRANSFER_BACKEND
            value: nixl-gds
        envFromSecret: gms-save-secret
        volumeMounts:
          - name: extra-save-config
            mountPath: /etc/gms-save
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: my-vllm-worker:latest

2. DGD restore + custom service loader

User submits a DGD that restores from a GMS-enabled checkpoint and sets a service-level loader override:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: llama-restore
spec:
  backendFramework: vllm
  components:
    - name: worker
      type: worker
      podTemplate:
        spec:
          containers:
            - name: main
              image: my-vllm-worker:latest
      experimental:
        checkpoint:
          mode: Manual
          checkpointRef: llama-gms-checkpoint
        gpuMemoryService:
          mode: IntraPod
          checkpoint:
            loader:
              image: my-gms-loader:latest
              command: ["/bin/gms-load"]
              envs:
                - name: GMS_TRANSFER_BACKEND
                  value: nixl-gds
              envFromSecret: gms-load-secret
              volumeMounts:
                - name: extra-load-config
                  mountPath: /etc/gms-load

The referenced checkpoint must be GMS-enabled. It may have a saver override from creation time, but the restore loader is read from the consuming service above:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: llama-gms-checkpoint
spec:
  identity:
    model: meta-llama/Llama-3.1-8B-Instruct
    backendFramework: vllm
  gpuMemoryService:
    enabled: true
    mode: intraPod
    checkpoint:
      saver:
        image: my-gms-saver:latest
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: my-vllm-worker:latest
status:
  phase: Ready

At pod-generation time, the controller resolves llama-gms-checkpoint, sees that it is GMS-enabled, then overlays the DGD service's loader override onto the generated gms-loader client container.

3. Manual DynamoCheckpoint with custom saver

For manual checkpoint creation, users can put saver directly on the DynamoCheckpoint:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: llama-manual-save
spec:
  identity:
    model: meta-llama/Llama-3.1-8B-Instruct
    backendFramework: vllm
  gpuMemoryService:
    enabled: true
    mode: intraPod
    checkpoint:
      saver:
        image: my-gms-saver:latest
        command: ["/bin/gms-save"]
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: my-vllm-worker:latest

gpuMemoryService.checkpoint.loader is rejected on DynamoCheckpoint; loader config belongs on the DGD/DCD service that restores from the checkpoint.

Design choices

Saver lives on DynamoCheckpoint: save-time behavior is reconciled by the checkpoint controller from the DynamoCheckpoint CR, so DGD auto mode stamps the saver override onto the generated CR.
Loader lives on the consuming service: restore-time behavior depends on the workload pod doing the restore, so the service's loader override is overlaid after resolving the checkpoint.
v1beta1 plumbing is required: DGD/DCD controllers reconcile v1beta1 objects even when users submit v1alpha1 YAML, so the new fields need v1beta1 types and conversion to survive to reconcile time.
Additive / opt-in only: nil or empty override specs leave operator-generated client containers unchanged.
Operator-owned identities stay fixed: container names remain gms-loader and gms-saver; users customize fields on those containers rather than replacing the containers wholesale.
Full command override: command is the entire argv. There is no separate args field and no hidden python3 -m ... prefix when a custom command is provided.
Layer on defaults: the operator keeps adding the GMS socket mount/env, DRA claim, checkpoint PVC mount, and default checkpoint directory; user configuration is layered on top except for operator-owned GMS_SOCKET_DIR.

Out of scope / follow-ups

This PR intentionally does not clean up the older GMS checkpoint storage path. Follow-up work can remove or redesign the remaining built-in assumptions, including:

GMS_CHECKPOINT_DIR injection and resolveGMSArtifactDir().
GMS use of DiscoverAndResolveStorage() / InjectCheckpointVolume() for the operator-owned checkpoint PVC path.
Python-side CLI/env fallback and transfer-backend support.
Any broader cleanup around the temporary GMS snapshot gate.

Validation

make manifests
uv run --no-project --with 'pydantic>=2.11.4,<2.13' --with pyyaml --with black make generate-pydantic
go test -buildvcs=false $(go list -buildvcs=false ./... | grep -v /e2e | grep -v /test | grep -v /cmd) from deploy/operator
go test ./internal/controller ./api/v1alpha1 ./api/v1beta1 ./internal/checkpoint ./internal/webhook/validation ./internal/dynamo from deploy/operator

Note: go test ./... includes e2e tests and failed locally because kubectl attempted to use an expired Teleport kube context; the non-e2e operator package tests above pass.

Summary by CodeRabbit

Release Notes

New Features
- Added GPU Memory Service checkpoint client customization support, enabling override of loader (restore) and saver (checkpoint) client containers with custom images, commands, environment variables, and volume mounts.

Purely additive schema change: introduces GMSCheckpointSpec (loader + saver) and GMSSidecarSpec (image, command, envs, envFromSecret, volumeMounts) and nests an optional Checkpoint pointer on GPUMemoryServiceSpec. No operator code yet reads these fields; behavior is byte-identical to before.

Follow-up commits wire the injection path (applyGMSSidecarSpec helper layered on the existing defaults), add the webhook D3 rule (checkpoint.{loader,saver} requires gpuMemoryService.enabled=true), and reject checkpoint.loader on DynamoCheckpoint (Jobs only save).

Per locked design choice C2 there is no separate Args field; Command is the full container argv. Container names stay operator-owned (gms-loader / gms-saver) for compatibility with the idempotent removeGMSManagedContainers strip path landed in #9514.

Regenerated artifacts:
- zz_generated.deepcopy.go: DeepCopy{Into} for the two new structs; GPUMemoryServiceSpec.DeepCopyInto now handles the Checkpoint pointer.
- config/crd/bases/{dynamocheckpoints,dynamocomponentdeployments,dynamographdeployments}.yaml: new optional checkpoint schema branch.
- deploy/helm/charts/.../crds/: synced copies from the same make target.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

…r/saver

Adds applyGMSSidecarSpec, which takes the operator-built default loader or saver container and overlays user-supplied fields (image, command, envs, envFromSecret, volumeMounts) when GPUMemoryService.Checkpoint.{Loader,Saver} is non-nil. When nil, behavior is byte-identical to before: GMS_CHECKPOINT_DIR is still set, the checkpoint PVC is still mounted, the default python3 -m gpu_memory_service.cli.snapshot.{loader,saver} command is still used, and the container name stays operator-owned.

Merge semantics mirror dynamo.mergeFrontendSidecarDefaults:
- Image and Command are full overrides (when non-empty).
- Envs are merged with user-wins on name collision (local mergeEnvVars duplicates dynamo.MergeEnvs to avoid an import cycle: dynamo already imports checkpoint).
- EnvFromSecret appends an envFrom source.
- VolumeMounts append to (not replace) operator mounts so the gms-intrapod-control mount survives.

Callers updated:
- internal/checkpoint/podspec.go: passes info.GPUMemoryService.Checkpoint to EnsureGMSRestoreSidecars on the restore path.
- internal/controller/checkpoint_job.go: passes ckpt.Spec.GPUMemoryService.Checkpoint to EnsureGMSCheckpointJobSidecars on the standalone DynamoCheckpoint job path.

Pre-existing tests in internal/checkpoint and internal/controller pass unchanged.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
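
To make the append-only parts of these merge semantics concrete, here is a small sketch of the volume-mount and envFromSecret handling. The types and the helper name `appendMountsAndSecret` are hypothetical simplifications, and the mount path used for gms-intrapod-control is a placeholder rather than the operator's real path.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes core types.
type VolumeMount struct{ Name, MountPath string }
type EnvFromSource struct{ SecretName string }

type Container struct {
	VolumeMounts []VolumeMount
	EnvFrom      []EnvFromSource
}

// appendMountsAndSecret adds user mounts after the operator mounts and adds
// one envFrom source for the secret. An empty secret name is skipped because
// it would not be a legal reference.
func appendMountsAndSecret(base Container, mounts []VolumeMount, secret string) Container {
	merged := append([]VolumeMount(nil), base.VolumeMounts...)
	base.VolumeMounts = append(merged, mounts...)
	if secret != "" {
		from := append([]EnvFromSource(nil), base.EnvFrom...)
		base.EnvFrom = append(from, EnvFromSource{SecretName: secret})
	}
	return base
}

func main() {
	def := Container{VolumeMounts: []VolumeMount{
		{Name: "gms-intrapod-control", MountPath: "/placeholder/gms"}, // placeholder path
	}}
	out := appendMountsAndSecret(def,
		[]VolumeMount{{Name: "extra-save-config", MountPath: "/etc/gms-save"}},
		"gms-save-secret")
	fmt.Printf("%+v\n", out)
}
```
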

…and reject loader on DynamoCheckpoint

Two locked design rules from the GMS sidecar pluggability plan:

D3 (DGD + DynamoCheckpoint via SharedSpecValidator): gpuMemoryService.checkpoint.loader or .saver is only meaningful when gpuMemoryService.enabled=true, because the operator only injects the loader/saver sidecars on the GMS path. Enforced as a new method validateGMSCheckpointSidecars on SharedSpecValidator. Empty checkpoint: {} (no loader, no saver) is still accepted; the field is purely additive and a literal empty object opts into nothing.

DynamoCheckpoint-specific (rejects loader on Jobs): A DynamoCheckpoint is a save-side Job; restore is reconciled onto worker pods from the consuming service's ServiceCheckpointConfig. Setting checkpoint.loader on a DynamoCheckpoint is always a configuration error. Added validateDynamoCheckpointCheckpointSidecars in dynamocheckpoint_handler.go and chained behind the existing GMS-snapshot env-gate via a new validateDynamoCheckpoint entry point.

This composes with (does not replace) the GMS snapshot feature-gate check from #8829 (ValidateGMSSnapshotGate / DYN_OPERATOR_ALLOW_GMS_SNAPSHOT): the gate answers "is GMS+snapshot admissible at all?"; D3 and the loader rejection answer "given it is admissible, does the spec internally make sense?".

Tests:
- shared_test.go: 4 new D3 cases (empty checkpoint accepted, loader with enabled=true accepted, loader without enabled rejected, saver without enabled rejected).
- dynamocheckpoint_test.go (new): table-driven cases for the standalone handler: loader always rejected, saver requires enabled=true, plus a composition test that proves the loader rule fires even with the GMS snapshot env-gate satisfied.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

New gms_test.go exercises the merge semantics added by applyGMSSidecarSpec. The existing checkpoint_test.go is left untouched; this file owns the new opt-in cases.

Loader path (EnsureGMSRestoreSidecars):
- Nil checkpoint spec: byte-identical to pre-PR behavior (default image, default command, GMS_CHECKPOINT_DIR set, checkpoint PVC mounted).
- Image override: user image wins; command/env/mounts unchanged.
- Command override: user argv fully replaces default; no implicit python3 -m prefix (locked C2).
- Envs merge: user wins on name collision with operator-set vars; new vars appended; GMS_SOCKET_DIR preserved.
- VolumeMounts append: operator gms-intrapod-control and checkpoint-PVC mounts both preserved; user mount appears alongside.
- EnvFromSecret: appended as a single envFrom source.

Saver path (EnsureGMSCheckpointJobSidecars):
- Nil checkpoint spec: byte-identical default saver injection.
- Saver override: full Image+Command+Envs+VolumeMounts+EnvFromSecret merge; cross-check that Loader on the same spec is never accidentally read on the Job path.

applyGMSSidecarSpec direct unit tests:
- nil spec is a no-op (returns base unchanged).
- empty &GMSSidecarSpec{} is also a no-op (every field empty falls through; this covers the literal "checkpoint.loader: {}" footgun cited in the design plan).
- empty EnvFromSecret string is ignored: it does not render a malformed envFrom source despite the lean-config stance, because an empty secret name is not a legal Kubernetes reference.

All tests run with the existing test scaffolding (testHash, findContainer, testPodSpec analog). No envtest dependency.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

copy-pr-bot · 2026-05-18T21:54:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

coderabbitai · 2026-05-18T22:15:48Z

Walkthrough

This PR extends the GPU Memory Service (GMS) API to support optional checkpoint sidecar overrides (loader for restore, saver for save Jobs). Users can now customize GMS sidecar image, command, environment variables, and volume mounts without modifying operator defaults. Changes span API schema definition, CRD validation, conversion logic, injection helpers, controller integration, and admission validation.

Changes

GPU Memory Service Checkpoint Sidecar Override Configuration

Layer / File(s)	Summary
API type definitions for checkpoint overrides `deploy/operator/api/v1alpha1/common.go`, `deploy/operator/api/v1beta1/common.go`	Added `GMSCheckpointSpec` (with optional `Loader` and `Saver` sidecar overrides) and `GMSSidecarSpec` (with optional `Image`, `Command`, `Envs`, `EnvFromSecret`, `VolumeMounts`) to both API versions. Added `Checkpoint` field to `GPUMemoryServiceSpec`.
CRD schema validation for checkpoint configuration `deploy/helm/charts/platform/components/operator/crds/dynamocheckpoints.yaml`, `deploy/operator/config/crd/bases/dynamocheckpoints.yaml`, `deploy/helm/charts/.../crds/dynamocomponentdeployments.yaml`, `deploy/operator/config/crd/bases/dynamocomponentdeployments.yaml`, `deploy/helm/charts/.../crds/dynamographdeployments.yaml`, `deploy/operator/config/crd/bases/dynamographdeployments.yaml`	Declared `spec.gpuMemoryService.checkpoint` schema with nested `loader` and `saver` override configuration in all three CRD types (both helm chart and operator config base paths), validating command arrays, envFromSecret references, envs with full `valueFrom` sources (configMapKeyRef, fieldRef, fileKeyRef, resourceFieldRef, secretKeyRef), image overrides, and volumeMounts.
Version conversion and deep-copy support `deploy/operator/api/v1alpha1/shared_spec_conversion.go`, `deploy/operator/api/v1alpha1/zz_generated.deepcopy.go`, `deploy/operator/api/v1beta1/zz_generated.deepcopy.go`, `deploy/operator/api/v1alpha1/conversion_field_coverage_test.go`, `deploy/operator/api/v1alpha1/dynamocomponentdeployment_conversion_test.go`, `deploy/operator/api/v1alpha1/dynamographdeployment_conversion_test.go`	Implemented `ConvertFromGMSCheckpointSpec`, `ConvertToGMSCheckpointSpec`, and sidecar conversion helpers with proper deep-copying of slice fields. Added autogenerated `DeepCopyInto` and `DeepCopy` methods for checkpoint and sidecar types. Extended fixture data and added comprehensive v1alpha1↔v1beta1 round-trip test validating checkpoint loader/saver payload preservation.
Sidecar override injection logic `deploy/operator/internal/checkpoint/gms.go`, `deploy/operator/internal/checkpoint/gms_test.go`	Extended `EnsureGMSRestoreSidecars` and `EnsureGMSCheckpointJobSidecars` signatures to accept optional `*GMSCheckpointSpec`. Implemented `applyGMSSidecarSpec` helper to layer override image, command, envs (with user-wins merging except for operator-owned `GMS_SOCKET_DIR`), envFrom secrets, and volume mounts onto default containers. Added 11 unit tests covering nil spec defaults, image/command overrides, env merging semantics, volume mount appending, and edge cases.
Controller integration and overlay helpers `deploy/operator/internal/checkpoint/podspec.go`, `deploy/operator/internal/controller/checkpoint_job.go`, `deploy/operator/internal/controller/gms_checkpoint_overrides.go`, `deploy/operator/internal/controller/dynamocomponentdeployment_controller.go`, `deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go`, `deploy/operator/internal/controller/dynamographdeployment_controller.go`, `deploy/operator/internal/controller/dynamographdeployment_controller_test.go`	Wired checkpoint spec into pod creation by passing it to sidecar ensure functions. Added `overlayServiceGMSRestoreLoader` helper to compose service-level GMS loader overrides onto resolved checkpoint configs, and `gmsSpecForAutoCheckpointSave` to strip loader overrides from auto-save configs (loader is restore-only). Updated DynamoComponentDeployment and DynamoGraphDeployment controllers to apply loader overlays during checkpoint resolution. Added 3 integration tests validating loader/saver override preservation through controller logic.
Admission validation for checkpoint sidecar configuration `deploy/operator/internal/webhook/validation/dynamocheckpoint_handler.go`, `deploy/operator/internal/webhook/validation/dynamocheckpoint_test.go`, `deploy/operator/internal/webhook/validation/shared.go`, `deploy/operator/internal/webhook/validation/shared_test.go`	Added `validateDynamoCheckpointCheckpointSidecars` enforcing that loaders are rejected on `DynamoCheckpoint` resources (loader is for restore phase only) and savers require `gpuMemoryService.enabled=true`. Added `validateGMSCheckpointSidecars` to `SharedSpecValidator` for general loader/saver enable checks. Composed validation into `validateDynamoCheckpoint` orchestrator. Added 8 unit tests covering accept/reject logic and composition with GMS snapshot feature gate.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description is comprehensive and well-structured, covering overview, details, size notes, design choices, examples, validation steps, and references related issues. However, it does not explicitly include a 'Related Issues' section with action keywords as specified in the template.
Title check	✅ Passed	The PR title 'feat(operator): add GMSCheckpointSpec for loader/saver checkpoint clients' accurately describes the main feature addition, using conventional commit format. It directly references the primary API type and use-case added in the PR.


coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocheckpoints.yaml`:
- Around line 76-83: The CRD currently exposes checkpoint.loader but the webhook
forbids it for DynamoCheckpoint; add a CRD-level CEL validation to reject it by
adding an x-kubernetes-validations (CEL) rule under the DynamoCheckpoint CRD's
openAPIV3Schema that asserts spec.gpuMemoryService.checkpoint.loader is null (or
equivalent “not set”) and provides a clear message; target the DynamoCheckpoint
kind/schema in nvidia.com_dynamocheckpoints.yaml and reference the property path
spec.gpuMemoryService.checkpoint.loader and the property name loader to ensure
the CRD schema and kubectl explain/OpenAPI docs match the webhook behavior.
- Around line 97-245: The GMSSidecarSpec's envs and volumeMounts lists are
missing Kubernetes list-map markers; update the envs array schema (property
name: envs under GMSSidecarSpec) to include x-kubernetes-list-type: map and
x-kubernetes-list-map-keys: ["name"], and update the volumeMounts array schema
(property name: volumeMounts under GMSSidecarSpec) to include
x-kubernetes-list-type: map and x-kubernetes-list-map-keys: ["mountPath"]; apply
the same markers to the other referenced sections (the other GMSSidecarSpec-like
arrays noted in the review) and then regenerate the CRD output so
server-side-apply and dup detection align with native Container.env and
Container.volumeMounts semantics.

In `@deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml`:
- Around line 76-84: Update the CRD description for checkpoint.loader to clearly
state that spec.gpuMemoryService.checkpoint.loader is not accepted on
DynamoCheckpoint and will be rejected by the webhook; instead indicate it must
be configured on the consuming service (e.g., set on the service-specific
resource or injection point). Edit the description under
properties.checkpoint.loader to explicitly mention "Not valid on
DynamoCheckpoint (rejected by webhook); set on the consuming service instead"
and, if applicable, reference the exact field name
spec.gpuMemoryService.checkpoint.loader and the DynamoCheckpoint validation
behavior so kubectl explain and generated docs are accurate.
- Around line 97-245: The envs and volumeMounts override schema blocks (the envs
array item schema and the volumeMounts array item schema used for loader and
saver) are missing Kubernetes list-map markers; add x-kubernetes-list-type:
"map" and x-kubernetes-list-map-keys: ["name"] to the envs item object schema,
and x-kubernetes-list-type: "map" and x-kubernetes-list-map-keys: ["mountPath"]
to the volumeMounts item object schema (apply to both loader and saver
occurrences) so server-side-apply and duplicate-key semantics match native Pod
fields.

In `@deploy/operator/internal/checkpoint/gms.go`:
- Line 11: The merge currently sorts merged env vars (see the removed "sort"
import) which breaks Kubernetes env expansion order; change the merging logic so
it preserves the base env slice order and applies overrides by key in-place
(replace existing entries when keys match, append new keys to the end) instead
of alphabetically sorting the result; update the function that builds/returns
the merged env slice (the env-merge code path where "sort" was used) to perform
keyed replace/append while keeping original ordering and remove any sorting
step.
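
The ordering concern matters because Kubernetes only expands `$(VAR)` references to variables defined earlier in the container's env list, so an alphabetical sort can silently break a dependent variable. A keyed replace/append merge that keeps the base order could be sketched as below; `mergeEnvPreservingOrder` is an illustrative name, `EnvVar` is a simplified stand-in for the core type, and this is not the fix as committed in the PR.

```go
package main

import "fmt"

// EnvVar is a simplified stand-in for the Kubernetes core type.
type EnvVar struct{ Name, Value string }

// mergeEnvPreservingOrder replaces base entries in place when an override has
// the same name and appends new names at the end. No sorting is performed, so
// base variables keep their relative order.
func mergeEnvPreservingOrder(base, overrides []EnvVar) []EnvVar {
	merged := append([]EnvVar(nil), base...)
	index := make(map[string]int, len(merged))
	for i, e := range merged {
		index[e.Name] = i
	}
	for _, e := range overrides {
		if i, ok := index[e.Name]; ok {
			merged[i] = e // user wins on collision, position unchanged
			continue
		}
		index[e.Name] = len(merged)
		merged = append(merged, e)
	}
	return merged
}

func main() {
	base := []EnvVar{
		{Name: "ROOT_DIR", Value: "/data"},
		{Name: "CACHE_DIR", Value: "$(ROOT_DIR)/cache"}, // depends on ROOT_DIR appearing first
	}
	overrides := []EnvVar{
		{Name: "ROOT_DIR", Value: "/mnt"},
		{Name: "GMS_TRANSFER_BACKEND", Value: "nixl-gds"},
	}
	for _, e := range mergeEnvPreservingOrder(base, overrides) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

With an alphabetical sort, CACHE_DIR would move ahead of ROOT_DIR and its reference would no longer resolve; the keyed merge avoids that.
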


📥 Commits

Reviewing files that changed from the base of the PR and between 1a86523 and 3c34ea8.

📒 Files selected for processing (27)

deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocheckpoints.yaml
deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocomponentdeployments.yaml
deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamographdeployments.yaml
deploy/operator/api/v1alpha1/common.go
deploy/operator/api/v1alpha1/conversion_field_coverage_test.go
deploy/operator/api/v1alpha1/dynamocomponentdeployment_conversion_test.go
deploy/operator/api/v1alpha1/dynamographdeployment_conversion_test.go
deploy/operator/api/v1alpha1/shared_spec_conversion.go
deploy/operator/api/v1alpha1/zz_generated.deepcopy.go
deploy/operator/api/v1beta1/common.go
deploy/operator/api/v1beta1/zz_generated.deepcopy.go
deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml
deploy/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/operator/internal/checkpoint/gms.go
deploy/operator/internal/checkpoint/gms_test.go
deploy/operator/internal/checkpoint/podspec.go
deploy/operator/internal/controller/checkpoint_job.go
deploy/operator/internal/controller/dynamocomponentdeployment_controller.go
deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go
deploy/operator/internal/controller/dynamographdeployment_controller.go
deploy/operator/internal/controller/dynamographdeployment_controller_test.go
deploy/operator/internal/controller/gms_checkpoint_overrides.go
deploy/operator/internal/webhook/validation/dynamocheckpoint_handler.go
deploy/operator/internal/webhook/validation/dynamocheckpoint_test.go
deploy/operator/internal/webhook/validation/shared.go
deploy/operator/internal/webhook/validation/shared_test.go

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

galletas1712 · 2026-05-19T00:31:18Z

/ok to test 06575fa

github-actions · 2026-05-19T00:33:45Z

🌿 Fern Docs Preview: https://nvidia-preview-f13a63e4-a2a5-4d62-a6a5-5d4286747d18.docs.buildwithfern.com/dynamo/dev

julienmancuso · 2026-05-19T00:22:43Z

+type GMSCheckpointSpec struct {
+	// loader configures the client that loads checkpoint artifacts on restore.
+	// +optional
+	Loader *GMSClientSpec `json:"loader,omitempty"`


wondering if we shouldn't use PodTemplateSpec here
what if you need to add annotation / labels / tolerations, ..... to the loader and/or the saver in the future ?

I was debating this as well. It's because GMSClientSpec is currently shaped for a sidecar container rather than a sidecar pod. The gms-saver should always be a sidecar container (since it is launched inside a checkpoint Job and isn't managed by Grove), which means that on the checkpoint side we only allow intrapod GMS. The gms-loader, by contrast, could be launched either as a sidecar container or as another pod.

So on second thought, maybe something like this works better?

type GMSCheckpointSpec struct {
	Loader *GMSCheckpointLoaderSpec `json:"loader,omitempty"`
	Saver  *GMSCheckpointSaverSpec  `json:"saver,omitempty"`
}

type GMSCheckpointLoaderSpec struct {
	GMSClientSpec `json:",inline"`
	// Future:
	// PodTemplate *corev1.PodTemplateSpec `json:"podTemplate,omitempty"`
}

type GMSCheckpointSaverSpec struct {
	GMSClientSpec `json:",inline"`
}

But then I'm not sure if this is too many layers of indirection.
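
On the indirection question, one relevant data point is that the wrapper types would not change the serialized shape: Go's encoding/json flattens an embedded struct that has no explicit JSON name, so users would still write `loader.image` rather than an extra nested key. The sketch below checks that with a trimmed-down `GMSClientSpec` (only two fields, an assumption for brevity); the `encode` helper is illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GMSClientSpec is trimmed to two fields for this illustration.
type GMSClientSpec struct {
	Image   string   `json:"image,omitempty"`
	Command []string `json:"command,omitempty"`
}

// The wrappers embed GMSClientSpec; encoding/json inlines embedded structs
// that carry no explicit JSON name.
type GMSCheckpointLoaderSpec struct {
	GMSClientSpec `json:",inline"`
}

type GMSCheckpointSaverSpec struct {
	GMSClientSpec `json:",inline"`
}

type GMSCheckpointSpec struct {
	Loader *GMSCheckpointLoaderSpec `json:"loader,omitempty"`
	Saver  *GMSCheckpointSaverSpec  `json:"saver,omitempty"`
}

// encode marshals the spec to a JSON string, panicking on error to keep the sketch short.
func encode(spec GMSCheckpointSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	spec := GMSCheckpointSpec{
		Loader: &GMSCheckpointLoaderSpec{GMSClientSpec{Image: "my-gms-loader:latest"}},
	}
	fmt.Println(encode(spec))
}
```

So the extra layer would cost Go-side type names only, while leaving room to add loader-only fields later.
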

pod or container? Reads like container right now.

yes container, not pod

saver is always a container, but the loader can be either a container or a pod depending on whether we use intrapod or interpod GMS.

julienmancuso · 2026-05-19T00:36:24Z

+	// Loader applies to restore pods; Saver applies to checkpoint Jobs.
+	// Requires Enabled=true (enforced by webhook).
+	// +optional
+	Checkpoint *GMSCheckpointSpec `json:"checkpoint,omitempty"`


I was going to ask : why do you add this to v1alpha1 ?
but that's because this GpuMemoryServiceSpec is used in the DynamoCheckpoint CR which is v1alpha1 only ...

Also we should always add to all versions. And reusing types is also fine.

sttts · 2026-05-19T08:59:56Z

+	}
+	if !v.spec.GPUMemoryService.Enabled {
+		return fmt.Errorf(
+			"%s.gpuMemoryService.checkpoint requires gpuMemoryService.enabled=true",


couldn't this be expressed via CEL without any code?

julienmancuso · 2026-05-19T15:58:44Z

 		return nil, err
 	}

+	// D3: checkpoint.{loader,saver} requires gpuMemoryService.enabled=true.


What does D3 mean?

pull-request-size Bot added the size/XXL label May 15, 2026

github-actions Bot added feat deployment::k8s Relates to dynamo deployment in kubernetes and removed feat labels May 15, 2026

galletas1712 added 4 commits May 18, 2026 14:21

galletas1712 force-pushed the schwinns/gms-checkpoint-spec-additive branch from 747f2ee to 1dd8530 Compare May 18, 2026 21:53

fix(operator): wire GMS checkpoint overrides through DGD flows

3c34ea8

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

galletas1712 force-pushed the schwinns/gms-checkpoint-spec-additive branch from 1dd8530 to 3c34ea8 Compare May 18, 2026 22:04

galletas1712 marked this pull request as ready for review May 18, 2026 22:09

galletas1712 requested a review from a team as a code owner May 18, 2026 22:09

coderabbitai Bot reviewed May 18, 2026


api: rename GMS checkpoint client spec

43306f4

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

galletas1712 changed the title ~~feat(operator): add GMSCheckpointSpec for loader/saver sidecar overrides~~ feat(operator): add GMSCheckpointSpec for loader/saver checkpoint clients May 18, 2026

api: generalize GMS client spec name

3d36641

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

galletas1712 changed the title ~~feat(operator): add GMSCheckpointSpec for loader/saver checkpoint clients~~ feat(operator): add GMSCheckpointSpec for loader/saver GMS clients May 18, 2026

galletas1712 added 2 commits May 18, 2026 16:27

fix(operator): preserve GMS client env ordering

dec1161

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

docs(operator): regenerate API reference for GMS checkpoint clients

06575fa

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

github-actions Bot added the documentation Improvements or additions to documentation label May 19, 2026

copy-pr-bot Bot temporarily deployed to GITLAB May 19, 2026 00:31 Inactive

julienmancuso reviewed May 19, 2026


copy-pr-bot Bot temporarily deployed to GITLAB May 19, 2026 02:10 Inactive

sttts reviewed May 19, 2026


hhzhang16 approved these changes May 19, 2026


julienmancuso reviewed May 19, 2026


4 participants
