
feat(operator): add GMSCheckpointSpec for loader/saver GMS clients#9641

Open
galletas1712 wants to merge 9 commits into main from schwinns/gms-checkpoint-spec-additive


@galletas1712 (Contributor) commented May 15, 2026

Summary

Adds the GMS checkpoint client override API and wires it through the v1beta1 controller paths so custom GMS saver/loader client containers work for both auto-created checkpoints and restore-time services.

Users can now configure:

  • gpuMemoryService.checkpoint.saver for the gms-saver checkpoint client used by checkpoint Jobs.
  • gpuMemoryService.checkpoint.loader for the gms-loader checkpoint client used by restore-target pods.

If no override is provided, the operator builds the same default GMS saver/loader client containers as before.

Size / review notes

Most of the apparent diff size is generated CRD YAML: the new GMS client spec includes Kubernetes EnvVar and VolumeMount types, so controller-gen expands those schemas in both the operator CRDs and copied Helm CRDs.

The hand-written logic is concentrated in:

  • API shape + conversion: api/v1alpha1/common.go, api/v1beta1/common.go, api/v1alpha1/shared_spec_conversion.go
  • Checkpoint client layering: internal/checkpoint/gms.go
  • DGD/DCD controller propagation: internal/controller/gms_checkpoint_overrides.go plus the two reconcile call sites
  • Validation/tests

What changed

  • API / CRDs

    • Added GPUMemoryServiceSpec.Checkpoint in both v1alpha1 and v1beta1.
    • Added GMSCheckpointSpec with independent loader and saver specs.
    • Added GMSClientSpec fields for image, full command, envs, envFromSecret, and volumeMounts.
    • Added v1alpha1 <-> v1beta1 conversion and regenerated deepcopy + operator/Helm CRDs.
  • GMS checkpoint client injection

    • EnsureGMSRestoreSidecars layers checkpoint.loader onto the default gms-loader client container.
    • EnsureGMSCheckpointJobSidecars layers checkpoint.saver onto the default gms-saver client container.
    • The operator still constructs the default container first, then overlays user-provided fields.
    • GMS_SOCKET_DIR remains operator-owned so the injected clients always use the operator-managed UDS mount.
  • DGD auto checkpoint flow

    • When a DGD in auto checkpoint mode creates a DynamoCheckpoint, the generated CR now carries the service's GMS saver override.
    • Loader overrides are intentionally stripped from the generated DynamoCheckpoint, because checkpoint Jobs only save; restore-time loader configuration belongs to the consuming service.
  • Restore flow

    • When a service restores from a GMS-enabled checkpoint, the resolved checkpoint info is overlaid with the service's gpuMemoryService.checkpoint.loader override before pod generation.
    • This is applied in both DGD/Grove and standalone DCD/non-Grove controller paths.
    • A service-level loader override does not turn a non-GMS checkpoint into a GMS restore; the referenced/resolved DynamoCheckpoint must still be GMS-enabled.
  • Admission validation

    • Shared service validation rejects gpuMemoryService.checkpoint.{loader,saver} unless GMS is enabled.
    • Standalone DynamoCheckpoint validation rejects checkpoint.loader, because a DynamoCheckpoint Job only saves.
  • Tests

    • Added/updated API conversion tests for the new v1beta1 fields.
    • Added controller coverage for auto-created checkpoint saver propagation and restore-time service loader propagation.
    • Added checkpoint client tests for default behavior, override layering, env/envFromSecret/volumeMount behavior, and operator-owned env handling.
    • Added webhook validation coverage for the new rules.
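The API shape and conversion described above can be pictured with a minimal sketch. It assumes simplified stand-ins for the Kubernetes EnvVar and VolumeMount types, a hypothetical helper name (copyClientSpec), and that both API versions share the same field layout; the real types and conversion functions live in the operator's api packages and may differ.

```go
package main

import "fmt"

// EnvVar and VolumeMount are simplified stand-ins for the Kubernetes
// corev1.EnvVar and corev1.VolumeMount types (assumption: two fields each).
type EnvVar struct{ Name, Value string }
type VolumeMount struct{ Name, MountPath string }

// GMSClientSpec mirrors the override fields listed in the PR description.
type GMSClientSpec struct {
	Image         string
	Command       []string
	Envs          []EnvVar
	EnvFromSecret string
	VolumeMounts  []VolumeMount
}

// GMSCheckpointSpec holds independent loader and saver overrides.
type GMSCheckpointSpec struct {
	Loader *GMSClientSpec
	Saver  *GMSClientSpec
}

// copyClientSpec (hypothetical name) deep-copies the slice fields so the
// converted object never aliases memory owned by the source object.
func copyClientSpec(in *GMSClientSpec) *GMSClientSpec {
	if in == nil {
		return nil
	}
	out := *in
	out.Command = append([]string(nil), in.Command...)
	out.Envs = append([]EnvVar(nil), in.Envs...)
	out.VolumeMounts = append([]VolumeMount(nil), in.VolumeMounts...)
	return &out
}

// convertCheckpointSpec stands in for one direction of the
// v1alpha1 <-> v1beta1 conversion; nil stays nil.
func convertCheckpointSpec(in *GMSCheckpointSpec) *GMSCheckpointSpec {
	if in == nil {
		return nil
	}
	return &GMSCheckpointSpec{
		Loader: copyClientSpec(in.Loader),
		Saver:  copyClientSpec(in.Saver),
	}
}

func main() {
	src := &GMSCheckpointSpec{Saver: &GMSClientSpec{
		Image:   "my-gms-saver:latest",
		Command: []string{"/bin/gms-save"},
	}}
	dst := convertCheckpointSpec(src)
	dst.Saver.Command[0] = "/bin/other" // mutating the copy must not touch src
	fmt.Println(src.Saver.Command[0], dst.Loader == nil)
}
```

Copying the slices matters because a round trip that shares backing arrays would let a later mutation of one version silently change the other.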

Example schemas

1. DGD auto checkpoint + custom saver

User submits a DGD with checkpoint mode: Auto, GMS enabled, and a service-level saver override:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: llama-auto-save
spec:
  backendFramework: vllm
  components:
    - name: worker
      type: worker
      podTemplate:
        spec:
          containers:
            - name: main
              image: my-vllm-worker:latest
      experimental:
        checkpoint:
          mode: Auto
          identity:
            model: meta-llama/Llama-3.1-8B-Instruct
            backendFramework: vllm
            tensorParallelSize: 1
            pipelineParallelSize: 1
        gpuMemoryService:
          mode: IntraPod
          checkpoint:
            saver:
              image: my-gms-saver:latest
              command: ["/bin/gms-save"]
              envs:
                - name: GMS_TRANSFER_BACKEND
                  value: nixl-gds
              envFromSecret: gms-save-secret
              volumeMounts:
                - name: extra-save-config
                  mountPath: /etc/gms-save

The DGD controller auto-creates a DynamoCheckpoint carrying the saver config. The loader is not copied into this CR because the checkpoint Job only saves:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: checkpoint-<identity-hash>
spec:
  identity:
    model: meta-llama/Llama-3.1-8B-Instruct
    backendFramework: vllm
    tensorParallelSize: 1
    pipelineParallelSize: 1
  gpuMemoryService:
    enabled: true
    mode: intraPod
    checkpoint:
      saver:
        image: my-gms-saver:latest
        command: ["/bin/gms-save"]
        envs:
          - name: GMS_TRANSFER_BACKEND
            value: nixl-gds
        envFromSecret: gms-save-secret
        volumeMounts:
          - name: extra-save-config
            mountPath: /etc/gms-save
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: my-vllm-worker:latest
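The saver-only propagation in this example can be sketched as follows. This is an illustration under assumptions, not the operator's code: the types are trimmed to two fields and saverOnly is a hypothetical helper name.

```go
package main

import "fmt"

// Simplified stand-ins for the API types; only the fields needed here.
type GMSClientSpec struct {
	Image   string
	Command []string
}

type GMSCheckpointSpec struct {
	Loader *GMSClientSpec
	Saver  *GMSClientSpec
}

// saverOnly (hypothetical name) builds the checkpoint client spec stamped
// onto an auto-created DynamoCheckpoint: the saver is copied, the loader is
// dropped because a checkpoint Job only saves.
func saverOnly(in *GMSCheckpointSpec) *GMSCheckpointSpec {
	if in == nil || in.Saver == nil {
		return nil
	}
	saver := *in.Saver
	saver.Command = append([]string(nil), in.Saver.Command...)
	return &GMSCheckpointSpec{Saver: &saver}
}

func main() {
	svc := &GMSCheckpointSpec{
		Loader: &GMSClientSpec{Image: "my-gms-loader:latest"},
		Saver:  &GMSClientSpec{Image: "my-gms-saver:latest", Command: []string{"/bin/gms-save"}},
	}
	out := saverOnly(svc)
	fmt.Println(out.Saver.Image, out.Loader == nil)
}
```

Returning a fresh struct rather than clearing the loader in place keeps the service's own spec intact for the restore path.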

2. DGD restore + custom service loader

User submits a DGD that restores from a GMS-enabled checkpoint and sets a service-level loader override:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: llama-restore
spec:
  backendFramework: vllm
  components:
    - name: worker
      type: worker
      podTemplate:
        spec:
          containers:
            - name: main
              image: my-vllm-worker:latest
      experimental:
        checkpoint:
          mode: Manual
          checkpointRef: llama-gms-checkpoint
        gpuMemoryService:
          mode: IntraPod
          checkpoint:
            loader:
              image: my-gms-loader:latest
              command: ["/bin/gms-load"]
              envs:
                - name: GMS_TRANSFER_BACKEND
                  value: nixl-gds
              envFromSecret: gms-load-secret
              volumeMounts:
                - name: extra-load-config
                  mountPath: /etc/gms-load

The referenced checkpoint must be GMS-enabled. It may have a saver override from creation time, but the restore loader is read from the consuming service above:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: llama-gms-checkpoint
spec:
  identity:
    model: meta-llama/Llama-3.1-8B-Instruct
    backendFramework: vllm
  gpuMemoryService:
    enabled: true
    mode: intraPod
    checkpoint:
      saver:
        image: my-gms-saver:latest
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: my-vllm-worker:latest
status:
  phase: Ready

At pod-generation time, the controller resolves llama-gms-checkpoint, sees that it is GMS-enabled, then overlays the DGD service's loader override onto the generated gms-loader client container.
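A minimal sketch of that overlay step, assuming a hypothetical ResolvedCheckpoint struct in place of the controller's resolved checkpoint info and a hypothetical function name overlayLoader:

```go
package main

import "fmt"

// Simplified stand-ins for the API types.
type GMSClientSpec struct{ Image string }

type GMSCheckpointSpec struct {
	Loader *GMSClientSpec
	Saver  *GMSClientSpec
}

// ResolvedCheckpoint is a hypothetical stand-in for the checkpoint info the
// controller resolves before generating pods.
type ResolvedCheckpoint struct {
	GMSEnabled bool
	Clients    *GMSCheckpointSpec
}

// overlayLoader returns the resolved info with the service's loader override
// applied. A non-GMS checkpoint is returned unchanged: a loader override
// alone never turns it into a GMS restore.
func overlayLoader(resolved ResolvedCheckpoint, svc *GMSCheckpointSpec) ResolvedCheckpoint {
	if !resolved.GMSEnabled || svc == nil || svc.Loader == nil {
		return resolved
	}
	merged := GMSCheckpointSpec{}
	if resolved.Clients != nil {
		merged = *resolved.Clients // keep any saver recorded at creation time
	}
	loader := *svc.Loader
	merged.Loader = &loader
	resolved.Clients = &merged
	return resolved
}

func main() {
	svc := &GMSCheckpointSpec{Loader: &GMSClientSpec{Image: "my-gms-loader:latest"}}
	gms := ResolvedCheckpoint{
		GMSEnabled: true,
		Clients:    &GMSCheckpointSpec{Saver: &GMSClientSpec{Image: "my-gms-saver:latest"}},
	}
	out := overlayLoader(gms, svc)
	fmt.Println(out.Clients.Loader.Image, out.Clients.Saver.Image)
}
```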

3. Manual DynamoCheckpoint with custom saver

For manual checkpoint creation, users can put saver directly on the DynamoCheckpoint:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: llama-manual-save
spec:
  identity:
    model: meta-llama/Llama-3.1-8B-Instruct
    backendFramework: vllm
  gpuMemoryService:
    enabled: true
    mode: intraPod
    checkpoint:
      saver:
        image: my-gms-saver:latest
        command: ["/bin/gms-save"]
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: my-vllm-worker:latest

gpuMemoryService.checkpoint.loader is rejected on DynamoCheckpoint; loader config belongs on the DGD/DCD service that restores from the checkpoint.
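The two admission rules can be sketched roughly as below. The types are simplified stand-ins, the function names are hypothetical, and the loader-rejection message is invented for illustration; only the "requires gpuMemoryService.enabled=true" wording comes from this PR's review thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the API types.
type GMSClientSpec struct{ Image string }

type GMSCheckpointSpec struct {
	Loader *GMSClientSpec
	Saver  *GMSClientSpec
}

type GPUMemoryServiceSpec struct {
	Enabled    bool
	Checkpoint *GMSCheckpointSpec
}

// validateShared (hypothetical name) rejects loader/saver overrides unless
// GMS is enabled. An empty checkpoint object opts into nothing and passes.
func validateShared(path string, gms *GPUMemoryServiceSpec) error {
	if gms == nil || gms.Checkpoint == nil {
		return nil
	}
	if gms.Checkpoint.Loader == nil && gms.Checkpoint.Saver == nil {
		return nil
	}
	if !gms.Enabled {
		return fmt.Errorf("%s.gpuMemoryService.checkpoint requires gpuMemoryService.enabled=true", path)
	}
	return nil
}

// validateDynamoCheckpoint (hypothetical name) adds the standalone rule:
// a DynamoCheckpoint Job only saves, so a loader override is always an error.
func validateDynamoCheckpoint(gms *GPUMemoryServiceSpec) error {
	if gms != nil && gms.Checkpoint != nil && gms.Checkpoint.Loader != nil {
		// Illustrative message, not the operator's actual text.
		return errors.New("spec.gpuMemoryService.checkpoint.loader is not allowed on DynamoCheckpoint")
	}
	return validateShared("spec", gms)
}

func main() {
	withLoader := &GPUMemoryServiceSpec{
		Enabled:    true,
		Checkpoint: &GMSCheckpointSpec{Loader: &GMSClientSpec{Image: "my-gms-loader:latest"}},
	}
	fmt.Println(validateShared("spec", withLoader))
	fmt.Println(validateDynamoCheckpoint(withLoader))
}
```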

Design choices

  • Saver lives on DynamoCheckpoint: save-time behavior is reconciled by the checkpoint controller from the DynamoCheckpoint CR, so DGD auto mode stamps the saver override onto the generated CR.
  • Loader lives on the consuming service: restore-time behavior depends on the workload pod doing the restore, so the service's loader override is overlaid after resolving the checkpoint.
  • v1beta1 plumbing is required: DGD/DCD controllers reconcile v1beta1 objects even when users submit v1alpha1 YAML, so the new fields need v1beta1 types and conversion to survive to reconcile time.
  • Additive / opt-in only: nil or empty override specs leave operator-generated client containers unchanged.
  • Operator-owned identities stay fixed: container names remain gms-loader and gms-saver; users customize fields on those containers rather than replacing the containers wholesale.
  • Full command override: command is the entire argv. There is no separate args field and no hidden python3 -m ... prefix when a custom command is provided.
  • Layer on defaults: the operator keeps adding the GMS socket mount/env, DRA claim, checkpoint PVC mount, and default checkpoint directory; user configuration is layered on top except for operator-owned GMS_SOCKET_DIR.
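The layering rules in this list can be sketched as a single overlay function. Everything here is an assumption made for illustration: Container, EnvVar, and VolumeMount are simplified stand-ins for the Kubernetes types, applyClientSpec is a hypothetical name, the default image and paths in main are placeholders, and the order-preserving env merge follows the review suggestion on this PR rather than confirmed final code.

```go
package main

import "fmt"

type EnvVar struct{ Name, Value string }
type VolumeMount struct{ Name, MountPath string }

// Container is a simplified stand-in for corev1.Container.
type Container struct {
	Name           string
	Image          string
	Command        []string
	Env            []EnvVar
	EnvFromSecrets []string
	VolumeMounts   []VolumeMount
}

type GMSClientSpec struct {
	Image         string
	Command       []string
	Envs          []EnvVar
	EnvFromSecret string
	VolumeMounts  []VolumeMount
}

const socketDirEnv = "GMS_SOCKET_DIR" // operator-owned; user values are ignored

// applyClientSpec overlays user fields onto the operator-built default
// container. The container name is never touched.
func applyClientSpec(base Container, spec *GMSClientSpec) Container {
	out := base
	out.Command = append([]string(nil), base.Command...)
	out.Env = append([]EnvVar(nil), base.Env...)
	out.EnvFromSecrets = append([]string(nil), base.EnvFromSecrets...)
	out.VolumeMounts = append([]VolumeMount(nil), base.VolumeMounts...)
	if spec == nil {
		return out
	}
	if spec.Image != "" {
		out.Image = spec.Image
	}
	if len(spec.Command) > 0 {
		out.Command = append([]string(nil), spec.Command...) // full argv replacement
	}
	for _, e := range spec.Envs {
		if e.Name == socketDirEnv {
			continue
		}
		replaced := false
		for i := range out.Env {
			if out.Env[i].Name == e.Name {
				out.Env[i] = e // user wins, original position kept
				replaced = true
				break
			}
		}
		if !replaced {
			out.Env = append(out.Env, e)
		}
	}
	if spec.EnvFromSecret != "" {
		out.EnvFromSecrets = append(out.EnvFromSecrets, spec.EnvFromSecret)
	}
	out.VolumeMounts = append(out.VolumeMounts, spec.VolumeMounts...)
	return out
}

func main() {
	base := Container{
		Name:         "gms-saver",
		Image:        "default-gms:latest", // placeholder default image
		Command:      []string{"python3", "-m", "gpu_memory_service.cli.snapshot.saver"},
		Env:          []EnvVar{{"GMS_SOCKET_DIR", "/gms"}, {"GMS_CHECKPOINT_DIR", "/checkpoints"}},
		VolumeMounts: []VolumeMount{{"gms-intrapod-control", "/gms"}},
	}
	out := applyClientSpec(base, &GMSClientSpec{
		Image:   "my-gms-saver:latest",
		Command: []string{"/bin/gms-save"},
		Envs:    []EnvVar{{"GMS_SOCKET_DIR", "/tmp"}, {"GMS_TRANSFER_BACKEND", "nixl-gds"}},
	})
	fmt.Println(out.Name, out.Image, out.Command, out.Env)
}
```

Keeping the base env order and replacing by name avoids reordering variables that Kubernetes expands in declaration order.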

Out of scope / follow-ups

This PR intentionally does not clean up the older GMS checkpoint storage path. Follow-up work can remove or redesign the remaining built-in assumptions, including:

  • GMS_CHECKPOINT_DIR injection and resolveGMSArtifactDir().
  • GMS use of DiscoverAndResolveStorage() / InjectCheckpointVolume() for the operator-owned checkpoint PVC path.
  • Python-side CLI/env fallback and transfer-backend support.
  • Any broader cleanup around the temporary GMS snapshot gate.

Validation

  • make manifests
  • uv run --no-project --with 'pydantic>=2.11.4,<2.13' --with pyyaml --with black make generate-pydantic
  • go test -buildvcs=false $(go list -buildvcs=false ./... | grep -v /e2e | grep -v /test | grep -v /cmd) from deploy/operator
  • go test ./internal/controller ./api/v1alpha1 ./api/v1beta1 ./internal/checkpoint ./internal/webhook/validation ./internal/dynamo from deploy/operator

Note: go test ./... includes e2e tests and failed locally because kubectl attempted to use an expired Teleport kube context; the non-e2e operator package tests above pass.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added GPU Memory Service checkpoint client customization support, enabling override of loader (restore) and saver (checkpoint) client containers with custom images, commands, environment variables, and volume mounts.


@github-actions github-actions Bot added the feat and deployment::k8s (Relates to dynamo deployment in kubernetes) labels and removed the feat label May 15, 2026
Purely additive schema change: introduces GMSCheckpointSpec (loader + saver)
and GMSSidecarSpec (image, command, envs, envFromSecret, volumeMounts) and
nests an optional Checkpoint pointer on GPUMemoryServiceSpec.
No operator code yet reads these fields; behavior is byte-identical to
before. Follow-up commits wire the injection path (applyGMSSidecarSpec
helper layered on the existing defaults), add the webhook D3 rule
(checkpoint.{loader,saver} requires gpuMemoryService.enabled=true), and
reject checkpoint.loader on DynamoCheckpoint (Jobs only save).
Per locked design choice C2 there is no separate Args field — Command is
the full container argv. Container names stay operator-owned
(gms-loader / gms-saver) for compatibility with the idempotent
removeGMSManagedContainers strip path landed in #9514.
Regenerated artifacts:
- zz_generated.deepcopy.go: DeepCopy{Into} for the two new structs;
  GPUMemoryServiceSpec.DeepCopyInto now handles the Checkpoint pointer.
- config/crd/bases/{dynamocheckpoints,dynamocomponentdeployments,
  dynamographdeployments}.yaml: new optional checkpoint schema branch.
- deploy/helm/charts/.../crds/: synced copies from the same make target.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
…r/saver

Adds applyGMSSidecarSpec, which takes the operator-built default loader or
saver container and overlays user-supplied fields (image, command, envs,
envFromSecret, volumeMounts) when GPUMemoryService.Checkpoint.{Loader,Saver}
is non-nil. When nil, behavior is byte-identical to before: GMS_CHECKPOINT_DIR
is still set, the checkpoint PVC is still mounted, the default
python3 -m gpu_memory_service.cli.snapshot.{loader,saver} command is still
used, and the container name stays operator-owned.
Merge semantics mirror dynamo.mergeFrontendSidecarDefaults:
- Image and Command are full overrides (when non-empty).
- Envs are merged with user-wins on name collision (local mergeEnvVars
  duplicates dynamo.MergeEnvs to avoid an import cycle: dynamo already
  imports checkpoint).
- EnvFromSecret appends an envFrom source.
- VolumeMounts append to (not replace) operator mounts so the
  gms-intrapod-control mount survives.
Callers updated:
- internal/checkpoint/podspec.go: passes info.GPUMemoryService.Checkpoint
  to EnsureGMSRestoreSidecars on the restore path.
- internal/controller/checkpoint_job.go: passes
  ckpt.Spec.GPUMemoryService.Checkpoint to EnsureGMSCheckpointJobSidecars
  on the standalone DynamoCheckpoint job path.
Pre-existing tests in internal/checkpoint and internal/controller pass
unchanged.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
…and reject loader on DynamoCheckpoint

Two locked design rules from the GMS sidecar pluggability plan:
D3 (DGD + DynamoCheckpoint via SharedSpecValidator):
    gpuMemoryService.checkpoint.loader or .saver is only meaningful when
    gpuMemoryService.enabled=true, because the operator only injects the
    loader/saver sidecars on the GMS path. Enforced as a new method
    validateGMSCheckpointSidecars on SharedSpecValidator. Empty
    checkpoint: {} (no loader, no saver) is still accepted — the field is
    purely additive and a literal empty object opts into nothing.
DynamoCheckpoint-specific (rejects loader on Jobs):
    A DynamoCheckpoint is a save-side Job; restore is reconciled onto worker
    pods from the consuming service's ServiceCheckpointConfig. Setting
    checkpoint.loader on a DynamoCheckpoint is always a configuration error.
    Added validateDynamoCheckpointCheckpointSidecars in
    dynamocheckpoint_handler.go and chained behind the existing GMS-snapshot
    env-gate via a new validateDynamoCheckpoint entry point.
This composes with (does not replace) the GMS snapshot feature-gate check
from #8829 (ValidateGMSSnapshotGate / DYN_OPERATOR_ALLOW_GMS_SNAPSHOT): the
gate answers "is GMS+snapshot admissible at all?"; D3 and the loader
rejection answer "given it is admissible, does the spec internally make
sense?".
Tests:
- shared_test.go: 4 new D3 cases (empty checkpoint accepted, loader with
  enabled=true accepted, loader without enabled rejected, saver without
  enabled rejected).
- dynamocheckpoint_test.go (new): table-driven cases for the standalone
  handler — loader always rejected, saver requires enabled=true, plus a
  composition test that proves the loader rule fires even with the GMS
  snapshot env-gate satisfied.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
New gms_test.go exercises the merge semantics added by
applyGMSSidecarSpec. The existing checkpoint_test.go is left untouched —
this file owns the new opt-in cases.
Loader path (EnsureGMSRestoreSidecars):
- Nil checkpoint spec: byte-identical to pre-PR behavior (default image,
  default command, GMS_CHECKPOINT_DIR set, checkpoint PVC mounted).
- Image override: user image wins; command/env/mounts unchanged.
- Command override: user argv fully replaces default; no implicit
  python3 -m prefix (locked C2).
- Envs merge: user wins on name collision with operator-set vars; new
  vars appended; GMS_SOCKET_DIR preserved.
- VolumeMounts append: operator gms-intrapod-control and checkpoint-PVC
  mounts both preserved; user mount appears alongside.
- EnvFromSecret: appended as a single envFrom source.
Saver path (EnsureGMSCheckpointJobSidecars):
- Nil checkpoint spec: byte-identical default saver injection.
- Saver override: full Image+Command+Envs+VolumeMounts+EnvFromSecret
  merge; cross-check that Loader on the same spec is never accidentally
  read on the Job path.
applyGMSSidecarSpec direct unit tests:
- nil spec is a no-op (returns base unchanged).
- empty &GMSSidecarSpec{} is also a no-op (every field empty falls
  through; this covers the literal "checkpoint.loader: {}" footgun cited
  in the design plan).
- empty EnvFromSecret string is ignored — does not render a malformed
  envFrom source despite the lean-config stance, because an empty secret
  name is not a legal Kubernetes reference.
All tests run with the existing test scaffolding (testHash, findContainer,
testPodSpec analog). No envtest dependency.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 galletas1712 force-pushed the schwinns/gms-checkpoint-spec-additive branch from 747f2ee to 1dd8530 Compare May 18, 2026 21:53

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 galletas1712 force-pushed the schwinns/gms-checkpoint-spec-additive branch from 1dd8530 to 3c34ea8 Compare May 18, 2026 22:04
@galletas1712 galletas1712 marked this pull request as ready for review May 18, 2026 22:09
@galletas1712 galletas1712 requested a review from a team as a code owner May 18, 2026 22:09
coderabbitai Bot (Contributor) commented May 18, 2026

Walkthrough

This PR extends the GPU Memory Service (GMS) API to support optional checkpoint sidecar overrides (loader for restore, saver for save Jobs). Users can now customize GMS sidecar image, command, environment variables, and volume mounts without modifying operator defaults. Changes span API schema definition, CRD validation, conversion logic, injection helpers, controller integration, and admission validation.

Changes

GPU Memory Service Checkpoint Sidecar Override Configuration

Each entry below lists the layer, its file(s), and a summary.

  • API type definitions for checkpoint overrides
    Files: deploy/operator/api/v1alpha1/common.go, deploy/operator/api/v1beta1/common.go
    Summary: Added GMSCheckpointSpec (with optional Loader and Saver sidecar overrides) and GMSSidecarSpec (with optional Image, Command, Envs, EnvFromSecret, VolumeMounts) to both API versions. Added Checkpoint field to GPUMemoryServiceSpec.

  • CRD schema validation for checkpoint configuration
    Files: deploy/helm/charts/platform/components/operator/crds/*dynamocheckpoints.yaml, deploy/operator/config/crd/bases/*dynamocheckpoints.yaml, deploy/helm/charts/.../crds/*dynamocomponentdeployments.yaml, deploy/operator/config/crd/bases/*dynamocomponentdeployments.yaml, deploy/helm/charts/.../crds/*dynamographdeployments.yaml, deploy/operator/config/crd/bases/*dynamographdeployments.yaml
    Summary: Declared spec.gpuMemoryService.checkpoint schema with nested loader and saver override configuration in all three CRD types (both helm chart and operator config base paths), validating command arrays, envFromSecret references, envs with full valueFrom sources (configMapKeyRef, fieldRef, fileKeyRef, resourceFieldRef, secretKeyRef), image overrides, and volumeMounts.

  • Version conversion and deep-copy support
    Files: deploy/operator/api/v1alpha1/shared_spec_conversion.go, deploy/operator/api/v1alpha1/zz_generated.deepcopy.go, deploy/operator/api/v1beta1/zz_generated.deepcopy.go, deploy/operator/api/v1alpha1/conversion_field_coverage_test.go, deploy/operator/api/v1alpha1/dynamocomponentdeployment_conversion_test.go, deploy/operator/api/v1alpha1/dynamographdeployment_conversion_test.go
    Summary: Implemented ConvertFromGMSCheckpointSpec, ConvertToGMSCheckpointSpec, and sidecar conversion helpers with proper deep-copying of slice fields. Added autogenerated DeepCopyInto and DeepCopy methods for checkpoint and sidecar types. Extended fixture data and added comprehensive v1alpha1↔v1beta1 round-trip test validating checkpoint loader/saver payload preservation.

  • Sidecar override injection logic
    Files: deploy/operator/internal/checkpoint/gms.go, deploy/operator/internal/checkpoint/gms_test.go
    Summary: Extended EnsureGMSRestoreSidecars and EnsureGMSCheckpointJobSidecars signatures to accept optional *GMSCheckpointSpec. Implemented applyGMSSidecarSpec helper to layer override image, command, envs (with user-wins merging except for operator-owned GMS_SOCKET_DIR), envFrom secrets, and volume mounts onto default containers. Added 11 unit tests covering nil spec defaults, image/command overrides, env merging semantics, volume mount appending, and edge cases.

  • Controller integration and overlay helpers
    Files: deploy/operator/internal/checkpoint/podspec.go, deploy/operator/internal/controller/checkpoint_job.go, deploy/operator/internal/controller/gms_checkpoint_overrides.go, deploy/operator/internal/controller/dynamocomponentdeployment_controller.go, deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go, deploy/operator/internal/controller/dynamographdeployment_controller.go, deploy/operator/internal/controller/dynamographdeployment_controller_test.go
    Summary: Wired checkpoint spec into pod creation by passing it to sidecar ensure functions. Added overlayServiceGMSRestoreLoader helper to compose service-level GMS loader overrides onto resolved checkpoint configs, and gmsSpecForAutoCheckpointSave to strip loader overrides from auto-save configs (loader is restore-only). Updated DynamoComponentDeployment and DynamoGraphDeployment controllers to apply loader overlays during checkpoint resolution. Added 3 integration tests validating loader/saver override preservation through controller logic.

  • Admission validation for checkpoint sidecar configuration
    Files: deploy/operator/internal/webhook/validation/dynamocheckpoint_handler.go, deploy/operator/internal/webhook/validation/dynamocheckpoint_test.go, deploy/operator/internal/webhook/validation/shared.go, deploy/operator/internal/webhook/validation/shared_test.go
    Summary: Added validateDynamoCheckpointCheckpointSidecars enforcing that loaders are rejected on DynamoCheckpoint resources (loader is for restore phase only) and savers require gpuMemoryService.enabled=true. Added validateGMSCheckpointSidecars to SharedSpecValidator for general loader/saver enable checks. Composed validation into validateDynamoCheckpoint orchestrator. Added 8 unit tests covering accept/reject logic and composition with GMS snapshot feature gate.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage
    Status: ⚠️ Warning
    Explanation: Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Linked Issues check
    Status: ✅ Passed
    Explanation: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check
    Status: ✅ Passed
    Explanation: Check skipped because no linked issues were found for this pull request.
  • Description check
    Status: ✅ Passed
    Explanation: The pull request description is comprehensive and well-structured, covering overview, details, size notes, design choices, examples, validation steps, and references related issues. However, it does not explicitly include a 'Related Issues' section with action keywords as specified in the template.
  • Title check
    Status: ✅ Passed
    Explanation: The PR title 'feat(operator): add GMSCheckpointSpec for loader/saver checkpoint clients' accurately describes the main feature addition, using conventional commit format. It directly references the primary API type and use-case added in the PR.


@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocheckpoints.yaml`:
- Around line 76-83: The CRD currently exposes checkpoint.loader but the webhook
forbids it for DynamoCheckpoint; add a CRD-level CEL validation to reject it by
adding an x-kubernetes-validations (CEL) rule under the DynamoCheckpoint CRD's
openAPIV3Schema that asserts spec.gpuMemoryService.checkpoint.loader is null (or
equivalent “not set”) and provides a clear message; target the DynamoCheckpoint
kind/schema in nvidia.com_dynamocheckpoints.yaml and reference the property path
spec.gpuMemoryService.checkpoint.loader and the property name loader to ensure
the CRD schema and kubectl explain/OpenAPI docs match the webhook behavior.
- Around line 97-245: The GMSSidecarSpec's envs and volumeMounts lists are
missing Kubernetes list-map markers; update the envs array schema (property
name: envs under GMSSidecarSpec) to include x-kubernetes-list-type: map and
x-kubernetes-list-map-keys: ["name"], and update the volumeMounts array schema
(property name: volumeMounts under GMSSidecarSpec) to include
x-kubernetes-list-type: map and x-kubernetes-list-map-keys: ["mountPath"]; apply
the same markers to the other referenced sections (the other GMSSidecarSpec-like
arrays noted in the review) and then regenerate the CRD output so
server-side-apply and dup detection align with native Container.env and
Container.volumeMounts semantics.

In `@deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml`:
- Around line 76-84: Update the CRD description for checkpoint.loader to clearly
state that spec.gpuMemoryService.checkpoint.loader is not accepted on
DynamoCheckpoint and will be rejected by the webhook; instead indicate it must
be configured on the consuming service (e.g., set on the service-specific
resource or injection point). Edit the description under
properties.checkpoint.loader to explicitly mention "Not valid on
DynamoCheckpoint (rejected by webhook); set on the consuming service instead"
and, if applicable, reference the exact field name
spec.gpuMemoryService.checkpoint.loader and the DynamoCheckpoint validation
behavior so kubectl explain and generated docs are accurate.
- Around line 97-245: The envs and volumeMounts override schema blocks (the envs
array item schema and the volumeMounts array item schema used for loader and
saver) are missing Kubernetes list-map markers; add x-kubernetes-list-type:
"map" and x-kubernetes-list-map-keys: ["name"] to the envs item object schema,
and x-kubernetes-list-type: "map" and x-kubernetes-list-map-keys: ["mountPath"]
to the volumeMounts item object schema (apply to both loader and saver
occurrences) so server-side-apply and duplicate-key semantics match native Pod
fields.

In `@deploy/operator/internal/checkpoint/gms.go`:
- Line 11: The merge currently sorts merged env vars (see the removed "sort"
import) which breaks Kubernetes env expansion order; change the merging logic so
it preserves the base env slice order and applies overrides by key in-place
(replace existing entries when keys match, append new keys to the end) instead
of alphabetically sorting the result; update the function that builds/returns
the merged env slice (the env-merge code path where "sort" was used) to perform
keyed replace/append while keeping original ordering and remove any sorting
step.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1704540c-a343-44e5-9b65-1344603eeaef

📥 Commits

Reviewing files that changed from the base of the PR and between 1a86523 and 3c34ea8.

📒 Files selected for processing (27)
  • deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocheckpoints.yaml
  • deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamographdeployments.yaml
  • deploy/operator/api/v1alpha1/common.go
  • deploy/operator/api/v1alpha1/conversion_field_coverage_test.go
  • deploy/operator/api/v1alpha1/dynamocomponentdeployment_conversion_test.go
  • deploy/operator/api/v1alpha1/dynamographdeployment_conversion_test.go
  • deploy/operator/api/v1alpha1/shared_spec_conversion.go
  • deploy/operator/api/v1alpha1/zz_generated.deepcopy.go
  • deploy/operator/api/v1beta1/common.go
  • deploy/operator/api/v1beta1/zz_generated.deepcopy.go
  • deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml
  • deploy/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/operator/internal/checkpoint/gms.go
  • deploy/operator/internal/checkpoint/gms_test.go
  • deploy/operator/internal/checkpoint/podspec.go
  • deploy/operator/internal/controller/checkpoint_job.go
  • deploy/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/operator/internal/controller/dynamographdeployment_controller.go
  • deploy/operator/internal/controller/dynamographdeployment_controller_test.go
  • deploy/operator/internal/controller/gms_checkpoint_overrides.go
  • deploy/operator/internal/webhook/validation/dynamocheckpoint_handler.go
  • deploy/operator/internal/webhook/validation/dynamocheckpoint_test.go
  • deploy/operator/internal/webhook/validation/shared.go
  • deploy/operator/internal/webhook/validation/shared_test.go

Comment thread deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml
Comment thread deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml
Comment thread deploy/operator/internal/checkpoint/gms.go Outdated
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 galletas1712 changed the title feat(operator): add GMSCheckpointSpec for loader/saver sidecar overrides feat(operator): add GMSCheckpointSpec for loader/saver checkpoint clients May 18, 2026
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 galletas1712 changed the title feat(operator): add GMSCheckpointSpec for loader/saver checkpoint clients feat(operator): add GMSCheckpointSpec for loader/saver GMS clients May 18, 2026
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 19, 2026
@galletas1712 (Contributor Author)

/ok to test 06575fa


type GMSCheckpointSpec struct {
    // loader configures the client that loads checkpoint artifacts on restore.
    // +optional
    Loader *GMSClientSpec `json:"loader,omitempty"`

Contributor

Wondering whether we should use PodTemplateSpec here. What if you need to add annotations, labels, tolerations, etc. to the loader and/or the saver in the future?

Contributor Author

I was debating this as well. It's because GMSClientSpec is currently shaped for a sidecar container rather than a sidecar pod. The gms-saver should always be a sidecar container (it is launched inside a checkpoint Job and isn't managed by Grove), which means that on the checkpoint side we only allow intrapod GMS. The gms-loader, by contrast, could be launched either as a sidecar container or as another pod.

So on second thought, maybe something like this works better?

type GMSCheckpointSpec struct {
    Loader *GMSCheckpointLoaderSpec `json:"loader,omitempty"`
    Saver  *GMSCheckpointSaverSpec  `json:"saver,omitempty"`
}

type GMSCheckpointLoaderSpec struct {
    GMSClientSpec `json:",inline"`

    // Future:
    // PodTemplate *corev1.PodTemplateSpec `json:"podTemplate,omitempty"`
}

type GMSCheckpointSaverSpec struct {
    GMSClientSpec `json:",inline"`
}

But then I'm not sure if this is too many layers of indirection.

Contributor

Pod or container? Reads like container right now.

Contributor

Yes, container, not pod.

Contributor Author

The saver is always a container, but the loader can be either a container or a pod, depending on whether we use intrapod or interpod GMS.

// Loader applies to restore pods; Saver applies to checkpoint Jobs.
// Requires Enabled=true (enforced by webhook).
// +optional
Checkpoint *GMSCheckpointSpec `json:"checkpoint,omitempty"`

Contributor

I was going to ask why you add this to v1alpha1, but that's because GPUMemoryServiceSpec is used in the DynamoCheckpoint CR, which is v1alpha1 only.

Contributor

Also, we should always add to all versions. And reusing types is also fine.

}
if !v.spec.GPUMemoryService.Enabled {
return fmt.Errorf(
"%s.gpuMemoryService.checkpoint requires gpuMemoryService.enabled=true",

Contributor

Couldn't this be expressed via CEL without any code?

return nil, err
}

// D3: checkpoint.{loader,saver} requires gpuMemoryService.enabled=true.

Contributor

What does D3 mean?


Labels

deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat size/XXL


4 participants