feat(operator): add GMSCheckpointSpec for loader/saver GMS clients#9641
feat(operator): add GMSCheckpointSpec for loader/saver GMS clients#9641galletas1712 wants to merge 9 commits into
Conversation
Purely additive schema change: introduces GMSCheckpointSpec (loader + saver)
and GMSSidecarSpec (image, command, envs, envFromSecret, volumeMounts) and
nests an optional Checkpoint pointer on GPUMemoryServiceSpec.
No operator code yet reads these fields; behavior is byte-identical to
before. Follow-up commits wire the injection path (applyGMSSidecarSpec
helper layered on the existing defaults), add the webhook D3 rule
(checkpoint.{loader,saver} requires gpuMemoryService.enabled=true), and
reject checkpoint.loader on DynamoCheckpoint (Jobs only save).
Per locked design choice C2 there is no separate Args field — Command is
the full container argv. Container names stay operator-owned
(gms-loader / gms-saver) for compatibility with the idempotent
removeGMSManagedContainers strip path landed in #9514.
Regenerated artifacts:
- zz_generated.deepcopy.go: DeepCopy{Into} for the two new structs;
GPUMemoryServiceSpec.DeepCopyInto now handles the Checkpoint pointer.
- config/crd/bases/{dynamocheckpoints,dynamocomponentdeployments,
dynamographdeployments}.yaml: new optional checkpoint schema branch.
- deploy/helm/charts/.../crds/: synced copies from the same make target.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
…r/saver
Adds applyGMSSidecarSpec, which takes the operator-built default loader or
saver container and overlays user-supplied fields (image, command, envs,
envFromSecret, volumeMounts) when GPUMemoryService.Checkpoint.{Loader,Saver}
is non-nil. When nil, behavior is byte-identical to before: GMS_CHECKPOINT_DIR
is still set, the checkpoint PVC is still mounted, the default
python3 -m gpu_memory_service.cli.snapshot.{loader,saver} command is still
used, and the container name stays operator-owned.
Merge semantics mirror dynamo.mergeFrontendSidecarDefaults:
- Image and Command are full overrides (when non-empty).
- Envs are merged with user-wins on name collision (local mergeEnvVars
duplicates dynamo.MergeEnvs to avoid an import cycle: dynamo already
imports checkpoint).
- EnvFromSecret appends an envFrom source.
- VolumeMounts append to (not replace) operator mounts so the
gms-intrapod-control mount survives.
Callers updated:
- internal/checkpoint/podspec.go: passes info.GPUMemoryService.Checkpoint
to EnsureGMSRestoreSidecars on the restore path.
- internal/controller/checkpoint_job.go: passes
ckpt.Spec.GPUMemoryService.Checkpoint to EnsureGMSCheckpointJobSidecars
on the standalone DynamoCheckpoint job path.
Pre-existing tests in internal/checkpoint and internal/controller pass
unchanged.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
…and reject loader on DynamoCheckpoint
Two locked design rules from the GMS sidecar pluggability plan:
D3 (DGD + DynamoCheckpoint via SharedSpecValidator):
gpuMemoryService.checkpoint.loader or .saver is only meaningful when
gpuMemoryService.enabled=true, because the operator only injects the
loader/saver sidecars on the GMS path. Enforced as a new method
validateGMSCheckpointSidecars on SharedSpecValidator. Empty
checkpoint: {} (no loader, no saver) is still accepted — the field is
purely additive and a literal empty object opts into nothing.
DynamoCheckpoint-specific (rejects loader on Jobs):
A DynamoCheckpoint is a save-side Job; restore is reconciled onto worker
pods from the consuming service's ServiceCheckpointConfig. Setting
checkpoint.loader on a DynamoCheckpoint is always a configuration error.
Added validateDynamoCheckpointCheckpointSidecars in
dynamocheckpoint_handler.go and chained behind the existing GMS-snapshot
env-gate via a new validateDynamoCheckpoint entry point.
This composes with (does not replace) the GMS snapshot feature-gate check
from #8829 (ValidateGMSSnapshotGate / DYN_OPERATOR_ALLOW_GMS_SNAPSHOT): the
gate answers "is GMS+snapshot admissible at all?"; D3 and the loader
rejection answer "given it is admissible, does the spec internally make
sense?".
Tests:
- shared_test.go: 4 new D3 cases (empty checkpoint accepted, loader with
enabled=true accepted, loader without enabled rejected, saver without
enabled rejected).
- dynamocheckpoint_test.go (new): table-driven cases for the standalone
handler — loader always rejected, saver requires enabled=true, plus a
composition test that proves the loader rule fires even with the GMS
snapshot env-gate satisfied.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
New gms_test.go exercises the merge semantics added by
applyGMSSidecarSpec. The existing checkpoint_test.go is left untouched —
this file owns the new opt-in cases.
Loader path (EnsureGMSRestoreSidecars):
- Nil checkpoint spec: byte-identical to pre-PR behavior (default image,
default command, GMS_CHECKPOINT_DIR set, checkpoint PVC mounted).
- Image override: user image wins; command/env/mounts unchanged.
- Command override: user argv fully replaces default; no implicit
python3 -m prefix (locked C2).
- Envs merge: user wins on name collision with operator-set vars; new
vars appended; GMS_SOCKET_DIR preserved.
- VolumeMounts append: operator gms-intrapod-control and checkpoint-PVC
mounts both preserved; user mount appears alongside.
- EnvFromSecret: appended as a single envFrom source.
Saver path (EnsureGMSCheckpointJobSidecars):
- Nil checkpoint spec: byte-identical default saver injection.
- Saver override: full Image+Command+Envs+VolumeMounts+EnvFromSecret
merge; cross-check that Loader on the same spec is never accidentally
read on the Job path.
applyGMSSidecarSpec direct unit tests:
- nil spec is a no-op (returns base unchanged).
- empty &GMSSidecarSpec{} is also a no-op (every field empty falls
through; this covers the literal "checkpoint.loader: {}" footgun cited
in the design plan).
- empty EnvFromSecret string is ignored — does not render a malformed
envFrom source despite the lean-config stance, because an empty secret
name is not a legal Kubernetes reference.
All tests run with the existing test scaffolding (testHash, findContainer,
testPodSpec analog). No envtest dependency.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
747f2ee to
1dd8530
Compare
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
1dd8530 to
3c34ea8
Compare
WalkthroughThis PR extends the GPU Memory Service (GMS) API to support optional checkpoint sidecar overrides (loader for restore, saver for save Jobs). Users can now customize GMS sidecar image, command, environment variables, and volume mounts without modifying operator defaults. Changes span API schema definition, CRD validation, conversion logic, injection helpers, controller integration, and admission validation. ChangesGPU Memory Service Checkpoint Sidecar Override Configuration
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocheckpoints.yaml`:
- Around line 76-83: The CRD currently exposes checkpoint.loader but the webhook
forbids it for DynamoCheckpoint; add a CRD-level CEL validation to reject it by
adding an x-kubernetes-validations (CEL) rule under the DynamoCheckpoint CRD's
openAPIV3Schema that asserts spec.gpuMemoryService.checkpoint.loader is null (or
equivalent “not set”) and provides a clear message; target the DynamoCheckpoint
kind/schema in nvidia.com_dynamocheckpoints.yaml and reference the property path
spec.gpuMemoryService.checkpoint.loader and the property name loader to ensure
the CRD schema and kubectl explain/OpenAPI docs match the webhook behavior.
- Around line 97-245: The GMSSidecarSpec's envs and volumeMounts lists are
missing Kubernetes list-map markers; update the envs array schema (property
name: envs under GMSSidecarSpec) to include x-kubernetes-list-type: map and
x-kubernetes-list-map-keys: ["name"], and update the volumeMounts array schema
(property name: volumeMounts under GMSSidecarSpec) to include
x-kubernetes-list-type: map and x-kubernetes-list-map-keys: ["mountPath"]; apply
the same markers to the other referenced sections (the other GMSSidecarSpec-like
arrays noted in the review) and then regenerate the CRD output so
server-side-apply and dup detection align with native Container.env and
Container.volumeMounts semantics.
In `@deploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yaml`:
- Around line 76-84: Update the CRD description for checkpoint.loader to clearly
state that spec.gpuMemoryService.checkpoint.loader is not accepted on
DynamoCheckpoint and will be rejected by the webhook; instead indicate it must
be configured on the consuming service (e.g., set on the service-specific
resource or injection point). Edit the description under
properties.checkpoint.loader to explicitly mention "Not valid on
DynamoCheckpoint (rejected by webhook); set on the consuming service instead"
and, if applicable, reference the exact field name
spec.gpuMemoryService.checkpoint.loader and the DynamoCheckpoint validation
behavior so kubectl explain and generated docs are accurate.
- Around line 97-245: The envs and volumeMounts override schema blocks (the envs
array item schema and the volumeMounts array item schema used for loader and
saver) are missing Kubernetes list-map markers; add x-kubernetes-list-type:
"map" and x-kubernetes-list-map-keys: ["name"] to the envs item object schema,
and x-kubernetes-list-type: "map" and x-kubernetes-list-map-keys: ["mountPath"]
to the volumeMounts item object schema (apply to both loader and saver
occurrences) so server-side-apply and duplicate-key semantics match native Pod
fields.
In `@deploy/operator/internal/checkpoint/gms.go`:
- Line 11: The merge currently sorts merged env vars (see the removed "sort"
import) which breaks Kubernetes env expansion order; change the merging logic so
it preserves the base env slice order and applies overrides by key in-place
(replace existing entries when keys match, append new keys to the end) instead
of alphabetically sorting the result; update the function that builds/returns
the merged env slice (the env-merge code path where "sort" was used) to perform
keyed replace/append while keeping original ordering and remove any sorting
step.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1704540c-a343-44e5-9b65-1344603eeaef
📒 Files selected for processing (27)
deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocheckpoints.yamldeploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocomponentdeployments.yamldeploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamographdeployments.yamldeploy/operator/api/v1alpha1/common.godeploy/operator/api/v1alpha1/conversion_field_coverage_test.godeploy/operator/api/v1alpha1/dynamocomponentdeployment_conversion_test.godeploy/operator/api/v1alpha1/dynamographdeployment_conversion_test.godeploy/operator/api/v1alpha1/shared_spec_conversion.godeploy/operator/api/v1alpha1/zz_generated.deepcopy.godeploy/operator/api/v1beta1/common.godeploy/operator/api/v1beta1/zz_generated.deepcopy.godeploy/operator/config/crd/bases/nvidia.com_dynamocheckpoints.yamldeploy/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/operator/internal/checkpoint/gms.godeploy/operator/internal/checkpoint/gms_test.godeploy/operator/internal/checkpoint/podspec.godeploy/operator/internal/controller/checkpoint_job.godeploy/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/operator/internal/controller/dynamographdeployment_controller.godeploy/operator/internal/controller/dynamographdeployment_controller_test.godeploy/operator/internal/controller/gms_checkpoint_overrides.godeploy/operator/internal/webhook/validation/dynamocheckpoint_handler.godeploy/operator/internal/webhook/validation/dynamocheckpoint_test.godeploy/operator/internal/webhook/validation/shared.godeploy/operator/internal/webhook/validation/shared_test.go
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
|
/ok to test 06575fa |
| type GMSCheckpointSpec struct { | ||
| // loader configures the client that loads checkpoint artifacts on restore. | ||
| // +optional | ||
| Loader *GMSClientSpec `json:"loader,omitempty"` |
There was a problem hiding this comment.
wondering if we shouldn't use PodTemplateSpec here
what if you need to add annotation / labels / tolerations, ..... to the loader and/or the saver in the future ?
There was a problem hiding this comment.
I was debating this as well. It's because GMSClientSpec is shaped for a sidecar container than a sidecar pod right now. The gms-saver should always be a sidecar container (since it is launched inside a checkpoint Job and isn't managed by Grove), which means for the checkpoint side we only allows intrapod GMS. While the gms-loader could be launched either as sidecar container or as another pod.
So on second thought, maybe something like this works better?
type GMSCheckpointSpec struct {
Loader *GMSCheckpointLoaderSpec `json:"loader,omitempty"`
Saver *GMSCheckpointSaverSpec `json:"saver,omitempty"`
}
type GMSCheckpointLoaderSpec struct {
GMSClientSpec `json:",inline"`
// Future:
// PodTemplate *corev1.PodTemplateSpec `json:"podTemplate,omitempty"`
}
type GMSCheckpointSaverSpec struct {
GMSClientSpec `json:",inline"`
}But then I'm not sure if this is too many layers of indirection.
There was a problem hiding this comment.
pod or container? Reads like container right now.
There was a problem hiding this comment.
yes container, not pod
There was a problem hiding this comment.
saver is always a container, but the loader can be either a container or a pod depending on whether we use intrapod or interpod GMS.
| // Loader applies to restore pods; Saver applies to checkpoint Jobs. | ||
| // Requires Enabled=true (enforced by webhook). | ||
| // +optional | ||
| Checkpoint *GMSCheckpointSpec `json:"checkpoint,omitempty"` |
There was a problem hiding this comment.
I was going to ask : why do you add this to v1alpha1 ?
but that's because this GpuMemoryServiceSpec is used in the DynamoCheckpoint CR which is v1alpha1 only ...
There was a problem hiding this comment.
Also we should always add to all versions. And reusing types is also fine.
| } | ||
| if !v.spec.GPUMemoryService.Enabled { | ||
| return fmt.Errorf( | ||
| "%s.gpuMemoryService.checkpoint requires gpuMemoryService.enabled=true", |
There was a problem hiding this comment.
couldn't this be expressed via CEL without any code?
| return nil, err | ||
| } | ||
|
|
||
| // D3: checkpoint.{loader,saver} requires gpuMemoryService.enabled=true. |
There was a problem hiding this comment.
what does D3 means ?
Summary
Adds the GMS checkpoint client override API and wires it through the v1beta1 controller paths so custom GMS saver/loader client containers work for both auto-created checkpoints and restore-time services.
Users can now configure:
gpuMemoryService.checkpoint.saverfor thegms-savercheckpoint client used by checkpoint Jobs.gpuMemoryService.checkpoint.loaderfor thegms-loadercheckpoint client used by restore-target pods.If no override is provided, the operator builds the same default GMS saver/loader client containers as before.
Size / review notes
Most of the apparent diff size is generated CRD YAML: the new GMS client spec includes Kubernetes
EnvVarandVolumeMounttypes, so controller-gen expands those schemas in both the operator CRDs and copied Helm CRDs.The hand-written logic is concentrated in:
api/v1alpha1/common.go,api/v1beta1/common.go,api/v1alpha1/shared_spec_conversion.gointernal/checkpoint/gms.gointernal/controller/gms_checkpoint_overrides.goplus the two reconcile call sitesWhat changed
API / CRDs
GPUMemoryServiceSpec.Checkpointin both v1alpha1 and v1beta1.GMSCheckpointSpecwith independentloaderandsaverspecs.GMSClientSpecfields forimage, fullcommand,envs,envFromSecret, andvolumeMounts.GMS checkpoint client injection
EnsureGMSRestoreSidecarslayerscheckpoint.loaderonto the defaultgms-loaderclient container.EnsureGMSCheckpointJobSidecarslayerscheckpoint.saveronto the defaultgms-saverclient container.GMS_SOCKET_DIRremains operator-owned so the injected clients always use the operator-managed UDS mount.DGD auto checkpoint flow
DynamoCheckpoint, the generated CR now carries the service's GMS saver override.DynamoCheckpoint, because checkpoint Jobs only save; restore-time loader configuration belongs to the consuming service.Restore flow
gpuMemoryService.checkpoint.loaderoverride before pod generation.DynamoCheckpointmust still be GMS-enabled.Admission validation
gpuMemoryService.checkpoint.{loader,saver}unless GMS is enabled.DynamoCheckpointvalidation rejectscheckpoint.loader, because aDynamoCheckpointJob only saves.Tests
Example schemas
1. DGD auto checkpoint + custom saver
User submits a DGD with checkpoint
mode: Auto, GMS enabled, and a service-levelsaveroverride:The DGD controller auto-creates a
DynamoCheckpointcarrying the saver config. The loader is not copied into this CR because the checkpoint Job only saves:2. DGD restore + custom service loader
User submits a DGD that restores from a GMS-enabled checkpoint and sets a service-level
loaderoverride:The referenced checkpoint must be GMS-enabled. It may have a saver override from creation time, but the restore loader is read from the consuming service above:
At pod-generation time, the controller resolves
llama-gms-checkpoint, sees that it is GMS-enabled, then overlays the DGD service'sloaderoverride onto the generatedgms-loaderclient container.3. Manual DynamoCheckpoint with custom saver
For manual checkpoint creation, users can put
saverdirectly on theDynamoCheckpoint:gpuMemoryService.checkpoint.loaderis rejected onDynamoCheckpoint; loader config belongs on the DGD/DCD service that restores from the checkpoint.Design choices
DynamoCheckpoint: save-time behavior is reconciled by the checkpoint controller from theDynamoCheckpointCR, so DGD auto mode stamps the saver override onto the generated CR.gms-loaderandgms-saver; users customize fields on those containers rather than replacing the containers wholesale.commandis the entire argv. There is no separateargsfield and no hiddenpython3 -m ...prefix when a custom command is provided.GMS_SOCKET_DIR.Out of scope / follow-ups
This PR intentionally does not clean up the older GMS checkpoint storage path. Follow-up work can remove or redesign the remaining built-in assumptions, including:
GMS_CHECKPOINT_DIRinjection andresolveGMSArtifactDir().DiscoverAndResolveStorage()/InjectCheckpointVolume()for the operator-owned checkpoint PVC path.Validation
make manifestsuv run --no-project --with 'pydantic>=2.11.4,<2.13' --with pyyaml --with black make generate-pydanticgo test -buildvcs=false $(go list -buildvcs=false ./... | grep -v /e2e | grep -v /test | grep -v /cmd)fromdeploy/operatorgo test ./internal/controller ./api/v1alpha1 ./api/v1beta1 ./internal/checkpoint ./internal/webhook/validation ./internal/dynamofromdeploy/operatorNote:
go test ./...includes e2e tests and failed locally because kubectl attempted to use an expired Teleport kube context; the non-e2e operator package tests above pass.Summary by CodeRabbit
Release Notes