docs: clarify GPU Memory Service status by athreesh · Pull Request #9119 · ai-dynamo/dynamo

athreesh · 2026-05-04T22:41:29Z

Summary

- Add a GPU Memory Service Kubernetes docs page that explains when to use GMS today, when not to use it, and how it relates to Snapshot and failover.
- Update the Snapshot docs to stop telling users to set `spec.gpuMemoryService.enabled`, since Snapshot plus GMS is currently rejected by admission because of GPU driver restore issues.
- Link the new page from the Kubernetes docs nav and README, and update the `DynamoCheckpoint` sample comment to keep GMS checkpoint capture disabled.
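
For readers who want a feel for the admission behavior described above, here is a minimal sketch of the kind of check that would reject Snapshot combined with GMS. It is not the operator's actual webhook: the function name and the `checkpoint.enabled` field used to represent Snapshot are hypothetical, and only `spec.gpuMemoryService.enabled` comes from this PR.

```python
def validate_gms_with_snapshot(spec: dict) -> list[str]:
    """Return admission-style error messages for a component spec.

    Assumption: Snapshot is represented here by a hypothetical
    `checkpoint.enabled` field; the real API may name it differently.
    """
    gms = spec.get("gpuMemoryService") or {}
    snapshot = spec.get("checkpoint") or {}  # hypothetical field name

    errors = []
    if gms.get("enabled", False) and snapshot.get("enabled", False):
        errors.append(
            "gpuMemoryService.enabled cannot be combined with Snapshot: "
            "GMS checkpoint/restore is temporarily disabled due to "
            "GPU driver restore issues"
        )
    return errors


if __name__ == "__main__":
    both = {"gpuMemoryService": {"enabled": True}, "checkpoint": {"enabled": True}}
    for message in validate_gms_with_snapshot(both):
        print("rejected:", message)
```

The check only fires when both features are enabled, so a spec that uses GMS alone or Snapshot alone passes unchanged.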

Why

The current Snapshot docs made GMS checkpoint/restore look user-ready even though the operator now blocks that path. The new page makes the experimental status explicit and gives users a clearer decision point for GMS, Snapshot, and failover.

Validation

git diff --check
git diff --cached --check
ruby -e "require 'yaml'; YAML.load_file('docs/index.yml'); puts 'docs/index.yml ok'"
verified internal doc targets exist for gpu-memory-service.md, snapshot.md, and api-reference.md
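
The internal-target check above was done by hand. A rough way to automate it, assuming the docs use ordinary inline Markdown links with relative paths, might look like the sketch below. The helper name `missing_targets` is made up for illustration, and the regex deliberately ignores reference-style links and links with titles.

```python
import re
from pathlib import Path

# Inline Markdown link: [text](target) or [text](target#anchor).
# Pure anchor links such as (#section) are not matched.
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")


def missing_targets(doc: Path) -> list[str]:
    """List relative link targets in `doc` that do not exist on disk."""
    missing = []
    for target in LINK.findall(doc.read_text(encoding="utf-8")):
        # Skip external URLs; only local files can be checked here.
        if "://" in target or target.startswith("mailto:"):
            continue
        if not (doc.parent / target).exists():
            missing.append(target)
    return missing


if __name__ == "__main__":
    page = Path("docs/kubernetes/gpu-memory-service.md")
    if page.exists():
        print(missing_targets(page) or "all internal targets exist")
```

This does not replace `fern docs broken-links`, which understands the docs site's own routing; it only confirms that relative file targets resolve.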

Note: fern is not installed in this local environment, so I did not run fern check or fern docs broken-links.

Summary by CodeRabbit

Documentation
- Added comprehensive GPU Memory Service documentation explaining same-node recovery behavior and current limitations
- Updated navigation to include GPU Memory Service deployment guide
- Clarified that GPU Memory Service is temporarily disabled with Snapshot due to GPU driver restore issues
Chores
- Updated sample configuration with guidance notes for GPU Memory Service

copy-pr-bot · 2026-05-04T22:41:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-04T23:00:18Z

Walkthrough

This PR documents the experimental GPU Memory Service (GMS) feature. It adds a new documentation page, updates related documentation to clarify GMS limitations and its temporary incompatibility with Snapshot (caused by GPU driver restore issues), adds navigation links, and updates a sample configuration comment to reflect that GMS checkpoint/restore is temporarily disabled.

Changes

GPU Memory Service Documentation & Integration

| Layer / File(s) | Summary |
|---|---|
| Core Feature Documentation<br>`docs/kubernetes/gpu-memory-service.md` | New page explaining GMS purpose (keeping GPU-resident weights across lifecycle changes on the same node), failure recovery flow with diagram, decision guide, prerequisites, limitations, API placement for both `v1alpha1` and `v1beta1`, and example manifests for basic GMS and active/passive failover usage. |
| Integration with Existing Docs<br>`docs/kubernetes/snapshot.md`, `docs/fault-tolerance/README.md` | `snapshot.md` updated to prohibit enabling GMS with Snapshot, mark failover restore as experimental and GMS-only, and note that GMS restore is disabled due to driver issues. `fault-tolerance/README.md` gains a GPU Memory entry in the fault-tolerance table and a dedicated GMS section clarifying the same-node recovery scope and that hardware loss, in-flight requests, and KV cache are not covered. |
| Navigation & Discovery<br>`docs/index.yml`, `docs/kubernetes/README.md` | Navigation entries added for the "GPU Memory Service" page in the Kubernetes Deployment Guide and Additional Resources lists. |
| Configuration Samples<br>`deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml` | Sample `DynamoCheckpoint` spec comment clarifies that GMS checkpoint/restore is temporarily disabled due to GPU driver restore issues and should remain `false`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'docs: clarify GPU Memory Service status' directly and concisely summarizes the main objective: updating documentation to clarify the GPU Memory Service feature status and guidance. |
| Description check | ✅ Passed | The PR description includes all required sections: summary of changes, rationale (Why), and validation performed, with clear details on what was changed and why. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/kubernetes/gpu-memory-service.md`:
- Around line 21-42: The fenced ASCII diagram block in the GPU memory service
doc is missing a language identifier and triggers markdownlint MD040; update the
opening code fence for the diagram (the triple-backtick that precedes the ASCII
art block) to include a language tag such as "text" (e.g., ```text) so the
fenced code block is explicitly marked and the linter warning is resolved.
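
As an illustration of the MD040 fix requested above, the following sketch adds a language tag to any opening fence that lacks one. It is a simplified helper, not part of markdownlint: the function name `tag_bare_fences` is hypothetical, and it only handles three-backtick fences (not tildes or longer fences).

```python
FENCE = "`" * 3  # built at runtime so this sample contains no literal fence


def tag_bare_fences(markdown: str, language: str = "text") -> str:
    """Add `language` to opening code fences that have no language tag."""
    out = []
    inside = False  # True while we are between an opening and closing fence
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not inside and stripped == FENCE:
                # Bare opening fence: keep the indentation, append the tag.
                indent = line[: len(line) - len(line.lstrip())]
                line = f"{indent}{FENCE}{language}"
            inside = not inside
        out.append(line)
    trailing = "\n" if markdown.endswith("\n") else ""
    return "\n".join(out) + trailing


if __name__ == "__main__":
    sample = "Diagram:\n" + FENCE + "\n+-----+\n| GMS |\n+-----+\n" + FENCE + "\n"
    print(tag_bare_fences(sample))
```

Closing fences and fences that already carry a tag are left untouched, so running the helper twice produces the same output as running it once.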

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6babb8cf-35c4-4de5-8738-090dbc250860

📥 Commits

Reviewing files that changed from the base of the PR and between a039628 and 9f52639.

📒 Files selected for processing (6)

deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml
docs/fault-tolerance/README.md
docs/index.yml
docs/kubernetes/README.md
docs/kubernetes/gpu-memory-service.md
docs/kubernetes/snapshot.md

athreesh · 2026-05-04T23:07:56Z

/ok to test 5aeec39

github-actions · 2026-05-04T23:10:06Z

🌿 Fern Docs Preview: https://nvidia-preview-b958f686-fc62-4dbd-b276-c69d88df5d9f.docs.buildwithfern.com/dynamo/dev

Signed-off-by: athreesh <anish.maddipoti@utexas.edu>

athreesh · 2026-05-04T23:51:00Z

/ok to test 74825fb

Signed-off-by: athreesh <anish.maddipoti@utexas.edu>

pull-request-size Bot added the size/L label May 4, 2026

github-actions Bot added the documentation and deployment::k8s labels May 4, 2026

athreesh changed the title from "[codex] docs: clarify GPU Memory Service status" to "docs: clarify GPU Memory Service status" May 4, 2026

github-actions Bot added the docs label May 4, 2026

athreesh force-pushed the codex/gms-docs branch 7 times, most recently from d59e567 to 9f52639 Compare May 4, 2026 22:56

athreesh requested review from galletas1712 and mohammedabdulwahhab May 4, 2026 22:57

athreesh marked this pull request as ready for review May 4, 2026 22:57

athreesh requested a review from a team as a code owner May 4, 2026 22:57

dynamo-ops approved these changes May 4, 2026

coderabbitai Bot reviewed May 4, 2026

Comment thread docs/kubernetes/gpu-memory-service.md Outdated

athreesh force-pushed the codex/gms-docs branch from 9f52639 to 5aeec39 Compare May 4, 2026 23:07

galletas1712 reviewed May 4, 2026

Comment thread docs/kubernetes/snapshot.md Outdated

Comment thread docs/kubernetes/snapshot.md Outdated

Comment thread docs/kubernetes/gpu-memory-service.md Outdated

docs: clarify GPU Memory Service status

74825fb

Signed-off-by: athreesh <anish.maddipoti@utexas.edu>

athreesh force-pushed the codex/gms-docs branch from 5aeec39 to 74825fb Compare May 4, 2026 23:39

galletas1712 approved these changes May 4, 2026

mohammedabdulwahhab approved these changes May 4, 2026

athreesh enabled auto-merge (squash) May 4, 2026 23:50

copy-pr-bot Bot temporarily deployed to GITLAB May 4, 2026 23:51 Inactive

athreesh merged commit 26645cc into main May 4, 2026
57 checks passed

athreesh deleted the codex/gms-docs branch May 4, 2026 23:55

copy-pr-bot Bot had a problem deploying to GITLAB May 5, 2026 00:07 Failure

keivenchang pushed a commit that referenced this pull request May 5, 2026

docs: clarify GPU Memory Service status (#9119)

61a16ab

Signed-off-by: athreesh <anish.maddipoti@utexas.edu>

keivenchang mentioned this pull request May 5, 2026

test(revalidate): docs: clarify GPU Memory Service status #9119 #9137

Closed

athreesh merged 1 commit into main from codex/gms-docs

4 participants
