docs: clarify GPU Memory Service status#9119

Merged
athreesh merged 1 commit into main from codex/gms-docs
May 4, 2026

@athreesh athreesh commented May 4, 2026

Summary

  • Add a GPU Memory Service Kubernetes docs page that explains when to use GMS today, when not to use it, and how it relates to Snapshot and failover.
  • Update Snapshot docs to stop telling users to enable spec.gpuMemoryService.enabled, since Snapshot plus GMS is currently rejected by admission because of GPU driver restore issues.
  • Link the new page from the Kubernetes docs nav and README, and update the DynamoCheckpoint sample comment to keep GMS checkpoint capture disabled.

Why

The current Snapshot docs made GMS checkpoint/restore look user-ready even though the operator now blocks that path. The new page makes the experimental status explicit and gives users a clearer decision point for GMS, Snapshot, and failover.
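
The rule described above (Snapshot combined with GMS is rejected) can be sketched as a small predicate. This is only an illustration, not the operator's admission code: `spec.gpuMemoryService.enabled` is the field named in this PR, while the function name and the `uses_snapshot` flag are hypothetical stand-ins for however the operator actually detects Snapshot usage.

```python
from typing import Any, Mapping


def rejects_snapshot_with_gms(spec: Mapping[str, Any], uses_snapshot: bool) -> bool:
    """Return True when the combination should be rejected.

    spec["gpuMemoryService"]["enabled"] mirrors the field named in this PR.
    uses_snapshot is a hypothetical flag; the real detection logic lives in
    the operator and is not reproduced here.
    """
    gms = spec.get("gpuMemoryService") or {}
    gms_enabled = bool(gms.get("enabled", False))
    # Only the combination is blocked; either feature alone passes this check.
    return gms_enabled and uses_snapshot
```

Under these assumptions, a spec with GMS enabled passes when Snapshot is not in use, and a Snapshot workload passes when `gpuMemoryService` is absent or disabled.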

Validation

  • git diff --check
  • git diff --cached --check
  • ruby -e "require 'yaml'; YAML.load_file('docs/index.yml'); puts 'docs/index.yml ok'"
  • verified internal doc targets exist for gpu-memory-service.md, snapshot.md, and api-reference.md

Note: fern is not installed in this local environment, so I did not run fern check or fern docs broken-links.
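
The manual check of internal doc targets listed above could be approximated with a short script when fern is unavailable. The following is a minimal sketch, not the tooling used for this PR: the function name and the regular expression are assumptions, and it only handles inline Markdown links to relative `.md` files.

```python
import re
from pathlib import Path

# Matches the target of an inline Markdown link such as [text](snapshot.md#anchor).
LINK_TARGET = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def missing_doc_targets(markdown: str, base_dir: Path) -> list[str]:
    """Return relative .md link targets that do not exist under base_dir."""
    missing = []
    for target in LINK_TARGET.findall(markdown):
        path_part = target.split("#", 1)[0]
        # Skip external URLs, pure anchors, and non-Markdown targets.
        if "://" in target or not path_part.endswith(".md"):
            continue
        if not (base_dir / path_part).is_file():
            missing.append(path_part)
    return missing
```

Calling `missing_doc_targets(Path("docs/kubernetes/gpu-memory-service.md").read_text(), Path("docs/kubernetes"))` would list any relative targets that fail to resolve. It does not replace `fern docs broken-links`, which also covers navigation entries.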

Summary by CodeRabbit

  • Documentation

    • Added comprehensive GPU Memory Service documentation explaining same-node recovery behavior and current limitations
    • Updated navigation to include GPU Memory Service deployment guide
    • Clarified that GPU Memory Service is temporarily disabled with Snapshot due to GPU driver restore issues
  • Chores

    • Updated sample configuration with guidance notes for GPU Memory Service

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) and deployment::k8s (Relates to dynamo deployment in kubernetes) labels May 4, 2026
@athreesh athreesh changed the title [codex] docs: clarify GPU Memory Service status docs: clarify GPU Memory Service status May 4, 2026
@github-actions github-actions Bot added the docs label May 4, 2026
@athreesh athreesh force-pushed the codex/gms-docs branch 7 times, most recently from d59e567 to 9f52639 Compare May 4, 2026 22:56
@athreesh athreesh marked this pull request as ready for review May 4, 2026 22:57
@athreesh athreesh requested a review from a team as a code owner May 4, 2026 22:57

coderabbitai Bot commented May 4, 2026

Walkthrough

This PR documents the experimental GPU Memory Service (GMS) feature by adding a comprehensive new documentation page, updating related documentation to clarify GMS limitations and its temporary incompatibility with Snapshot due to GPU driver restore issues, adding navigation links, and updating a sample configuration comment to reflect that GMS checkpoint/restore is temporarily disabled.

Changes

GPU Memory Service Documentation & Integration

| Layer / File(s) | Summary |
| --- | --- |
| Core Feature Documentation: docs/kubernetes/gpu-memory-service.md | New page explaining GMS purpose (keeping GPU-resident weights across lifecycle changes on same node), failure recovery flow with diagram, decision guide, prerequisites, limitations, API placement for both v1alpha1 and v1beta1, and example manifests for basic GMS and active/passive failover usage. |
| Integration with Existing Docs: docs/kubernetes/snapshot.md, docs/fault-tolerance/README.md | snapshot.md updated to prohibit GMS enabled with Snapshot, mark failover restore as experimental GMS-only, and note GMS restore is disabled due to driver issues. fault-tolerance/README.md gains GPU Memory entry in fault-tolerance table and dedicated GMS section clarifying same-node recovery scope and non-coverage of hardware loss, in-flight requests, and KV cache. |
| Navigation & Discovery: docs/index.yml, docs/kubernetes/README.md | Navigation entries added for "GPU Memory Service" page in Kubernetes Deployment Guide and Additional Resources lists. |
| Configuration Samples: deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml | Sample DynamoCheckpoint spec comment clarifies that GMS checkpoint/restore is temporarily disabled due to GPU driver restore issues and should remain false. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'docs: clarify GPU Memory Service status' directly and concisely summarizes the main objective: updating documentation to clarify the GPU Memory Service feature status and guidance. |
| Description check | ✅ Passed | The PR description includes all required sections: summary of changes, rationale (Why), and validation performed, with clear details on what was changed and why. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/kubernetes/gpu-memory-service.md`:
- Around line 21-42: The fenced ASCII diagram block in the GPU memory service
doc is missing a language identifier and triggers markdownlint MD040; update the
opening code fence for the diagram (the triple-backtick that precedes the ASCII
art block) to include a language tag such as "text" (e.g., ```text) so the
fenced code block is explicitly marked and the linter warning is resolved.
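
The MD040 finding above can be reproduced locally without markdownlint. What follows is a simplified sketch under stated assumptions: the function name is hypothetical, only backtick fences are handled (tilde fences and nested fences of different lengths are ignored), and it is not a substitute for the real linter rule.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences with no language tag."""
    missing = []
    inside_block = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside_block:
            # A fence seen while inside a block closes it.
            inside_block = False
        else:
            inside_block = True
            # An opening fence with nothing after the backticks lacks a tag.
            if stripped[len(FENCE):].strip() == "":
                missing.append(number)
    return missing
```

Run against `docs/kubernetes/gpu-memory-service.md`, a bare opening fence before the ASCII diagram would be reported by line number; after adding the `text` tag the list would be empty for that block.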

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6babb8cf-35c4-4de5-8738-090dbc250860

📥 Commits

Reviewing files that changed from the base of the PR and between a039628 and 9f52639.

📒 Files selected for processing (6)
  • deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml
  • docs/fault-tolerance/README.md
  • docs/index.yml
  • docs/kubernetes/README.md
  • docs/kubernetes/gpu-memory-service.md
  • docs/kubernetes/snapshot.md

Comment thread docs/kubernetes/gpu-memory-service.md Outdated

athreesh commented May 4, 2026

/ok to test 5aeec39

github-actions Bot commented May 4, 2026

Comment thread docs/kubernetes/snapshot.md Outdated
Comment thread docs/kubernetes/snapshot.md Outdated
Comment thread docs/kubernetes/gpu-memory-service.md Outdated
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
@athreesh athreesh enabled auto-merge (squash) May 4, 2026 23:50

athreesh commented May 4, 2026

/ok to test 74825fb

@athreesh athreesh merged commit 26645cc into main May 4, 2026
57 checks passed
@athreesh athreesh deleted the codex/gms-docs branch May 4, 2026 23:55
keivenchang pushed a commit that referenced this pull request May 5, 2026
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>

Labels

deployment::k8s (Relates to dynamo deployment in kubernetes), docs, documentation (Improvements or additions to documentation), size/L

4 participants