docs: clarify GPU Memory Service status #9119
Walkthrough
This PR documents the experimental GPU Memory Service (GMS) feature by adding a comprehensive new documentation page, updating related documentation to clarify GMS limitations and its temporary incompatibility with Snapshot due to GPU driver restore issues, adding navigation links, and updating a sample configuration comment to reflect that GMS checkpoint/restore is temporarily disabled.
Changes
GPU Memory Service Documentation & Integration
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/kubernetes/gpu-memory-service.md`:
- Around line 21-42: The fenced ASCII diagram block in the GPU memory service
doc is missing a language identifier and triggers markdownlint MD040; update the
opening code fence for the diagram (the triple-backtick that precedes the ASCII
art block) to include a language tag such as "text" (e.g., ```text) so the
fenced code block is explicitly marked and the linter warning is resolved.
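The MD040 finding above can be checked mechanically before pushing. The following is a minimal Ruby sketch under stated assumptions, not markdownlint's implementation: `untagged_fences` is a hypothetical helper name, and it only recognizes triple-backtick fences (tilde fences and longer fence runs are ignored).

```ruby
# Hypothetical helper, not part of markdownlint: returns the 1-based line
# numbers of opening code fences that carry no language identifier.
# Simplifying assumption: only triple-backtick fences are considered.
FENCE = "`" * 3

def untagged_fences(markdown)
  inside = false
  hits = []
  markdown.each_line.with_index(1) do |line, number|
    stripped = line.strip
    next unless stripped.start_with?(FENCE)

    if inside
      inside = false # this fence closes the current block
    else
      inside = true  # this fence opens a new block
      hits << number if stripped == FENCE
    end
  end
  hits
end

# Small in-memory sample: one untagged block, then one tagged with "text".
sample = ["intro", FENCE, "ascii art", FENCE, "", "#{FENCE}text", "ascii art", FENCE].join("\n")
p untagged_fences(sample)
```

To run it against the real page, one could pass `File.read("docs/kubernetes/gpu-memory-service.md")` instead of `sample`; an empty result would suggest every opening fence has a language tag.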
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6babb8cf-35c4-4de5-8738-090dbc250860
📒 Files selected for processing (6)
- deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml
- docs/fault-tolerance/README.md
- docs/index.yml
- docs/kubernetes/README.md
- docs/kubernetes/gpu-memory-service.md
- docs/kubernetes/snapshot.md
/ok to test 5aeec39
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
/ok to test 74825fb
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Summary
`spec.gpuMemoryService.enabled`, since Snapshot plus GMS is currently rejected by admission because of GPU driver restore issues.
Why
The current Snapshot docs made GMS checkpoint/restore look user-ready even though the operator now blocks that path. The new page makes the experimental status explicit and gives users a clearer decision point for GMS, Snapshot, and failover.
Validation
- `git diff --check`
- `git diff --cached --check`
- `ruby -e "require 'yaml'; YAML.load_file('docs/index.yml'); puts 'docs/index.yml ok'"`
- gpu-memory-service.md, snapshot.md, and api-reference.md
Note: `fern` is not installed in this local environment, so I did not run `fern check` or `fern docs broken-links`.
Summary by CodeRabbit
Documentation
Chores