Conversation
Signed-off-by: vinhn <vinhn@nvidia.com>
📝 WalkthroughWalkthroughDocumentation update that replaces a Kubernetes section placeholder with a comprehensive guide covering NemoRL training job deployment on Kubernetes using Ray with NVIDIA GPUs, including setup, configuration, and operational procedures. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @docs/cluster.md:
- Line 331: The manifest's image field is hardcoded to
nvcr.io/nvidian/nemo-rl:latest which will mismatch the build tag
nvcr.io/${NGC_ORG}/nemo-rl:latest and cause ImagePullBackOff; update the image
entry to use the same placeholder used during build (e.g.,
nvcr.io/${NGC_ORG}/nemo-rl:latest or a clear placeholder like
<YOUR_NGC_ORG>/nemo-rl:latest) and add a short IMPORTANT note above the YAML
telling users to replace <YOUR_NGC_ORG> (or set NGC_ORG) prior to applying, or
alternatively document using envsubst for variable substitution so the
deployment image matches the built image.
- Line 325: YAML contains markdown-style resource names like
"[nvidia.com/gpu](https://nvidia.com/gpu)" which is invalid; replace each
occurrence with the plain resource name nvidia.com/gpu (e.g., change the key
from "[nvidia.com/gpu](https://nvidia.com/gpu)" to "nvidia.com/gpu") in the
cluster and worker specs and apply the same replacement for all other
occurrences of the markdown link form in the file.
- Line 306: The cluster config pins rayVersion: '2.49.2', which has a critical
CVE (ShadowRay); update the version string (rayVersion) to '2.52.0' or later in
the cluster configuration and any other places that reference rayVersion (e.g.,
docs/cluster.md and matched entries in pyproject.toml or deployment manifests),
or alternatively add notes/instructions to enforce strict network/API access
controls for the jobs/dashboard if you cannot upgrade; ensure consistency across
all files referencing the rayVersion symbol.
🧹 Nitpick comments (2)
docs/cluster.md (2)
346-346: Network interface configuration requires validation.The
bond0interface (lines 346, 412) is not universal across all Kubernetes clusters. While line 278 mentions checking with the admin, users might miss this note and experience NCCL communication failures that are difficult to debug.💡 Add more prominent configuration note
Consider adding a more visible warning directly in the YAML comments:
env: - name: NVIDIA_VISIBLE_DEVICES value: "all" + # IMPORTANT: Verify the correct network interface with your cluster admin + # Common values: bond0, eth0, ib0 (for InfiniBand) + # Run 'ip addr' or 'ifconfig' on a node to identify available interfaces - name: NCCL_SOCKET_IFNAME value: bond0 - name: NCCL_SHM_DISABLEAlso applies to: 412-412
203-203: Consider using proper heading for better document structure.Line 203 uses bold emphasis for "Login to the Registry" which could be a proper heading (e.g.,
#### Login to the Registry) for better document structure and navigation.📝 Convert to proper heading
### 2. Build and Push the Docker Container We will use the NVIDIA cloud registry (`nvcr.io`) for this guide. From your client machine: -**Login to the Registry** +#### Login to the Registry ```bash # Set up Docker and nvcr.io with your NGC_API_KEY docker login nvcr.ioApply similar changes to "Build and Push" on line 212 if desired.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/cluster.md
🧰 Additional context used
📓 Path-based instructions (2)
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/cluster.md
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
docs/cluster.md
🪛 LanguageTool
docs/cluster.md
[style] ~235-~235: ‘exact same’ might be wordy. Consider a shorter alternative.
Context: ... the Head node and Worker nodes see the exact same files (code, data, checkpoints). This p...
(EN_WORDINESS_PREMIUM_EXACT_SAME)
🪛 markdownlint-cli2 (0.18.1)
docs/cluster.md
203-203: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
docs/cluster.md (2)
517-560: Helpful utility for PVC debugging.The busybox helper pod is a practical addition that allows users to inspect and manage PVC contents without spinning up expensive GPU pods. The implementation is clean and the usage examples are clear.
1-3: Thedocs/index.mdfile has been properly updated with the cluster setup guide. The newdocs/cluster.mddocument is referenced in the "Environment Start" section of the index and appears in the grid card for cluster setup under "Training and Generation," satisfying the coding guideline requirement.
Signed-off-by: vinhn <vinhn@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: vinhn <vinhn@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com>
The Automodel submodule now tracks the fix/gemma4-moe-gate-double-norm branch on the shuangy fork, which is rebased on upstream main (bd942f20) and carries only the single MoE-gate double-norm fix plus its regression tests. This drops the three transformers 5.5 compat patches that have since landed upstream (NVIDIA-NeMo#1734, NVIDIA-NeMo#1769, NVIDIA-NeMo#1764) and collapses our carry-stack from four patches down to one. gemma4-support is preserved on the fork as an A/B fallback — flip .gitmodules branch + re-checkout the submodule to swap. Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The Automodel submodule now tracks the fix/gemma4-moe-gate-double-norm branch on the shuangy fork, which is rebased on upstream main (bd942f20) and carries only the single MoE-gate double-norm fix plus its regression tests. This drops the three transformers 5.5 compat patches that have since landed upstream (NVIDIA-NeMo#1734, NVIDIA-NeMo#1769, NVIDIA-NeMo#1764) and collapses our carry-stack from four patches down to one. gemma4-support is preserved on the fork as an A/B fallback — flip .gitmodules branch + re-checkout the submodule to swap. Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The Automodel submodule now tracks the fix/gemma4-moe-gate-double-norm branch on the shuangy fork, which is rebased on upstream main (bd942f20) and carries only the single MoE-gate double-norm fix plus its regression tests. This drops the three transformers 5.5 compat patches that have since landed upstream (NVIDIA-NeMo#1734, NVIDIA-NeMo#1769, NVIDIA-NeMo#1764) and collapses our carry-stack from four patches down to one. gemma4-support is preserved on the fork as an A/B fallback — flip .gitmodules branch + re-checkout the submodule to swap. Signed-off-by: Shuang Yu <shuangy@nvidia.com>
What does this PR do ?
Adding k8 setup and job execution guide
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.