docs: Adding k8 guide by vinhngx · Pull Request #1764 · NVIDIA-NeMo/RL

vinhngx · 2026-01-13T05:32:34Z

What does this PR do?

Adds a Kubernetes (k8s) setup and job execution guide.


Before your PR is "Ready for review"

Pre checks:

- Make sure you read and followed Contributor guidelines
- Did you write any new necessary tests?
- Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
- Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.


Summary by CodeRabbit

Documentation
- Added a comprehensive Kubernetes deployment guide for NVIDIA GPU-accelerated NeMo RL training, covering prerequisites, container management, shared storage setup, Ray cluster configuration with detailed YAML examples, job submission procedures, and monitoring.


Signed-off-by: vinhn <vinhn@nvidia.com>

coderabbitai · 2026-01-13T05:37:15Z

Walkthrough

Documentation update that replaces the Kubernetes section placeholder with a comprehensive guide to deploying NeMo RL training jobs on Kubernetes using Ray with NVIDIA GPUs, including setup, configuration, and operational procedures.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `docs/cluster.md` | Replaced TBD Kubernetes section with detailed migration guide including prerequisites, cluster setup phases (storage, Ray cluster, workload deployment), YAML configurations, management commands, and monitoring procedures |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

documentation

Suggested reviewers

lbliii
terrykong

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | Pull request contains documentation-only changes adding a Kubernetes deployment guide, which does not affect functionality or require testing. |
| Title check | ✅ Passed | The title 'docs: Adding k8 guide' accurately describes the main change: adding Kubernetes documentation to the repository. |


coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In @docs/cluster.md:
- Line 331: The manifest's `image` field is hardcoded to `nvcr.io/nvidian/nemo-rl:latest`, which will not match the build tag `nvcr.io/${NGC_ORG}/nemo-rl:latest` and will cause `ImagePullBackOff`. Update the image entry to use the same placeholder used during the build (e.g., `nvcr.io/${NGC_ORG}/nemo-rl:latest`, or a clear placeholder like `nvcr.io/<YOUR_NGC_ORG>/nemo-rl:latest`) and add a short IMPORTANT note above the YAML telling users to replace `<YOUR_NGC_ORG>` (or set `NGC_ORG`) before applying. Alternatively, document using `envsubst` for variable substitution so the deployment image matches the built image.
- Line 325: The YAML contains markdown-style resource names such as `[nvidia.com/gpu](https://nvidia.com/gpu)`, which are not valid resource names. Replace each occurrence with the plain resource name `nvidia.com/gpu` in the cluster and worker specs, and apply the same replacement to every other occurrence of the markdown link form in the file.
- Line 306: The cluster config pins `rayVersion: '2.49.2'`, which has a critical CVE (ShadowRay). Update `rayVersion` to `'2.52.0'` or later in the cluster configuration and anywhere else that references it (e.g., docs/cluster.md and matching entries in pyproject.toml or deployment manifests). If you cannot upgrade, add notes or instructions to enforce strict network/API access controls for the jobs/dashboard. Keep the version consistent across all files that reference `rayVersion`.
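
The first two items above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the approach the PR adopted: the function names `render_manifest` and `strip_md_links` are hypothetical, and the `sed` fallback assumes `NGC_ORG` contains no `|` or `&` characters.

```shell
# Hypothetical helper: fill in ${NGC_ORG} in a manifest template read from stdin.
render_manifest() {
  # Abort with a message if NGC_ORG is unset or empty.
  : "${NGC_ORG:?set NGC_ORG to your NGC organization first}"
  if command -v envsubst >/dev/null 2>&1; then
    # Limit substitution to NGC_ORG so any other ${...} text is left untouched.
    NGC_ORG="$NGC_ORG" envsubst '${NGC_ORG}'
  else
    # Fallback when gettext's envsubst is not installed.
    sed "s|\${NGC_ORG}|${NGC_ORG}|g"
  fi
}

# Hypothetical helper: turn "[text](http...)" markdown links back into plain "text",
# e.g. "[nvidia.com/gpu](https://nvidia.com/gpu)" becomes "nvidia.com/gpu".
strip_md_links() {
  sed -E 's#\[([^]]+)\]\(https?://[^)]*\)#\1#g'
}
```

A possible invocation, assuming a template file named `ray-cluster.yaml.tpl` (a made-up name), would be `strip_md_links < ray-cluster.yaml.tpl | render_manifest | kubectl apply -f -`.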

🧹 Nitpick comments (2)

docs/cluster.md (2)
346-346: Network interface configuration requires validation.

The bond0 interface (lines 346, 412) is not universal across all Kubernetes clusters. While line 278 mentions checking with the admin, users might miss this note and experience NCCL communication failures that are difficult to debug.
💡 Add more prominent configuration note

Consider adding a more visible warning directly in the YAML comments:
           env:
             - name: NVIDIA_VISIBLE_DEVICES
               value: "all"
+            # IMPORTANT: Verify the correct network interface with your cluster admin
+            # Common values: bond0, eth0, ib0 (for InfiniBand)
+            # Run 'ip addr' or 'ifconfig' on a node to identify available interfaces
             - name: NCCL_SOCKET_IFNAME
               value: bond0
             - name: NCCL_SHM_DISABLE
Also applies to: 412-412
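
As a rough aid for choosing the `NCCL_SOCKET_IFNAME` value, the interface names on a node can be extracted from the one-line output of `ip -o link show`. The sketch below is an assumption-laden illustration: `list_candidate_ifaces` is a hypothetical name, and it only lists non-loopback interface names without judging which one NCCL should use.

```shell
# Hypothetical helper: read `ip -o link show` output on stdin and print
# interface names, skipping loopback and trimming any "@peer" suffix.
list_candidate_ifaces() {
  awk -F': ' '{
    split($2, parts, "@")          # "bond0@if5" -> "bond0"
    if (parts[1] != "lo") print parts[1]
  }'
}
```

On a node this would be run as `ip -o link show | list_candidate_ifaces`; the right choice still needs confirming with the cluster admin.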

203-203: Consider using proper heading for better document structure.

Line 203 uses bold emphasis for "Login to the Registry" which could be a proper heading (e.g., #### Login to the Registry) for better document structure and navigation.
📝 Convert to proper heading
 ### 2. Build and Push the Docker Container
 We will use the NVIDIA cloud registry (`nvcr.io`) for this guide. From your client machine:
 
-**Login to the Registry**
+#### Login to the Registry
 ```bash
 # Set up Docker and nvcr.io with your NGC_API_KEY
 docker login nvcr.io
Apply similar changes to "Build and Push" on line 212 if desired.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2e4265 and 8d0564d.

📒 Files selected for processing (1)

docs/cluster.md

🧰 Additional context used

📓 Path-based instructions (2)

docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section

Files:

docs/cluster.md

!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

docs/cluster.md

🪛 LanguageTool

docs/cluster.md

[style] ~235-~235: ‘exact same’ might be wordy. Consider a shorter alternative.
Context: ... the Head node and Worker nodes see the exact same files (code, data, checkpoints). This p...

(EN_WORDINESS_PREMIUM_EXACT_SAME)

🪛 markdownlint-cli2 (0.18.1)

docs/cluster.md

203-203: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

🔇 Additional comments (2)

docs/cluster.md (2)

517-560: Helpful utility for PVC debugging.

The busybox helper pod is a practical addition that allows users to inspect and manage PVC contents without spinning up expensive GPU pods. The implementation is clean and the usage examples are clear.
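
For readers without the guide at hand, a helper pod of this kind might look roughly like the sketch below. It is not the manifest from `docs/cluster.md`: the function name `pvc_debug_pod`, the pod name `pvc-debug`, and the `/data` mount path are all assumptions made for illustration.

```shell
# Hypothetical helper: print a minimal busybox pod manifest that mounts an
# existing PVC at /data. Pod name and mount path are illustrative only.
pvc_debug_pod() {
  claim="${1:?usage: pvc_debug_pod <pvc-name>}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pvc-debug
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox
      # Keep the container alive so it can be entered with kubectl exec.
      command: ["sh", "-c", "while true; do sleep 3600; done"]
      volumeMounts:
        - name: shared
          mountPath: /data
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: ${claim}
EOF
}
```

It could be applied with `pvc_debug_pod <pvc-name> | kubectl apply -f -`, after which `kubectl exec -it pvc-debug -- ls /data` would list the volume contents.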

1-3: The docs/index.md file has been properly updated with the cluster setup guide. The new docs/cluster.md document is referenced in the "Environment Start" section of the index and appears in the grid card for cluster setup under "Training and Generation," satisfying the coding guideline requirement.


The Automodel submodule now tracks the fix/gemma4-moe-gate-double-norm branch on the shuangy fork, which is rebased on upstream main (bd942f20) and carries only the single MoE-gate double-norm fix plus its regression tests. This drops the three transformers 5.5 compat patches that have since landed upstream (NVIDIA-NeMo#1734, NVIDIA-NeMo#1769, NVIDIA-NeMo#1764) and collapses our carry-stack from four patches down to one. gemma4-support is preserved on the fork as an A/B fallback — flip .gitmodules branch + re-checkout the submodule to swap. Signed-off-by: Shuang Yu <shuangy@nvidia.com>

adding k8 guide

8d0564d

Signed-off-by: vinhn <vinhn@nvidia.com>

vinhngx requested a review from a team as a code owner January 13, 2026 05:32

github-actions Bot added the Documentation Improvements or additions to documentation label Jan 13, 2026

coderabbitai Bot reviewed Jan 13, 2026


Comment thread docs/cluster.md

Comment thread docs/cluster.md

Comment thread docs/cluster.md Outdated

Comment thread docs/cluster.md

github-actions Bot added the community-request label Jan 13, 2026

fix code rabbit comments

0550c33

Signed-off-by: vinhn <vinhn@nvidia.com>

vinhngx changed the title ~~[Doc] Adding k8 guide~~ docs Adding k8 guide Jan 13, 2026

vinhngx changed the title ~~docs Adding k8 guide~~ docs: Adding k8 guide Jan 13, 2026

vinhngx and others added 2 commits January 13, 2026 11:35

add note on cluster network interface

719d6c1

Signed-off-by: vinhn <vinhn@nvidia.com>

Merge branch 'main' into main

e6b5f20

chtruong814 added the needs-follow-up Issue needs follow-up label Jan 15, 2026

shashank3959 requested a review from lbliii January 16, 2026 21:36

terrykong approved these changes Jan 20, 2026


terrykong added the CI:docs Run doctest label Jan 20, 2026

chtruong814 temporarily deployed to nemo-ci January 20, 2026 06:58 — with GitHub Actions Inactive

vinhngx temporarily deployed to nemo-ci January 20, 2026 06:58 — with GitHub Actions Inactive

terrykong enabled auto-merge (squash) January 20, 2026 06:58

terrykong temporarily deployed to nemo-ci January 20, 2026 06:58 — with GitHub Actions Inactive

vinhngx temporarily deployed to nemo-ci January 20, 2026 07:02 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci January 20, 2026 07:30 — with GitHub Actions Inactive

chtruong814 temporarily deployed to nemo-ci January 20, 2026 07:30 — with GitHub Actions Inactive

terrykong merged commit fe84d53 into NVIDIA-NeMo:main Jan 20, 2026
46 of 47 checks passed

chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 20, 2026

yfw pushed a commit that referenced this pull request Feb 9, 2026

docs: Adding k8 guide (#1764)

64e1610

Signed-off-by: vinhn <vinhn@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>

xavier-owkin pushed a commit to owkin/Owkin-NeMo-RL that referenced this pull request Feb 10, 2026

docs: Adding k8 guide (NVIDIA-NeMo#1764)

43f1bc0

Signed-off-by: vinhn <vinhn@nvidia.com>

yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026

docs: Adding k8 guide (NVIDIA-NeMo#1764)

09df21f

Signed-off-by: vinhn <vinhn@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026

docs: Adding k8 guide (NVIDIA-NeMo#1764)

9c97e47

Signed-off-by: vinhn <vinhn@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 8, 2026

docs: Adding k8 guide (#1764)

6e19c53

Signed-off-by: vinhn <vinhn@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 8, 2026

docs: Adding k8 guide (#1764)

e5bbabe

Signed-off-by: vinhn <vinhn@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 9, 2026

docs: Adding k8 guide (#1764)

7070247

Signed-off-by: vinhn <vinhn@nvidia.com>

terrykong merged 4 commits into NVIDIA-NeMo:main from vinhngx:main
