
cp: perf: [Perf recipe] Change TP 16->32 for deepseek GB200 sync benchmark (#1715) into r0.5.0 #1716

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1715-r0.5.0 on Jan 5, 2026

Conversation

Contributor

chtruong814 commented Jan 5, 2026

beep boop [🤖]: Hi @guyueh1 👋,

we've cherry-picked #1715 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Chores
    • Updated the tensor parallelism configuration in the example LLM performance recipe: the tensor parallel size was increased from 16 to 32, sharding the model across more GPUs during generation.

✏️ Tip: You can customize this high-level summary in your review settings.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@terrykong terrykong added the CI:L0 Run doctests and unit tests label Jan 5, 2026
@terrykong terrykong enabled auto-merge (squash) January 5, 2026 20:23
Contributor

coderabbitai Bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

This PR applies a configuration parameter update in a GRPO DeepSeek V3 example recipe, increasing the tensor parallelism size from 16 to 32 in the vLLM configuration for distributed model serving.

Changes

Cohort: Configuration Update
File(s): examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
Summary: Updated generation.vllm_cfg.tensor_parallel_size from 16 to 32, increasing tensor parallelism for multi-GPU model serving.
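For concreteness, the changed setting can be sketched as a YAML fragment. The nesting below is inferred from the dotted path generation.vllm_cfg.tensor_parallel_size and may not match the full recipe file exactly:

```yaml
generation:
  vllm_cfg:
    # was: tensor_parallel_size: 16
    tensor_parallel_size: 32  # shard each vLLM model replica across 32 GPUs
```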

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested labels

r0.5.0

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks

❌ Failed checks (1 warning)
Check name: Test Results For Major Changes
Status: ⚠️ Warning
Explanation: The performance recipe configuration change lacks documented performance metrics and validation results in the PR description.
Resolution: Add before-and-after throughput metrics, hardware specifications, and test environment details to the PR description, or reference PR #1715 with its performance results.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly identifies the main change: updating the tensor parallel size from 16 to 32 in the deepseek GB200 performance recipe configuration.
  • Docstring Coverage (✅ Passed): No functions found in the changed files; the docstring coverage check was skipped.

📜 Recent review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6526fe9 and 22c029d.

📒 Files selected for processing (1)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
🧰 Additional context used
📓 Path-based instructions (2)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
🧠 Learnings (1)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
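The learning above can be sketched as a quick arithmetic check. The helper functions are hypothetical illustrations, not part of the repo; the simplified model is that for MoE models, EP shards only the MoE layers and DP/FSDP shards only the non-MoE layers, so the two sizes reuse the same GPUs instead of multiplying:

```python
def required_gpus_dense(tp: int, dp: int) -> int:
    # For dense models, parallelism dimensions compose: each data-parallel
    # replica needs its own tensor-parallel group, so GPU counts multiply.
    return tp * dp

def required_gpus_moe(expert_parallel: int, data_parallel: int) -> int:
    # Per the learning above: EP applies only to MoE layers and DP/FSDP only
    # to non-MoE layers, so both dimensions span the same set of GPUs. In
    # this simplified model the requirement is the larger of the two sizes.
    return max(expert_parallel, data_parallel)

print(required_gpus_moe(8, 8))    # EP=8, DP=8 fits on an 8-GPU cluster
print(required_gpus_dense(8, 8))  # a dense TP=8 x DP=8 layout needs 64 GPUs
```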
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (1)
examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)

14-14: LGTM! Clean cherry-pick for GB200 performance optimization.

The increase in tensor parallelism from 16 to 32 aligns with the PR objectives for GB200 sync benchmark performance tuning. With 128 total GPUs (32 nodes × 4 GPUs), TP=32 for vLLM generation is mathematically valid and should provide better model sharding for inference workloads on this hardware configuration.

Since this is a cherry-pick of PR #1715, the change has already been reviewed and validated.

Optionally, verify that this configuration performs as expected on the target GB200 hardware with the benchmarking workload to ensure the performance improvement is realized in the r0.5.0 release.
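The reviewer's divisibility argument can be checked in a few lines. The node and GPU counts are read off the recipe name (32n4g, i.e. 32 nodes of 4 GPUs); the snippet is illustrative, not a script from the repo:

```python
nodes, gpus_per_node = 32, 4
total_gpus = nodes * gpus_per_node          # 128 GPUs in the GB200 job
tensor_parallel_size = 32                   # new generation.vllm_cfg value

# TP must evenly divide the total GPU count for the sharding to be valid.
assert total_gpus % tensor_parallel_size == 0
replicas = total_gpus // tensor_parallel_size
print(replicas)  # 4 generation replicas at TP=32, versus 8 at the old TP=16
```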


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong terrykong merged commit 51df3e7 into r0.5.0 Jan 5, 2026
64 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1715-r0.5.0 branch January 5, 2026 23:19
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
…chmark (1715)` into `r0.5.0` (NVIDIA-NeMo#1716)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

Labels

  • cherry-pick
  • CI:L0 Run doctests and unit tests
  • Run CICD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants