
cp: perf: [Perf recipe] Change TP 16->32 for deepseek GB200 sync benchmark (#1715) into r0.5.0 #1716

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1715-r0.5.0 on Jan 5, 2026

Conversation

Contributor

chtruong814 commented Jan 5, 2026

beep boop [🤖]: Hi @guyueh1 👋,

we've cherry-picked #1715 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Chores
    • Updated the tensor parallelism configuration in the example LLM performance recipe: the tensor parallel size was increased from 16 to 32, sharding the model across more GPUs during generation.

✏️ Tip: You can customize this high-level summary in your review settings.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@terrykong terrykong added the CI:L0 Run doctests and unit tests label Jan 5, 2026
@terrykong terrykong enabled auto-merge (squash) January 5, 2026 20:23
Contributor

coderabbitai Bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

This PR applies a configuration parameter update in a GRPO DeepSeek V3 example recipe, increasing the tensor parallelism size from 16 to 32 in the vLLM configuration for distributed model serving.

Changes

Cohort: Configuration Update
File(s): examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
Summary: Updated generation.vllm_cfg.tensor_parallel_size from 16 to 32, increasing tensor parallelism for multi-GPU model serving.
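For concreteness, the changed setting can be sketched as a YAML fragment. The nesting below is inferred from the dotted path generation.vllm_cfg.tensor_parallel_size and may not match the full recipe file exactly:

```yaml
generation:
  vllm_cfg:
    # was: tensor_parallel_size: 16
    tensor_parallel_size: 32  # shard each vLLM model replica across 32 GPUs
```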

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested labels

r0.5.0

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks

❌ Failed checks (1 warning)
Check name: Test Results For Major Changes
Status: ⚠️ Warning
Explanation: The performance recipe configuration change lacks documented performance metrics and validation results in the PR description.
Resolution: Add before-and-after throughput metrics, hardware specifications, and test environment details to the PR description, or reference PR #1715 with its performance results.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly identifies the main change: updating the tensor parallel size from 16 to 32 in the deepseek GB200 performance recipe configuration.
  • Docstring Coverage (✅ Passed): No functions found in the changed files; the docstring coverage check was skipped.

📜 Recent review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6526fe9 and 22c029d.

📒 Files selected for processing (1)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
🧰 Additional context used
📓 Path-based instructions (2)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
🧠 Learnings (1)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
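The learning above can be sketched as a quick arithmetic check. The helper functions are hypothetical illustrations, not part of the repo; the simplified model is that for MoE models, EP shards only the MoE layers and DP/FSDP shards only the non-MoE layers, so the two sizes reuse the same GPUs instead of multiplying:

```python
def required_gpus_dense(tp: int, dp: int) -> int:
    # For dense models, parallelism dimensions compose: each data-parallel
    # replica needs its own tensor-parallel group, so GPU counts multiply.
    return tp * dp

def required_gpus_moe(expert_parallel: int, data_parallel: int) -> int:
    # Per the learning above: EP applies only to MoE layers and DP/FSDP only
    # to non-MoE layers, so both dimensions span the same set of GPUs. In
    # this simplified model the requirement is the larger of the two sizes.
    return max(expert_parallel, data_parallel)

print(required_gpus_moe(8, 8))    # EP=8, DP=8 fits on an 8-GPU cluster
print(required_gpus_dense(8, 8))  # a dense TP=8 x DP=8 layout needs 64 GPUs
```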
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (1)
examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)

14-14: LGTM! Clean cherry-pick for GB200 performance optimization.

The increase in tensor parallelism from 16 to 32 aligns with the PR objectives for GB200 sync benchmark performance tuning. With 128 total GPUs (32 nodes × 4 GPUs), TP=32 for vLLM generation is mathematically valid and should provide better model sharding for inference workloads on this hardware configuration.

Since this is a cherry-pick of PR #1715, the change has already been reviewed and validated.

Optionally, verify that this configuration performs as expected on the target GB200 hardware with the benchmarking workload to ensure the performance improvement is realized in the r0.5.0 release.
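The reviewer's divisibility argument can be checked in a few lines. The node and GPU counts are read off the recipe name (32n4g, i.e. 32 nodes of 4 GPUs); the snippet is illustrative, not a script from the repo:

```python
nodes, gpus_per_node = 32, 4
total_gpus = nodes * gpus_per_node          # 128 GPUs in the GB200 job
tensor_parallel_size = 32                   # new generation.vllm_cfg value

# TP must evenly divide the total GPU count for the sharding to be valid.
assert total_gpus % tensor_parallel_size == 0
replicas = total_gpus // tensor_parallel_size
print(replicas)  # 4 generation replicas at TP=32, versus 8 at the old TP=16
```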


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong terrykong merged commit 51df3e7 into r0.5.0 Jan 5, 2026
64 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1715-r0.5.0 branch January 5, 2026 23:19
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
…chmark (1715)` into `r0.5.0` (NVIDIA-NeMo#1716)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

Labels

  • cherry-pick
  • CI:L0 Run doctests and unit tests
  • Run CICD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants