
perf: [Perf recipe] Change TP 16->32 for deepseek GB200 sync benchmark #1715

Merged
terrykong merged 1 commit into NVIDIA-NeMo:main from guyueh1:fix_gb200_dpsk_oom
Jan 5, 2026

Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Jan 5, 2026

What does this PR do ?

There seem to be issues with optimizer offloading on GB200: in the DeepSeek colocated benchmark, an OOM occurs unexpectedly during refit. I had to increase the vLLM TP size to avoid the OOM.
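The fix itself is a one-line change to the vLLM generation config in the recipe file this PR touches. A sketch of the relevant fragment, reconstructed from the review summary (the file path and the generation.vllm_cfg.tensor_parallel_size key come from the PR; the surrounding structure is illustrative):

```yaml
# examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
generation:
  vllm_cfg:
    tensor_parallel_size: 32  # was 16; raised to avoid OOM during refit on GB200
```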

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated performance optimization configuration to enhance parallel processing efficiency.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner January 5, 2026 16:52
@guyueh1 guyueh1 self-assigned this Jan 5, 2026
@guyueh1 guyueh1 added r0.5.0 CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Jan 5, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

A single YAML configuration file is updated to increase the tensor parallelism parameter for vLLM generation from 16 to 32 in a DeepSeek-V3 performance recipe configuration.

Changes

  • vLLM Configuration — examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml: updated generation.vllm_cfg.tensor_parallel_size from 16 to 32, increasing tensor parallelism for distributed model inference.
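As a back-of-envelope illustration of why doubling tensor parallelism relieves per-GPU memory pressure during refit: with weights sharded across TP ranks, each rank holds roughly 1/TP of the model. The parameter count (~671B for DeepSeek-V3) and 1 byte/weight (FP8) below are assumptions for illustration, not numbers from the PR, and activations, KV cache, and training state are ignored.

```python
# Rough per-GPU weight-memory estimate under tensor parallelism (TP).
# Assumes ~671e9 parameters at 1 byte/weight; everything else is ignored.

def weight_gb_per_gpu(n_params: float, bytes_per_param: float, tp: int) -> float:
    """Approximate weight memory per GPU, in GB, with weights sharded across TP ranks."""
    return n_params * bytes_per_param / tp / 1e9

for tp in (16, 32):
    print(f"TP={tp}: ~{weight_gb_per_gpu(671e9, 1.0, tp):.0f} GB of weights per GPU")
# TP=16: ~42 GB of weights per GPU
# TP=32: ~21 GB of weights per GPU
```

Halving the resident weight footprint per GPU leaves correspondingly more headroom for the transient allocations made while refitting the inference engine.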

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested labels

Performance, GB200

Suggested reviewers

  • terrykong

Pre-merge checks

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR makes a significant performance configuration change (TP: 16→32) without the required documented test results or performance benchmarks. Resolution: update the PR description to include performance benchmarks comparing TP=16 and TP=32 on GB200, with throughput, latency, tokens/sec/GPU, and memory metrics.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main change, updating the tensor parallelism size from 16 to 32 in the deepseek GB200 sync benchmark configuration.
  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate docstring coverage; check skipped.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a26e53b and 837c890.

📒 Files selected for processing (1)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
🧰 Additional context used
📓 Path-based instructions (2)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: build-container / main
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

@terrykong terrykong enabled auto-merge (squash) January 5, 2026 18:11
@terrykong terrykong merged commit d549154 into NVIDIA-NeMo:main Jan 5, 2026
55 of 57 checks passed
chtruong814 pushed a commit that referenced this pull request Jan 5, 2026
#1715)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
NVIDIA-NeMo#1715)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
NVIDIA-NeMo#1715)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
NVIDIA-NeMo#1715)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 9, 2026

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests r0.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants