cp: docs: v0.5 performance results update (1772) into r0.5.0 #1800

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1772-r0.5.0
Jan 21, 2026

Conversation

@chtruong814 (Contributor) commented Jan 21, 2026

beep boop [🤖]: Hi @guyueh1 👋,

we've cherry-picked #1772 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • Documentation
    • Updated performance documentation to v0.5
    • Added benchmark results for H100 and GB200 GPUs with BF16/FP8 precision configurations
    • Reorganized performance metrics table with expanded model coverage (GRPO, DeepSeek V3, Qwen, LLAMA variants)
    • Added references to YAML recipes for benchmark reproduction


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@coderabbitai Bot (Contributor) commented Jan 21, 2026

📝 Walkthrough

Version label updated from v0.4 to v0.5, recipe references added, and benchmark table restructured with new columns (Algorithm, Model). New benchmark sections introduced for H100 and GB200 with BF16 and FP8 precision variants, along with corresponding performance data entries.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Performance Summary Documentation<br>`docs/about/performance-summary.md` | Version label updated to v0.5; table structure reworked with explicit Algorithm and Model columns; new fields added (Generation, Training, Tokens/sec/GPU, Total Step time); benchmark sections expanded for H100 BF16/FP8 and GB200 BF16/FP8 with system and precision details; data entries updated for GRPO, DeepSeek V3, Qwen, and LLAMA models; note section formatting adjusted. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

documentation, CI:docs, r0.5.0

Suggested reviewers

  • guyueh1
  • snowmanwwg
  • terrykong
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly indicates this is a cherry-pick of a documentation update for v0.5 performance results into the r0.5.0 branch, which aligns with the actual changeset updating performance benchmark data. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | PR contains documentation-only changes to performance summary with benchmark data updates and table restructuring, without code modifications or features affecting numerics/convergence. |


@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/about/performance-summary.md (1)

36-36: Update the outdated recipe link to match v0.5.

This section references r0.4.0 while the page is v0.5. Please update to r0.5.0 (or remove the duplicate link) to avoid conflicting guidance.

🤖 Fix all issues with AI agents
In `@docs/about/performance-summary.md`:
- Around line 54-66: The Training tuple schema is inconsistent: the header shows
five fields "[TP,CP,EP,PP,VPP]" but rows contain varying arity (e.g.
"[1,1,1,1,1,2,n/a]" vs "[1,1,16,16,n/a]"); normalize every Training cell to the
same canonical tuple length and update the header to reflect the true parameter
list (add any missing parameter name such as "VP" or "VPP" as appropriate),
ensuring the order matches across all rows—apply this fix to the three
performance tables (the two tables around the shown diff and the GB200 table)
and make sure example rows like "[1,1,1,1,1,2,n/a]" and "[1,1,16,16,n/a]" are
rewritten to the agreed schema.
🧹 Nitpick comments (1)
docs/about/performance-summary.md (1)

19-19: Minor grammar tweak for clarity.

Consider: “NeMo RL has two training backends… this performance summary currently only shows numbers from the Megatron backend.”

Comment on lines +54 to +66
| Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)|
|--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---|
| GRPO |LLAMA3.1_8B|On policy |4,096 |1,019 |16 |2,048|512 |[1,1] |[1,1,1,1,1,2,n/a] |1,581 | 92.8|
| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,123 |16 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |2,478 | 64.8|
| GRPO |DeepSeek V3|On policy |1,536 |744 |256 |512 |512 |[32,1] |[1,1,16,16,n/a] |12.7 | 134|
| GRPO |DeepSeek V3|1-step Off |1,536 |738 |512 |512 |512 |[32,1] |[1,1,16,16,n/a] |13.1 | 64.9|
| DAPO |DeepSeek V3|On policy |1,536 |974 |512 |512 |512 |[64,1] |[8,4,32,8,n/a] |2.45 | 974|
| GRPO |Qwen3-235B |On policy |8,192 |5,700 |128 |512 |512 |[16,1] |[2,2,16,8,n/a] |54.1 | 431|
| GRPO |Qwen3-235B |1-step Off |8,192 |5,707 |256 |512 |512 |[8,1] |[4,1,16,8,n/a] |58.7 | 203|
| GRPO |Qwen3-30B3A|On policy |4,096 |3,196 |32 |2,048|512 |[2,1] |[1,1,8,1,n/a] |1066 | 198|
| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,201 |32 |2,048|512 |[2,1] |[1,1,8,2,n/a] |1391 | 154|
| GRPO |Qwen3-32B |On policy |4,096 |3,251 |32 |2,048|512 |[4,1] |[4,1,1,4,n/a] |571 | 376|
| GRPO |Qwen3-32B |1-step Off |4,096 |3,252 |64 |2,048|512 |[4,1] |[4,1,1,4,n/a] |538 | 200|

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and view the relevant lines
if [ -f "docs/about/performance-summary.md" ]; then
  echo "=== File found, viewing lines 50-100 ==="
  sed -n '50,100p' docs/about/performance-summary.md
else
  echo "File not found at that path"
  fd --type f 'performance-summary\.md'  # fd takes a regex pattern, not find-style -name
fi

Repository: NVIDIA-NeMo/RL

Length of output: 5673


Fix the Training tuple schema mismatch in the performance tables.

The header lists 5 fields [TP,CP,EP,PP,VPP], but rows vary between 5 and 7 elements. For example, [1,1,1,1,1,2,n/a] has 7 elements while [1,1,16,16,n/a] has 5, creating ambiguity about parameter meanings and order. Normalize all rows to the same arity and ensure the header accurately reflects the schema (including any missing parameters like VP).

This applies to all three performance tables: lines 54–66, 85–96, and the GB200 table.

🤖 Prompt for AI Agents
In `@docs/about/performance-summary.md` around lines 54 - 66, The Training tuple
schema is inconsistent: the header shows five fields "[TP,CP,EP,PP,VPP]" but
rows contain varying arity (e.g. "[1,1,1,1,1,2,n/a]" vs "[1,1,16,16,n/a]");
normalize every Training cell to the same canonical tuple length and update the
header to reflect the true parameter list (add any missing parameter name such
as "VP" or "VPP" as appropriate), ensuring the order matches across all
rows—apply this fix to the three performance tables (the two tables around the
shown diff and the GB200 table) and make sure example rows like
"[1,1,1,1,1,2,n/a]" and "[1,1,16,16,n/a]" are rewritten to the agreed schema.
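This kind of arity drift is easy to catch mechanically before it lands. Below is a minimal lint sketch, not part of the repo: the helper name and the simplified two-row table are illustrative, and it assumes the Training cell is the last bracketed tuple on each row, as in the tables above.

```python
import re

def check_training_arity(markdown_table: str, expected: int) -> list[str]:
    """Return the Training tuples whose element count differs from `expected`."""
    bad = []
    for line in markdown_table.splitlines():
        # Collect all bracketed tuples on the row; per the table layout,
        # the Training cell is the last one (Generation comes first).
        tuples = re.findall(r"\[([^\]]+)\]", line)
        if not tuples:
            continue  # separator or prose line
        training = tuples[-1].split(",")
        if len(training) != expected:
            bad.append("[" + tuples[-1] + "]")
    return bad

# Illustrative rows mirroring the mismatch flagged in the review.
rows = """\
| GRPO | LLAMA3.1_8B | [1,1] | [1,1,1,1,1,2,n/a] |
| GRPO | DeepSeek V3 | [32,1] | [1,1,16,16,n/a] |
"""
print(check_training_arity(rows, expected=5))  # flags the 7-element tuple
```

Run against the real `docs/about/performance-summary.md` tables, any non-empty result indicates rows that still disagree with the `[TP,CP,EP,PP,VPP]` header schema.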

@terrykong terrykong merged commit 42a4c8d into r0.5.0 Jan 21, 2026
57 of 61 checks passed
@terrykong terrykong deleted the cherry-pick-1772-r0.5.0 branch January 21, 2026 06:56
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
…DIA-NeMo#1800)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

Labels

cherry-pick · CI:docs · Run doctest · Documentation (Improvements or additions to documentation) · Run CICD

Projects

None yet


3 participants