
test: Perf recipe for v0.5 #1661

Closed
guyueh1 wants to merge 4 commits into NVIDIA-NeMo:main from guyueh1:perf_recipe_for_v0.5

Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Dec 19, 2025

What does this PR do ?

Add new performance tests for v0.5

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added example configurations for DeepSeek V3, Llama 3.1, and Qwen3 models with varying cluster sizes and optimization strategies.
    • Introduced FP8 quantization example configurations for enhanced performance.
  • Chores

    • Simplified existing Megatron FP8 configuration by removing unnecessary parameters.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners December 19, 2025 00:07
@coderabbitai
Contributor

coderabbitai Bot commented Dec 19, 2025

📝 Walkthrough

Walkthrough

This PR adds new LLM performance recipe configurations for various model architectures (DeepSeek, Qwen, Llama) across different node and GPU counts, and modifies an existing Megatron FP8 configuration to remove explicit sequence packing and quantization-related settings.

Changes

Cohort / File(s) / Summary

  • Moonlight FP8 configuration cleanup (examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml): Removed the sequence_packing configuration block and FP8 quantization-related settings (expert_parallel_size, quantization_ignored_layer_kws).
  • DeepSeek-v3 performance configs (examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml, grpo-deepseek-v3-64n4g-async-1off.yaml, grpo-deepseek-v3-64n8g-fp8-async-1off.yaml): Added new configuration variants with adjusted pipeline/expert parallelism (pipeline_model_parallel_size=8, expert_model_parallel_size=16), tensor_parallel_size=16 for vllm, checkpoint directories, and FP8 quantization settings for 32-node and 64-node deployments.
  • Llama-3.1-8B performance configs (examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml, grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml): Added megatron_cfg with pipeline_model_parallel_size=1 and FP8 quantization configuration (fp8_cfg, env_vars for block scaling).
  • Qwen-3 performance configs (examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml, grpo-qwen3-235b-32n4g-async-1off.yaml): Added new configuration variants with adjusted pipeline parallelism (pipeline_model_parallel_size=4), tensor_parallel_size=8 for vllm, checkpoint directories, and cluster settings across 16 and 32 nodes with 4 GPUs per node.
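The node/GPU combinations in the DeepSeek-v3 cohort can be sanity-checked in a few lines. This is an illustrative sketch, not project code: the (nodes, gpus_per_node) pairs and the assumption that each variant uses a vllm tensor_parallel_size of 16 come from the summary above.

```python
# Sanity check of the DeepSeek-v3 cohort: total GPU counts and the number
# of vllm generation instances implied by tensor_parallel_size=16.
configs = {
    "grpo-deepseek-v3-32n4g": (32, 4),
    "grpo-deepseek-v3-64n4g-async-1off": (64, 4),
    "grpo-deepseek-v3-64n8g-fp8-async-1off": (64, 8),
}
gen_tp = 16  # vllm tensor_parallel_size per the summary above

for name, (nodes, gpus_per_node) in configs.items():
    total = nodes * gpus_per_node
    # Every cluster size must be divisible by the generation TP degree.
    assert total % gen_tp == 0, f"{name}: {total} GPUs not divisible by TP"
    print(f"{name}: {total} GPUs, {total // gen_tp} generation instances")
```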

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Changes are primarily homogeneous configuration parameter adjustments across multiple files
  • Configuration-only modifications with repetitive patterns (parallelism settings, checkpoint paths, logging configuration)
  • No logic changes or control flow modifications to evaluate
  • Suggested focus: Verify correctness of parallelism arithmetic and consistency of device allocation across node/GPU combinations
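The suggested focus, verifying parallelism arithmetic, largely reduces to a divisibility check. A minimal sketch, under the assumption that data parallelism is whatever remains after the explicit tensor, pipeline, and context dimensions; the helper name and structure are ours, not taken from the repo.

```python
# Divisibility check: the implied data-parallel size must be an integer,
# i.e. total GPUs must be divisible by TP * PP * CP. Expert parallelism is
# handled separately since it applies only to MoE layers.
def check_parallelism(nodes: int, gpus_per_node: int,
                      tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    total = nodes * gpus_per_node
    model_parallel = tp * pp * cp
    if total % model_parallel:
        raise ValueError(
            f"{total} GPUs not divisible by TP*PP*CP = {model_parallel}")
    return total // model_parallel  # implied data-parallel size

# Example: the qwen3-235b-16n4g settings discussed later in this review.
print(check_parallelism(16, 4, tp=2, pp=4, cp=2))  # 4
```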

Possibly related PRs

Suggested labels

CI:docs, Run CICD

Suggested reviewers

  • terrykong
  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Test Results For Major Changes: ⚠️ Warning. The PR adds multiple new performance recipe configurations for v0.5 (major changes), but the PR description contains no test results, performance measurements, or validation evidence. Resolution: add test results to the PR description demonstrating that each configuration runs successfully, along with performance measurements/benchmarks, confirmation of no regressions, and validation context.
  • Title check: ❓ Inconclusive. The title "test: Perf recipe for v0.5" is only partially related to the changeset: "Perf recipe" aligns with the performance recipe configurations being added, but the "test:" prefix is misleading since these are configuration files, not test files. Consider revising the title to better reflect the changes, e.g. "Add performance recipe configurations for v0.5" or "perf: Add performance recipe configs for v0.5".
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate, so the docstring coverage check was skipped.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1)

17-20: LGTM! Logger configuration maintains excellent path consistency.

The log directory and wandb run name consistently use the same identifier as the checkpoint directory and filename, making it easy to track and correlate outputs across different components.

Optional: Add newline at end of file.

Consider adding a trailing newline at the end of the file, as this is a common convention in many style guides.

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1)

11-13: Consider single-node tensor parallelism for generation efficiency.

Setting tensor_parallel_size: 8 with 4 GPUs per node means each vllm generation instance spans 2 nodes, which introduces inter-node communication overhead. For potentially better generation performance, consider tensor_parallel_size: 4 to keep each instance within a single node.

However, if this configuration is specifically designed to test cross-node generation performance, the current setting is appropriate.
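The trade-off described here is easy to quantify. A minimal sketch (the helper is illustrative, not part of the repo):

```python
import math

def nodes_per_instance(tensor_parallel_size: int, gpus_per_node: int) -> int:
    """Number of nodes a single generation instance spans."""
    return math.ceil(tensor_parallel_size / gpus_per_node)

# TP=8 on 4-GPU nodes: each vllm instance spans 2 nodes, so tensor-parallel
# all-reduces cross the network; TP=4 keeps an instance within one node.
print(nodes_per_instance(8, 4))  # 2
print(nodes_per_instance(4, 4))  # 1
```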

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48dbb37 and c35d0f4.

📒 Files selected for processing (8)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (2)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (10)
examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1)

1-1: Parent configuration file exists and inheritance is valid.

Verification confirms that the parent file ./grpo-qwen3-235b-32n8g-async-1off.yaml exists in the same directory and the inheritance chain is properly structured. The file follows the expected naming conventions for NVIDIA NeMo RL performance recipe variants.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (2)

2-16: LGTM! FP8 configuration is well-structured for performance testing.

The FP8 configuration appropriately sets up both Megatron training (e4m3 format, blockwise recipe) and VLLM inference (fp8 precision) with consistent settings. The checkpoint directory naming matches the configuration filename, maintaining good organization.


1-1: Base configuration file exists and is correctly referenced in the repository at the same directory level.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)

12-13: The parallelism configuration is valid and intentionally designed for this async GRPO variant. The base config uses pipeline_model_parallel_size: 2, which the async-1off variant deliberately overrides to 1. With no tensor parallelism (tensor_model_parallel_size implicitly 1), this results in 16-way data parallelism—a standard and efficient strategy for training the Llama 3.1 8B model across 16 GPUs. The complete parallelism strategy is properly inherited from the base config and intentionally modified for the async variant.
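The data-parallel accounting described above can be written out directly. A sketch under the comment's assumptions (PP overridden to 1, TP implicitly 1); variable names are illustrative.

```python
# llama3.1-8b 2n8g async variant: with pipeline_model_parallel_size
# overridden to 1 and tensor parallelism implicitly 1, all GPUs go to
# data parallelism.
nodes, gpus_per_node = 2, 8
tp, pp = 1, 1  # async-1off override: pipeline_model_parallel_size: 1
dp = (nodes * gpus_per_node) // (tp * pp)
assert dp == 16  # 16-way data parallelism across 16 GPUs
```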

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1)

1-21: LGTM!

The FP8 configuration is well-structured and consistent. The file properly inherits from the base async-1off configuration and overlays FP8-specific settings. All paths (checkpoint_dir, log_dir, wandb.name) consistently use the "fp8" identifier matching the filename.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)

1-21: LGTM!

The configuration properly adapts the 32n8g setup for 4 GPUs per node. The parallelism settings are appropriate for the 128 total GPUs (32 nodes × 4 GPUs), and all identifiers consistently use "32n4g" across paths.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1)

1-16: Configuration structure looks good aside from logger naming.

The parallelism settings, cluster topology, and other configurations are appropriate for a 64-node, 4-GPU-per-node setup. Once the logger naming inconsistency is fixed, this configuration will be complete and consistent.

Also applies to: 20-22

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (3)

2-3: LGTM!

The checkpoint directory, logging paths, wandb configuration, and cluster settings are all consistent with the filename pattern and correctly specify 16 nodes with 4 GPUs per node (64 total GPUs).

Also applies to: 14-20


4-13: Complete parallelism configuration is correct for 64 GPUs.

The 16n4g configuration properly inherits and overrides settings from the 16n8g base:

  • Training: TP=2, PP=4, CP=2 (inherited), EP=16 (inherited, MoE only), resulting in implicit DP=4. Calculation: 4 × 2 × 4 × 2 = 64 GPUs.
  • Generation: tensor_parallel_size: 8 (halved from 16) means 64/8=8 vllm instances, correctly scaled for the reduced cluster size.

The configuration is mathematically sound. EP applies only to MoE layers while DP applies to non-MoE layers (per MoE design), so the full parallelism equation accounts for all GPU allocation.
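That accounting can be reproduced in a few lines. This sketch follows the comment's reasoning, under which EP reuses existing ranks for MoE layers and is not an extra multiplicative factor:

```python
# qwen3-235b-16n4g per the review: DP is implied by total / (TP * PP * CP);
# EP=16 applies only to MoE layers and is not counted as an extra factor.
total = 16 * 4                 # 16 nodes x 4 GPUs
tp, pp, cp = 2, 4, 2
dp = total // (tp * pp * cp)
assert dp == 4 and dp * tp * pp * cp == 64

# Generation side: tensor_parallel_size 8 per vllm instance -> 8 instances.
gen_tp = 8
assert total // gen_tp == 8
```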


1-1: Fix critical parallelism configuration incompatibility.

The base configuration file exists and is correctly referenced. However, the parallelism settings are incompatible with the reduced GPU count. The reviewed file inherits tensor_model_parallel_size: 2, context_parallel_size: 2, and expert_model_parallel_size: 16 from the base config, which requires 256 GPU slots (2 × 2 × 16 × 4 = 256). With only 64 GPUs available (16 nodes × 4 GPUs), these parallelism parameters must be reduced proportionally. Override tensor_model_parallel_size, context_parallel_size, and expert_model_parallel_size so that their product, together with pipeline parallelism, is 64 or lower.

⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
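The skipped comment's arithmetic can be checked directly. This sketch follows that comment's accounting, in which TP, CP, EP, and PP are all treated as independent multiplicative factors (note that an earlier comment on the same file disputes counting EP this way):

```python
# Required GPU slots under the skipped comment's accounting, using the
# values it says are inherited from the 16n8g base config.
tp, cp, ep, pp = 2, 2, 16, 4
required = tp * cp * ep * pp
available = 16 * 4           # 16 nodes x 4 GPUs
print(required, available)
assert required > available  # settings do not fit under this accounting
```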

Comment on lines +17 to +19:

  log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
  wandb:
    name: grpo-deepseek-v3-64n4g-async-32T32G-1off
Contributor


⚠️ Potential issue | 🟠 Major

Fix inconsistent naming in logger configuration.

The log_dir and wandb.name use "32T32G" but this configuration is for 64 nodes with 4 GPUs per node (64n4g), not 32×32. This is inconsistent with:

  • The filename: grpo-deepseek-v3-64n4g-async-1off.yaml
  • The checkpoint_dir: results/grpo-deepseek-v3-64n4g-async-1off
  • The actual cluster topology: 64 nodes × 4 GPUs = 256 GPUs

This appears to be a copy-paste error that will cause confusion when tracking experiments and logs.

🔎 Proposed fix
 logger:
-  log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
+  log_dir: logs/grpo-deepseek-v3-64n4g-async-1off
   wandb:
-    name: grpo-deepseek-v3-64n4g-async-32T32G-1off
+    name: grpo-deepseek-v3-64n4g-async-1off

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner December 19, 2025 18:12
@guyueh1 guyueh1 changed the title from "feat: Perf recipe for v0.5" to "test: Perf recipe for v0.5" on Dec 19, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1
Contributor Author

guyueh1 commented Dec 19, 2025

Closing; duplicate of #1667.
