
test: Perf recipe for v0.5 #1661

Closed
guyueh1 wants to merge 4 commits into NVIDIA-NeMo:main from guyueh1:perf_recipe_for_v0.5

Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Dec 19, 2025

What does this PR do ?

Add new performance tests for v0.5

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added example configurations for DeepSeek V3, Llama 3.1, and Qwen3 models with varying cluster sizes and optimization strategies.
    • Introduced FP8 quantization example configurations for enhanced performance.
  • Chores

    • Simplified existing Megatron FP8 configuration by removing unnecessary parameters.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners December 19, 2025 00:07
@coderabbitai
Contributor

coderabbitai Bot commented Dec 19, 2025

📝 Walkthrough

Walkthrough

This PR adds new LLM performance recipe configurations for various model architectures (DeepSeek, Qwen, Llama) across different node and GPU counts, and modifies an existing Megatron FP8 configuration to remove explicit sequence packing and quantization-related settings.

Changes

Cohort / File(s) / Summary

  • Moonlight FP8 configuration cleanup (examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml): Removed the sequence_packing configuration block and FP8 quantization-related settings (expert_parallel_size, quantization_ignored_layer_kws).
  • DeepSeek-v3 performance configs (examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml, grpo-deepseek-v3-64n4g-async-1off.yaml, grpo-deepseek-v3-64n8g-fp8-async-1off.yaml): Added new configuration variants with adjusted pipeline/expert parallelism (pipeline_model_parallel_size=8, expert_model_parallel_size=16), tensor_parallel_size=16 for vllm, checkpoint directories, and FP8 quantization settings for 32-node and 64-node deployments.
  • Llama-3.1-8B performance configs (examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml, grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml): Added megatron_cfg with pipeline_model_parallel_size=1 and FP8 quantization configuration (fp8_cfg, env_vars for block scaling).
  • Qwen-3 performance configs (examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml, grpo-qwen3-235b-32n4g-async-1off.yaml): Added new configuration variants with adjusted pipeline parallelism (pipeline_model_parallel_size=4), tensor_parallel_size=8 for vllm, checkpoint directories, and cluster settings across 16 and 32 nodes with 4 GPUs per node.
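The node/GPU combinations in the DeepSeek-v3 cohort can be sanity-checked in a few lines. This is an illustrative sketch, not project code: the (nodes, gpus_per_node) pairs and the assumption that each variant uses a vllm tensor_parallel_size of 16 come from the summary above.

```python
# Sanity check of the DeepSeek-v3 cohort: total GPU counts and the number
# of vllm generation instances implied by tensor_parallel_size=16.
configs = {
    "grpo-deepseek-v3-32n4g": (32, 4),
    "grpo-deepseek-v3-64n4g-async-1off": (64, 4),
    "grpo-deepseek-v3-64n8g-fp8-async-1off": (64, 8),
}
gen_tp = 16  # vllm tensor_parallel_size per the summary above

for name, (nodes, gpus_per_node) in configs.items():
    total = nodes * gpus_per_node
    # Every cluster size must be divisible by the generation TP degree.
    assert total % gen_tp == 0, f"{name}: {total} GPUs not divisible by TP"
    print(f"{name}: {total} GPUs, {total // gen_tp} generation instances")
```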

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Changes are primarily homogeneous configuration parameter adjustments across multiple files
  • Configuration-only modifications with repetitive patterns (parallelism settings, checkpoint paths, logging configuration)
  • No logic changes or control flow modifications to evaluate
  • Suggested focus: Verify correctness of parallelism arithmetic and consistency of device allocation across node/GPU combinations
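The suggested focus, verifying parallelism arithmetic, largely reduces to a divisibility check. A minimal sketch, under the assumption that data parallelism is whatever remains after the explicit tensor, pipeline, and context dimensions; the helper name and structure are ours, not taken from the repo.

```python
# Divisibility check: the implied data-parallel size must be an integer,
# i.e. total GPUs must be divisible by TP * PP * CP. Expert parallelism is
# handled separately since it applies only to MoE layers.
def check_parallelism(nodes: int, gpus_per_node: int,
                      tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    total = nodes * gpus_per_node
    model_parallel = tp * pp * cp
    if total % model_parallel:
        raise ValueError(
            f"{total} GPUs not divisible by TP*PP*CP = {model_parallel}")
    return total // model_parallel  # implied data-parallel size

# Example: the qwen3-235b-16n4g settings discussed later in this review.
print(check_parallelism(16, 4, tp=2, pp=4, cp=2))  # 4
```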

Possibly related PRs

Suggested labels

CI:docs, Run CICD

Suggested reviewers

  • terrykong
  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Test Results For Major Changes: ⚠️ Warning. The PR adds multiple new performance recipe configurations for v0.5 (major changes), but the PR description contains no test results, performance measurements, or validation evidence. Resolution: add test results to the PR description demonstrating that each configuration runs successfully, along with performance measurements/benchmarks, confirmation of no regressions, and validation context.
  • Title check: ❓ Inconclusive. The title "test: Perf recipe for v0.5" is only partially related to the changeset: "Perf recipe" aligns with the performance recipe configurations being added, but the "test:" prefix is misleading since these are configuration files, not test files. Consider revising the title to better reflect the changes, e.g. "Add performance recipe configurations for v0.5" or "perf: Add performance recipe configs for v0.5".
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate, so the docstring coverage check was skipped.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1)

17-20: LGTM! Logger configuration maintains excellent path consistency.

The log directory and wandb run name consistently use the same identifier as the checkpoint directory and filename, making it easy to track and correlate outputs across different components.

Optional: Add newline at end of file.

Consider adding a trailing newline at the end of the file, as this is a common convention in many style guides.

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1)

11-13: Consider single-node tensor parallelism for generation efficiency.

Setting tensor_parallel_size: 8 with 4 GPUs per node means each vllm generation instance spans 2 nodes, which introduces inter-node communication overhead. For potentially better generation performance, consider tensor_parallel_size: 4 to keep each instance within a single node.

However, if this configuration is specifically designed to test cross-node generation performance, the current setting is appropriate.
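The trade-off described here is easy to quantify. A minimal sketch (the helper is illustrative, not part of the repo):

```python
import math

def nodes_per_instance(tensor_parallel_size: int, gpus_per_node: int) -> int:
    """Number of nodes a single generation instance spans."""
    return math.ceil(tensor_parallel_size / gpus_per_node)

# TP=8 on 4-GPU nodes: each vllm instance spans 2 nodes, so tensor-parallel
# all-reduces cross the network; TP=4 keeps an instance within one node.
print(nodes_per_instance(8, 4))  # 2
print(nodes_per_instance(4, 4))  # 1
```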

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48dbb37 and c35d0f4.

📒 Files selected for processing (8)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (2)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (10)
examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1)

1-1: Parent configuration file exists and inheritance is valid.

Verification confirms that the parent file ./grpo-qwen3-235b-32n8g-async-1off.yaml exists in the same directory and the inheritance chain is properly structured. The file follows the expected naming conventions for NVIDIA NeMo RL performance recipe variants.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (2)

2-16: LGTM! FP8 configuration is well-structured for performance testing.

The FP8 configuration appropriately sets up both Megatron training (e4m3 format, blockwise recipe) and VLLM inference (fp8 precision) with consistent settings. The checkpoint directory naming matches the configuration filename, maintaining good organization.


1-1: Base configuration file exists and is correctly referenced in the repository at the same directory level.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)

12-13: The parallelism configuration is valid and intentionally designed for this async GRPO variant. The base config uses pipeline_model_parallel_size: 2, which the async-1off variant deliberately overrides to 1. With no tensor parallelism (tensor_model_parallel_size implicitly 1), this results in 16-way data parallelism—a standard and efficient strategy for training the Llama 3.1 8B model across 16 GPUs. The complete parallelism strategy is properly inherited from the base config and intentionally modified for the async variant.
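The data-parallel accounting described above can be written out directly. A sketch under the comment's assumptions (PP overridden to 1, TP implicitly 1); variable names are illustrative.

```python
# llama3.1-8b 2n8g async variant: with pipeline_model_parallel_size
# overridden to 1 and tensor parallelism implicitly 1, all GPUs go to
# data parallelism.
nodes, gpus_per_node = 2, 8
tp, pp = 1, 1  # async-1off override: pipeline_model_parallel_size: 1
dp = (nodes * gpus_per_node) // (tp * pp)
assert dp == 16  # 16-way data parallelism across 16 GPUs
```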

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1)

1-21: LGTM!

The FP8 configuration is well-structured and consistent. The file properly inherits from the base async-1off configuration and overlays FP8-specific settings. All paths (checkpoint_dir, log_dir, wandb.name) consistently use the "fp8" identifier matching the filename.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)

1-21: LGTM!

The configuration properly adapts the 32n8g setup for 4 GPUs per node. The parallelism settings are appropriate for the 128 total GPUs (32 nodes × 4 GPUs), and all identifiers consistently use "32n4g" across paths.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1)

1-16: Configuration structure looks good aside from logger naming.

The parallelism settings, cluster topology, and other configurations are appropriate for a 64-node, 4-GPU-per-node setup. Once the logger naming inconsistency is fixed, this configuration will be complete and consistent.

Also applies to: 20-22

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (3)

2-3: LGTM!

The checkpoint directory, logging paths, wandb configuration, and cluster settings are all consistent with the filename pattern and correctly specify 16 nodes with 4 GPUs per node (64 total GPUs).

Also applies to: 14-20


4-13: Complete parallelism configuration is correct for 64 GPUs.

The 16n4g configuration properly inherits and overrides settings from the 16n8g base:

  • Training: TP=2, PP=4, CP=2 (inherited), EP=16 (inherited, MoE only), resulting in implicit DP=4. Calculation: 4 × 2 × 4 × 2 = 64 GPUs.
  • Generation: tensor_parallel_size: 8 (halved from 16) means 64/8=8 vllm instances, correctly scaled for the reduced cluster size.

The configuration is mathematically sound. EP applies only to MoE layers while DP applies to non-MoE layers (per MoE design), so the full parallelism equation accounts for all GPU allocation.
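That accounting can be reproduced in a few lines. This sketch follows the comment's reasoning, under which EP reuses existing ranks for MoE layers and is not an extra multiplicative factor:

```python
# qwen3-235b-16n4g per the review: DP is implied by total / (TP * PP * CP);
# EP=16 applies only to MoE layers and is not counted as an extra factor.
total = 16 * 4                 # 16 nodes x 4 GPUs
tp, pp, cp = 2, 4, 2
dp = total // (tp * pp * cp)
assert dp == 4 and dp * tp * pp * cp == 64

# Generation side: tensor_parallel_size 8 per vllm instance -> 8 instances.
gen_tp = 8
assert total // gen_tp == 8
```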


1-1: Fix critical parallelism configuration incompatibility.

The base configuration file exists and is correctly referenced. However, the parallelism settings are incompatible with the reduced GPU count. The reviewed file inherits tensor_model_parallel_size: 2, context_parallel_size: 2, and expert_model_parallel_size: 16 from the base config, which requires 256 GPU slots (2 × 2 × 16 × 4 = 256). With only 64 GPUs available (16 nodes × 4 GPUs), these parallelism parameters must be reduced proportionally. Override tensor_model_parallel_size, context_parallel_size, and expert_model_parallel_size so that their product, together with pipeline parallelism, is 64 or lower.

⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
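The skipped comment's arithmetic can be checked directly. This sketch follows that comment's accounting, in which TP, CP, EP, and PP are all treated as independent multiplicative factors (note that an earlier comment on the same file disputes counting EP this way):

```python
# Required GPU slots under the skipped comment's accounting, using the
# values it says are inherited from the 16n8g base config.
tp, cp, ep, pp = 2, 2, 16, 4
required = tp * cp * ep * pp
available = 16 * 4           # 16 nodes x 4 GPUs
print(required, available)
assert required > available  # settings do not fit under this accounting
```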

Comment on lines +17 to +19:

  log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
  wandb:
    name: grpo-deepseek-v3-64n4g-async-32T32G-1off
Contributor


⚠️ Potential issue | 🟠 Major

Fix inconsistent naming in logger configuration.

The log_dir and wandb.name use "32T32G" but this configuration is for 64 nodes with 4 GPUs per node (64n4g), not 32×32. This is inconsistent with:

  • The filename: grpo-deepseek-v3-64n4g-async-1off.yaml
  • The checkpoint_dir: results/grpo-deepseek-v3-64n4g-async-1off
  • The actual cluster topology: 64 nodes × 4 GPUs = 256 GPUs

This appears to be a copy-paste error that will cause confusion when tracking experiments and logs.

🔎 Proposed fix
 logger:
-  log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
+  log_dir: logs/grpo-deepseek-v3-64n4g-async-1off
   wandb:
-    name: grpo-deepseek-v3-64n4g-async-32T32G-1off
+    name: grpo-deepseek-v3-64n4g-async-1off

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner December 19, 2025 18:12
@guyueh1 guyueh1 changed the title from "feat: Perf recipe for v0.5" to "test: Perf recipe for v0.5" on Dec 19, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1
Contributor Author

guyueh1 commented Dec 19, 2025

Closing; duplicate of #1667.
