Signed-off-by: Guyue Huang <guyueh@nvidia.com>
📝 Walkthrough

This PR adds new LLM performance recipe configurations for various model architectures (DeepSeek, Qwen, Llama) across different node and GPU counts, and modifies an existing Megatron FP8 configuration to remove explicit sequence packing and quantization-related settings.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1)
17-20: LGTM! Logger configuration maintains excellent path consistency.

The log directory and wandb run name consistently use the same identifier as the checkpoint directory and filename, making it easy to track and correlate outputs across different components.
Optional: Add newline at end of file.
Consider adding a trailing newline at the end of the file, as this is a common convention in many style guides.
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1)
11-13: Consider single-node tensor parallelism for generation efficiency.

Setting `tensor_parallel_size: 8` with 4 GPUs per node means each vLLM generation instance spans 2 nodes, which introduces inter-node communication overhead. For potentially better generation performance, consider `tensor_parallel_size: 4` to keep each instance within a single node. However, if this configuration is specifically designed to test cross-node generation performance, the current setting is appropriate.
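The reviewer's point can be made concrete with a small helper (illustrative only, not part of the NeMo RL codebase): ceiling-dividing the generation tensor-parallel size by the GPUs per node gives the number of nodes one vLLM instance occupies.

```python
def nodes_spanned(tensor_parallel_size: int, gpus_per_node: int) -> int:
    """How many nodes one generation instance occupies (ceiling division)."""
    return -(-tensor_parallel_size // gpus_per_node)

# TP=8 on 4-GPU nodes crosses a node boundary (inter-node NCCL per instance)
assert nodes_spanned(8, 4) == 2
# TP=4 keeps each instance on a single node
assert nodes_spanned(4, 4) == 1
```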
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
💤 Files with no reviewable changes (1)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (2)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
Applied to files:
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.
Applied to files:
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Applied to files:
examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (10)
examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1)
1-1: Parent configuration file exists and inheritance is valid.

Verification confirms that the parent file `./grpo-qwen3-235b-32n8g-async-1off.yaml` exists in the same directory and the inheritance chain is properly structured. The file follows the expected naming conventions for NVIDIA NeMo RL performance recipe variants.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (2)
2-16: LGTM! FP8 configuration is well-structured for performance testing.

The FP8 configuration appropriately sets up both Megatron training (e4m3 format, blockwise recipe) and vLLM inference (fp8 precision) with consistent settings. The checkpoint directory naming matches the configuration filename, maintaining good organization.
1-1: Base configuration file exists and is correctly referenced in the repository at the same directory level.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)
12-13: The parallelism configuration is valid and intentionally designed for this async GRPO variant.

The base config uses `pipeline_model_parallel_size: 2`, which the async-1off variant deliberately overrides to `1`. With no tensor parallelism (`tensor_model_parallel_size` implicitly 1), this results in 16-way data parallelism, a standard and efficient strategy for training the Llama 3.1 8B model across 16 GPUs. The complete parallelism strategy is properly inherited from the base config and intentionally modified for the async variant.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1)
1-21: LGTM!

The FP8 configuration is well-structured and consistent. The file properly inherits from the base async-1off configuration and overlays FP8-specific settings. All paths (checkpoint_dir, log_dir, wandb.name) consistently use the "fp8" identifier matching the filename.
examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)
1-21: LGTM!

The configuration properly adapts the 32n8g setup for 4 GPUs per node. The parallelism settings are appropriate for the 128 total GPUs (32 nodes × 4 GPUs), and all identifiers consistently use "32n4g" across paths.
examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1)
1-16: Configuration structure looks good aside from logger naming.

The parallelism settings, cluster topology, and other configurations are appropriate for a 64-node, 4-GPU-per-node setup. Once the logger naming inconsistency is fixed, this configuration will be complete and consistent.
Also applies to: 20-22
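Several of the parallelism comments in this review reduce to the same arithmetic: the implicit data-parallel degree is the cluster size divided by the product of the explicit parallelism settings. A minimal illustrative helper (the function name is hypothetical, not a NeMo RL API):

```python
def implicit_data_parallel(total_gpus: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Data-parallel degree left over after TP/PP/CP claim their GPU slots."""
    model_parallel = tp * pp * cp
    if total_gpus % model_parallel:
        raise ValueError("tp * pp * cp must divide the total GPU count")
    return total_gpus // model_parallel

# Llama 3.1 8B async variant: 2 nodes x 8 GPUs, PP overridden to 1 -> 16-way DP
assert implicit_data_parallel(16, pp=1) == 16
# With the base config's PP=2, the same cluster would run 8-way DP
assert implicit_data_parallel(16, pp=2) == 8
```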
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (3)
2-3: LGTM!

The checkpoint directory, logging paths, wandb configuration, and cluster settings are all consistent with the filename pattern and correctly specify 16 nodes with 4 GPUs per node (64 total GPUs).
Also applies to: 14-20
4-13: Complete parallelism configuration is correct for 64 GPUs.

The 16n4g configuration properly inherits and overrides settings from the 16n8g base:

- Training: TP=2, PP=4, CP=2 (inherited), EP=16 (inherited, MoE only), resulting in implicit DP=4. Calculation: 4 × 2 × 4 × 2 = 64 GPUs.
- Generation: `tensor_parallel_size: 8` (halved from 16) means 64/8 = 8 vLLM instances, correctly scaled for the reduced cluster size.

The configuration is mathematically sound. EP applies only to MoE layers while DP applies to non-MoE layers (per MoE design), so the full parallelism equation accounts for all GPU allocation.
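The arithmetic above can be bundled into one sanity check (a sketch with hypothetical names, not repo code): the training dimensions must multiply out to the cluster size, and the generation TP size must divide it evenly.

```python
def check_topology(total_gpus: int, dp: int, tp: int, pp: int, cp: int, gen_tp: int) -> int:
    """Validate a recipe's parallelism and return the vLLM instance count."""
    assert dp * tp * pp * cp == total_gpus, "training parallelism != cluster size"
    assert total_gpus % gen_tp == 0, "generation TP must divide the cluster"
    return total_gpus // gen_tp

# Qwen3-235B 16n4g: DP=4, TP=2, PP=4, CP=2 on 64 GPUs, generation TP=8
assert check_topology(64, dp=4, tp=2, pp=4, cp=2, gen_tp=8) == 8
```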
1-1: Fix critical parallelism configuration incompatibility.

The base configuration file exists and is correctly referenced. However, the parallelism settings are incompatible with the reduced GPU count. The reviewed file inherits `tensor_model_parallel_size: 2`, `context_parallel_size: 2`, and `expert_model_parallel_size: 16` from the base config, which requires 256 GPU slots (2 × 2 × 16 × 4 = 256). With only 64 GPUs available (16 nodes × 4 GPUs), these parallelism parameters must be reduced proportionally. Override `tensor_model_parallel_size`, `context_parallel_size`, and `expert_model_parallel_size` to values whose product fits within the 64 available GPUs.

⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
```yaml
    log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
    wandb:
      name: grpo-deepseek-v3-64n4g-async-32T32G-1off
```
Fix inconsistent naming in logger configuration.
The log_dir and wandb.name use "32T32G" but this configuration is for 64 nodes with 4 GPUs per node (64n4g), not 32×32. This is inconsistent with:

- The filename: grpo-deepseek-v3-64n4g-async-1off.yaml
- The checkpoint_dir: results/grpo-deepseek-v3-64n4g-async-1off
- The actual cluster topology: 64 nodes × 4 GPUs = 256 GPUs
This appears to be a copy-paste error that will cause confusion when tracking experiments and logs.
🔎 Proposed fix

```diff
 logger:
-  log_dir: logs/grpo-deepseek-v3-64n4g-async-32T32G-1off
+  log_dir: logs/grpo-deepseek-v3-64n4g-async-1off
   wandb:
-    name: grpo-deepseek-v3-64n4g-async-32T32G-1off
+    name: grpo-deepseek-v3-64n4g-async-1off
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
logger:
  log_dir: logs/grpo-deepseek-v3-64n4g-async-1off
  wandb:
    name: grpo-deepseek-v3-64n4g-async-1off
```
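The consistency rule the bot is enforcing (filename stem, checkpoint_dir, log_dir, and wandb name should all share one identifier) can be sketched as a small check; the helper name is hypothetical, not part of the repository.

```python
from pathlib import Path

def naming_mismatches(config_file: str, **paths: str) -> list:
    """Return keys whose final path component differs from the recipe's filename stem."""
    stem = Path(config_file).stem
    return sorted(k for k, v in paths.items() if Path(v).name != stem)

# The flagged recipe: log_dir and wandb name carry a stray "32T32G" token
bad = naming_mismatches(
    "grpo-deepseek-v3-64n4g-async-1off.yaml",
    checkpoint_dir="results/grpo-deepseek-v3-64n4g-async-1off",
    log_dir="logs/grpo-deepseek-v3-64n4g-async-32T32G-1off",
    wandb_name="grpo-deepseek-v3-64n4g-async-32T32G-1off",
)
assert bad == ["log_dir", "wandb_name"]
```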
Close, duplicate of #1667.
What does this PR do?
Add new performance tests for v0.5
Issues
List issues that this PR closes (syntax):
Usage
```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Chores