perf: Add qwen3 30b-a3b async-8-off recipe#1642
Conversation
📝 WalkthroughWalkthroughThis PR adds a new GRPO async performance configuration and corresponding test script for the Qwen3 30B-A3B model (24 nodes, 8 GPUs per node). It includes a YAML recipe with Megatron-DS-like parallelism settings, vLLM async generation, and a Bash test script that orchestrates the experiment execution with logging, metrics validation, and TensorBoard-to-JSON conversion. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml(1 hunks)tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh(1 hunks)tests/test_suites/performance.txt(1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
tests/test_suites/**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.shexamples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yamltests/test_suites/performance.txt
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.shtests/test_suites/performance.txt
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
Applied to files:
tests/test_suites/performance.txt
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (7)
tests/test_suites/performance.txt (1)
15-17: LGTM! Test entries added correctly.The new performance test script entries follow the existing naming pattern and are properly integrated into the test suite manifest.
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh (3)
5-11: Configuration looks good.The test configuration follows the standard pattern for the test infrastructure. The variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) are consumed by external launch tooling even though they're not directly referenced in this script.
Based on learnings, these variables are part of the test infrastructure's standard interface.
15-29: Experiment execution follows repository conventions.The script correctly uses
uv runas required by the coding guidelines and follows the established patterns for test scripts in this repository (cd without error handling, argument forwarding with $@).Based on learnings, the patterns used here are consistent with other test scripts in tests/test_suites/llm/.
31-39: Log conversion and metrics validation implemented correctly.The conditional metrics check ensures validation only runs when the target step is reached, and the thresholds appear appropriate for the performance test.
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml (3)
1-6: Async GRPO configuration is correct.The async GRPO settings with
max_trajectory_age_steps: 8correctly match the "async-8off" designation in the filename.
7-10: Importance sampling and checkpointing configured correctly.The checkpoint directory name correctly matches the recipe name.
11-26: Policy and generation configuration looks appropriate.The Megatron parallelism settings and vLLM async engine configuration are suitable for the Qwen3 30B model with async GRPO.
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
798e0ec to
f384ecd
Compare
|
Hi @terrykong can I ask for a merge for this PR? This contains the 8-off perf recipe we mentioned meeting yesterday. |
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
What does this PR do ?
Wandb report
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.