Skip to content

perf: Add qwen3 30b-a3b async-8-off recipe#1642

Merged
terrykong merged 1 commit intomainfrom
youngeunk/qwen30b-8off-recipe
Dec 17, 2025
Merged

perf: Add qwen3 30b-a3b async-8-off recipe#1642
terrykong merged 1 commit intomainfrom
youngeunk/qwen30b-8off-recipe

Conversation

@youngeunkwon0405
Copy link
Copy Markdown
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Dec 16, 2025

What does this PR do ?

Wandb report

  • Performance/tokens_per_sec_per_gpu: 1004.741336 (5 step avg at step 5)
  • Performance/tokens_per_sec_per: 192,768 (5 step avg at step 5)

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Tests
    • Added new GRPO performance testing configurations for distributed async training scenarios with scalable parallelism strategies.
    • Introduced automated performance test execution scripts with integrated logging, GPU monitoring, and conditional metrics validation.
    • Expanded performance test suite with new benchmark tests.

✏️ Tip: You can customize this high-level summary in your review settings.

@youngeunkwon0405 youngeunkwon0405 self-assigned this Dec 16, 2025
@youngeunkwon0405 youngeunkwon0405 requested review from a team as code owners December 16, 2025 08:41
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 16, 2025

📝 Walkthrough

Walkthrough

This PR adds a new GRPO async performance configuration and corresponding test script for the Qwen3 30B-A3B model (24 nodes, 8 GPUs per node). It includes a YAML recipe with Megatron-DS-like parallelism settings, vLLM async generation, and a Bash test script that orchestrates the experiment execution with logging, metrics validation, and TensorBoard-to-JSON conversion.

Changes

Cohort / File(s) Summary
GRPO Performance Configuration
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml
New YAML configuration for async GRPO training with maximum trajectory age of 8 steps, importance-sampling correction, Megatron-DS parallelism (tensor=1, pipeline=1, expert=8), vLLM async generation with tensor parallel size 2, GPU memory utilization 0.8, and checkpointing to results directory.
GRPO Performance Test Script
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
New Bash test script that sets up experiment environment (24 nodes, 8 GPUs per node), executes GRPO performance run via uv run with logging and WandB integration, converts TensorBoard logs to JSON, and conditionally validates metrics thresholds.
Test Suite Manifest
tests/test_suites/performance.txt
Registered two new GRPO performance test scripts: grpo-qwen3-30ba3b-4n8g-async-1off.sh and grpo-qwen3-30ba3b-24n8g-async-8off.sh.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Configuration follows established patterns with specific parallelism and resource settings
  • Bash script is straightforward orchestration of standard experiment workflow with conditional metrics validation
  • Changes are consistent with existing GRPO test infrastructure

Possibly related PRs

Suggested labels

Performance, Run CICD

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR adds new performance recipes and test suites without documenting actual test results, performance numbers, or convergence data in the description. Update PR description with actual test results from WandB report including final metrics, convergence behavior, and performance comparisons, and correct identified configuration naming errors.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a GRPO recipe configuration for Qwen3 30B with async-8-off settings.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch youngeunk/qwen30b-8off-recipe

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4fffe0 and 798e0ec.

📒 Files selected for processing (3)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh (1 hunks)
  • tests/test_suites/performance.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml
  • tests/test_suites/performance.txt
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
  • tests/test_suites/performance.txt
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply

Applied to files:

  • tests/test_suites/performance.txt
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (7)
tests/test_suites/performance.txt (1)

15-17: LGTM! Test entries added correctly.

The new performance test script entries follow the existing naming pattern and are properly integrated into the test suite manifest.

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh (3)

5-11: Configuration looks good.

The test configuration follows the standard pattern for the test infrastructure. The variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) are consumed by external launch tooling even though they're not directly referenced in this script.

Based on learnings, these variables are part of the test infrastructure's standard interface.


15-29: Experiment execution follows repository conventions.

The script correctly uses uv run as required by the coding guidelines and follows the established patterns for test scripts in this repository (cd without error handling, argument forwarding with $@).

Based on learnings, the patterns used here are consistent with other test scripts in tests/test_suites/llm/.


31-39: Log conversion and metrics validation implemented correctly.

The conditional metrics check ensures validation only runs when the target step is reached, and the thresholds appear appropriate for the performance test.

examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml (3)

1-6: Async GRPO configuration is correct.

The async GRPO settings with max_trajectory_age_steps: 8 correctly match the "async-8off" designation in the filename.


7-10: Importance sampling and checkpointing configured correctly.

The checkpoint directory name correctly matches the recipe name.


11-26: Policy and generation configuration looks appropriate.

The Megatron parallelism settings and vLLM async engine configuration are suitable for the Qwen3 30B model with async GRPO.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 force-pushed the youngeunk/qwen30b-8off-recipe branch from 798e0ec to f384ecd Compare December 16, 2025 09:06
@youngeunkwon0405 youngeunkwon0405 added the CI:docs Run doctest label Dec 16, 2025
@guyueh1 guyueh1 added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Dec 16, 2025
@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

Hi @terrykong can I ask for a merge for this PR? This contains the 8-off perf recipe we mentioned meeting yesterday.

@terrykong terrykong merged commit 7dd9a01 into main Dec 17, 2025
57 of 58 checks passed
@terrykong terrykong deleted the youngeunk/qwen30b-8off-recipe branch December 17, 2025 06:27
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:docs Run doctest CI:L2 Run doctests, unit tests, functional tests, and convergence tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants