perf: Add qwen3 30b-a3b async-8-off recipe by youngeunkwon0405 · Pull Request #1642 · NVIDIA-NeMo/RL

youngeunkwon0405 · 2025-12-16T08:41:42Z

What does this PR do?

Performance/tokens_per_sec_per_gpu: 1004.741336 (5 step avg at step 5)
Performance/tokens_per_sec_per: 192,768 (5 step avg at step 5)

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Summary by CodeRabbit

Tests
- Added new GRPO performance testing configurations for distributed async training scenarios with scalable parallelism strategies.
- Introduced automated performance test execution scripts with integrated logging, GPU monitoring, and conditional metrics validation.
- Expanded performance test suite with new benchmark tests.

coderabbitai · 2025-12-16T08:50:22Z

Walkthrough

This PR adds a new GRPO async performance configuration and corresponding test script for the Qwen3 30B-A3B model (24 nodes, 8 GPUs per node). It includes a YAML recipe with Megatron-DS-like parallelism settings, vLLM async generation, and a Bash test script that orchestrates the experiment execution with logging, metrics validation, and TensorBoard-to-JSON conversion.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GRPO Performance Configuration `examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml` | New YAML configuration for async GRPO training with maximum trajectory age of 8 steps, importance-sampling correction, Megatron-DS parallelism (tensor=1, pipeline=1, expert=8), vLLM async generation with tensor parallel size 2, GPU memory utilization 0.8, and checkpointing to results directory. |
| GRPO Performance Test Script `tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh` | New Bash test script that sets up experiment environment (24 nodes, 8 GPUs per node), executes GRPO performance run via `uv run` with logging and WandB integration, converts TensorBoard logs to JSON, and conditionally validates metrics thresholds. |
| Test Suite Manifest `tests/test_suites/performance.txt` | Registered two new GRPO performance test scripts: `grpo-qwen3-30ba3b-4n8g-async-1off.sh` and `grpo-qwen3-30ba3b-24n8g-async-8off.sh`. |
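
To make the settings in the table above easier to scan, here is a minimal Python sketch that collects them in a nested dict and flattens them into dotted `key=value` strings. Only the values (24 nodes, 8 GPUs per node, trajectory age 8, tensor=1, pipeline=1, expert=8, vLLM tensor parallel size 2, GPU memory utilization 0.8) come from the summary. Apart from `generation.vllm_cfg.tensor_parallel_size`, which is named under the related PRs, every key name is an assumption for illustration and may not match the actual YAML.

```python
# Illustrative only: key names are hypothetical unless noted; values are taken
# from the CodeRabbit summary of the recipe.
NUM_NODES = 24
GPUS_PER_NODE = 8

recipe_sketch = {
    "cluster": {"num_nodes": NUM_NODES, "gpus_per_node": GPUS_PER_NODE},
    "grpo": {"async_grpo": {"enabled": True, "max_trajectory_age_steps": 8}},
    "loss_fn": {"use_importance_sampling_correction": True},
    "policy": {
        "megatron_cfg": {
            "tensor_model_parallel_size": 1,
            "pipeline_model_parallel_size": 1,
            "expert_model_parallel_size": 8,
        }
    },
    "generation": {
        "vllm_cfg": {
            "async_engine": True,
            "tensor_parallel_size": 2,  # key named in the related-PR notes
            "gpu_memory_utilization": 0.8,
        }
    },
}


def flatten(cfg, prefix=""):
    """Yield (dotted_key, value) pairs for a nested dict."""
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, dotted)
        else:
            yield dotted, value


total_gpus = NUM_NODES * GPUS_PER_NODE
overrides = [f"{key}={value}" for key, value in flatten(recipe_sketch)]

if __name__ == "__main__":
    print(f"total GPUs: {total_gpus}")
    for line in overrides:
        print(line)
```

The flattened form is convenient for comparing two recipes line by line, for example this one against the `4n8g-async-1off` variant listed in the manifest.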

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Configuration follows established patterns with specific parallelism and resource settings
Bash script is straightforward orchestration of standard experiment workflow with conditional metrics validation
Changes are consistent with existing GRPO test infrastructure
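
The "conditional metrics validation" mentioned above could look roughly like the following Python sketch. It assumes the TensorBoard-to-JSON step produces a mapping of metric name to `{step: value}`, and that validation is skipped unless the run reached a required step; both are assumptions, not details confirmed by this PR. The function names, the threshold, and the sample values are hypothetical; only the metric name is taken from the PR description.

```python
from statistics import mean


def last_step(metrics):
    """Return the highest step recorded across all metric series (0 if empty)."""
    return max((int(step) for series in metrics.values() for step in series), default=0)


def validate(metrics, min_mean_by_metric, required_step):
    """Check mean-value floors, but only if the run reached required_step.

    Returns None when validation is skipped, otherwise a list of failure
    messages (an empty list means every check passed).
    """
    if last_step(metrics) < required_step:
        return None  # run ended early, so thresholds are not enforced
    failures = []
    for name, floor in min_mean_by_metric.items():
        series = metrics.get(name, {})
        if not series:
            failures.append(f"{name}: no data")
            continue
        average = mean(series.values())
        if average < floor:
            failures.append(f"{name}: mean {average:.2f} < {floor}")
    return failures


# Made-up sample data for illustration; these are not results from the PR.
example_metrics = {
    "Performance/tokens_per_sec_per_gpu": {
        "1": 990.0, "2": 1000.0, "3": 1005.0, "4": 1010.0, "5": 995.0,
    }
}

if __name__ == "__main__":
    result = validate(
        example_metrics,
        {"Performance/tokens_per_sec_per_gpu": 900.0},
        required_step=5,
    )
    print("skipped" if result is None else (result or "all checks passed"))
```

Returning `None` for a skipped run keeps "not checked" distinct from "checked and passed", which matters when a short smoke run reuses the same script.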

Possibly related PRs

perf: [Perf script] QWEN3 30B-A3B tensor_parallel_size from 4 to 2 #1558: Modifies the same generation.vllm_cfg.tensor_parallel_size setting for Qwen3 30B-A3B generation configs.
perf: perf script change for qwen30b-a3b #1526: Adjusts model and vLLM parallelism settings across Qwen3 30B-A3B performance configurations.
cp: feat: Onboard perf recipes in tests (1322) into r0.4.0 #1497: Adds GRPO LLM performance recipes and test-suite scripts in the same examples/configs/recipes/llm/performance and tests/test_suites/llm/performance directories.

Suggested labels

Performance, Run CICD

Suggested reviewers

guyueh1
terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ⚠️ Warning | PR adds new performance recipes and test suites without documenting actual test results, performance numbers, or convergence data in the description. | Update PR description with actual test results from WandB report including final metrics, convergence behavior, and performance comparisons, and correct identified configuration naming errors. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding a GRPO recipe configuration for Qwen3 30B with async-8-off settings. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4fffe0 and 798e0ec.

📒 Files selected for processing (3)

examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.yaml (1 hunks)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh (1 hunks)
tests/test_suites/performance.txt (1 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.sh