
cp: feat: Onboard perf recipes in tests (1322) into r0.4.0 #1497

Merged
terrykong merged 1 commit into r0.4.0 from cherry-pick-1322-r0.4.0
Nov 10, 2025

Conversation

@chtruong814
Contributor

@chtruong814 chtruong814 commented Nov 10, 2025

beep boop [🤖]: Hi @guyueh1 👋,

we've cherry-picked #1322 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added experiment configurations for model training supporting multiple model architectures and execution modes, including asynchronous training variants.
  • Tests

    • Introduced performance test suite with automated test scripts, log processing utilities, and metrics validation for training experiments.
    • Updated test suite registry to support new performance testing infrastructure.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Co-authored-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Anna Shors <ashors@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Nov 10, 2025

📝 Walkthrough

Walkthrough

This PR adds GRPO performance configurations and test suites for multiple LLM variants, including DeepSeek v3, LLaMA 3.1, and Qwen3 models, with both standard and asynchronous training setups. It includes configuration files specifying training hyperparameters, model parallelism, hardware allocation, and logging, along with bash test scripts for performance validation and test-infrastructure integration.

Changes

Cohort / File(s) Summary
GRPO Configuration Files – DeepSeek v3
examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml, grpo-deepseek-v3-64n8g-async-1off.yaml
Added comprehensive GRPO training configs for DeepSeek v3 with Megatron distributed settings, vLLM generation, importance sampling correction, checkpointing, and WandB logging. Async variant enables async_grpo with in-flight weight updates and specifies 64-node cluster.
GRPO Configuration Files – LLaMA 3.1
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml, grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
Added GRPO configs for LLaMA 3.1 8B Instruct with Megatron settings, async variant enabled with max_trajectory_age_steps=1, async_engine, and 2-node cluster allocation.
GRPO Configuration Files – Qwen3 (235B and 32B)
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n8g.yaml, grpo-qwen3-235b-32n8g-async-1off.yaml, grpo-qwen3-32b-4n8g.yaml, grpo-qwen3-32b-8n8g-async-1off.yaml
Added GRPO configs for Qwen3 235B and 32B models with Megatron parallelism, importance sampling correction, and corresponding async-1off variants with asynchronous GRPO settings.
GRPO Configuration Files – Qwen3 (30B A3B)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml, grpo-qwen3-30ba3b-4n8g-async-1off.yaml
Added GRPO configs for Qwen3 30B A3B variant with standard and async-1off setups, specifying tensor/model/expert parallelism and sequence parallelism.
Performance Test Infrastructure
tests/test_suites/llm/performance/common.env
Added shared bash environment setup for LLM performance tests, defining functions for early exit on max steps, computing config paths, and setting up experiment directories.
GRPO Performance Test Scripts – DeepSeek v3
tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh, grpo-deepseek-v3-64n8g-async-1off.sh
Added test harnesses executing GRPO training via uv run, with TensorBoard-to-JSON conversion and conditional metrics validation on token probability error thresholds.
GRPO Performance Test Scripts – LLaMA 3.1
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh, grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
Added test harnesses for LLaMA 3.1 with experiment configuration, logging, checkpointing, and metrics validation based on train/loss thresholds.
GRPO Performance Test Scripts – Qwen3
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh, grpo-qwen3-235b-32n8g-async-1off.sh, grpo-qwen3-30ba3b-4n8g.sh, grpo-qwen3-30ba3b-4n8g-async-1off.sh, grpo-qwen3-32b-4n8g.sh, grpo-qwen3-32b-8n8g-async-1off.sh
Added test scripts for Qwen3 model variants with identical structure: environment setup, experiment execution, log conversion, and conditional metrics checks.
Test Suite Registry
tests/test_suites/performance.txt
Added listing of 15 new GRPO performance test script paths under GRPO section.
Test Orchestration
tests/unit/test_recipes_and_test_suites.py
Integrated the performance test suite into test discovery by adding a performance_test_suite() fixture and a performance_test_suite_path constant, and by updating the all_test_suites fixture and the test_test_suites_exist parametrization.
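The shared common.env helpers referenced throughout the table above can be sketched as follows. Only exit_if_max_steps_reached appears by name in this review (see the code-graph section later in the page); compute_config_path and its path-mapping convention are hypothetical illustrations, not the repository's actual implementation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of helpers a common.env for performance tests might
# provide; the real file's names and logic may differ.

# Derive the recipe YAML path from the driver script's own location, e.g.
# tests/test_suites/llm/performance/foo.sh
#   -> examples/configs/recipes/llm/performance/foo.yaml
# (driver filenames mirror the YAML base names per the coding guidelines).
compute_config_path() {
    local script_path="$1"
    local base
    base="$(basename "$script_path" .sh)"
    echo "examples/configs/recipes/llm/performance/${base}.yaml"
}

# Exit early when a previous (resumed) run already reached MAX_STEPS, so
# multi-run CI jobs do not redo finished work. The step-count file location
# is an assumption for illustration.
exit_if_max_steps_reached() {
    local step_file="$1" max_steps="$2"
    if [[ -f "$step_file" ]] && (( $(cat "$step_file") >= max_steps )); then
        echo "Max steps (${max_steps}) already reached; exiting."
        exit 0
    fi
}
```

A driver script would source common.env, call exit_if_max_steps_reached, then invoke training with uv run against the computed config path.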

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Breadth of changes: 23 new files (10 configs, 10 test scripts, 1 common.env, 1 registry, 1 Python update) across multiple model variants and async variants.
  • Homogeneity: High — test scripts and config files follow consistent, repetitive patterns with model-specific parameter variations, reducing per-file review complexity.
  • Key areas requiring attention:
    • Verify parameter consistency across standard and async-1off config pairs (e.g., max_trajectory_age_steps=1, in_flight_weight_updates=true).
    • Validate Megatron parallelism settings (tensor/pipeline/expert sizes) are correctly proportioned for node/GPU counts in each config.
    • Ensure metric thresholds in test scripts (e.g., train/token_mult_prob_error < 1.1) are appropriate for each model size.
    • Confirm common.env path resolution and CONFIG_PATH mapping logic correctly maps test script locations to config paths.

Possibly related PRs

  • NVIDIA-NeMo/RL#1098: Introduces async_grpo training infrastructure and async utilities that are directly consumed by the async-1off variants in this PR.
  • NVIDIA-NeMo/RL#1322: Overlaps with nearly identical LLM performance recipe configs, test scripts, and harness files as this PR.

Suggested labels

r0.4.0, CI:L0

Suggested reviewers

  • guyueh1
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Test Results For Major Changes ⚠️ Warning: the PR adds 10+ YAML configs and 11+ test scripts for GRPO performance recipes but lacks test results, validation data, or execution confirmation in the PR description. Document test execution results and baseline metrics (mean(train/token_mult_prob_error) values), verify that configs match deployment specs, and address the review comments about log naming and tensorboard settings.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly identifies the main change, cherry-picking #1322 (onboarding performance recipes in tests) into the r0.4.0 release branch.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)

27-27: Add trailing newline.

YAML files should end with a trailing newline per standard conventions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 399d22f and 4676af3.

📒 Files selected for processing (23)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n8g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n8g.yaml (1 hunks)
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml (1 hunks)
  • tests/test_suites/llm/performance/common.env (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh (1 hunks)
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh (1 hunks)
  • tests/test_suites/performance.txt (1 hunks)
  • tests/unit/test_recipes_and_test_suites.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/performance.txt
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh
  • tests/test_suites/llm/performance/common.env
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n8g.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n8g.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/<domain>/

Files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n8g.yaml
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • tests/unit/test_recipes_and_test_suites.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect performance, include before-and-after performance numbers in the PR description, along with configuration and context.
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/performance.txt
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/unit/test_recipes_and_test_suites.py
📚 Learning: 2025-09-20T14:59:08.052Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect performance, include before-and-after performance numbers in the PR description, along with configuration and context.

Applied to files:

  • tests/test_suites/performance.txt
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/performance.txt
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh
  • tests/test_suites/llm/performance/common.env
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
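The NUM_RUNS expression in the learning above is shell integer ceiling division; a minimal illustration with made-up values:

```shell
#!/usr/bin/env bash
# Standard test-infrastructure variables from the learning above; the
# values here are illustrative only.
STEPS_PER_RUN=10
MAX_STEPS=45

# Integer ceiling of MAX_STEPS / STEPS_PER_RUN: 45 steps at 10 steps per
# run needs 5 runs (4 full runs plus one partial run).
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"
```

This is why NUM_RUNS and its inputs should not be flagged as unused: they are consumed by external launch tooling even when the script body never references them.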
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/llm/*.sh : LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Applied to files:

  • tests/test_suites/performance.txt
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh
  • tests/test_suites/llm/performance/common.env
  • tests/unit/test_recipes_and_test_suites.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/**/*.{sh} : For new model support, add a matching driver shell script under tests/test_suites/<domain>/ that sources common.env and invokes 'uv run ... --config <yaml>'

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh
  • tests/test_suites/llm/performance/common.env
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh
  • tests/test_suites/llm/performance/common.env
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/performance/common.env
  • tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.{yaml,sh} : Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : VLM recipe YAML filenames must follow: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml
  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/*.yaml : When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
  • examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/** : Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/llm/performance/common.env
  • tests/unit/test_recipes_and_test_suites.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/vlm/*.sh : VLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Applied to files:

  • tests/test_suites/llm/performance/common.env
🧬 Code graph analysis (10)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (1)
tests/test_suites/llm/performance/common.env (1)
  • exit_if_max_steps_reached (12-20)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh
  • [warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 28-28: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh
  • [warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 34-34: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh
  • [warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 28-28: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh
  • [warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 28-28: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh
  • [warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 28-28: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • [warning] 8-8: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 11-11: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 29-29: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh
  • [warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 28-28: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh
  • [warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 28-28: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh
  • [warning] 10-10: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 13-13: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 14-14: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 20-20: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 34-34: Double quote array expansions to avoid re-splitting elements. (SC2068)

tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • [warning] 8-8: NUM_NODES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 11-11: NUM_RUNS appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally). (SC2034)
  • [warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 29-29: Double quote array expansions to avoid re-splitting elements. (SC2068)
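Taken together, the recurring SC2034 and SC2164 findings map to two idiomatic fixes. The sketch below is illustrative only: the variable names come from the warnings, but the surrounding script bodies are not shown here, and PROJECT_ROOT is a placeholder.

```shell
# SC2034: variables consumed by an external launcher or a sourced script
# should be exported; otherwise shellcheck correctly flags them as unused
# within this file.
export NUM_NODES=4
export NUM_RUNS=1
export NUM_MINUTES=120

# SC2164: abort if cd fails rather than running later commands in the
# wrong directory. PROJECT_ROOT here is a stand-in for illustration.
PROJECT_ROOT="$(pwd)"
cd "$PROJECT_ROOT" || exit 1
```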

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (27)
examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g.yaml (1)

1-44: LGTM! Configuration follows established patterns and is consistent with other GRPO recipe files in this PR. Filename, inheritance structure, and all settings align with the GRPO infrastructure.

examples/configs/recipes/llm/performance/grpo-qwen3-32b-8n8g-async-1off.yaml (1)

1-32: LGTM! Async variant correctly inherits from base config and properly enables async GRPO settings with importance sampling correction and async vLLM engine. Naming and configuration align with established patterns.

examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n8g.yaml (1)

1-59: LGTM! Configuration for large-scale Qwen3-235B model is well-structured with comprehensive distributed training settings (pipeline, context, and expert parallelism). The documented caveat about tensorboard is helpful context.

examples/configs/recipes/llm/performance/grpo-qwen3-32b-4n8g.yaml (1)

1-41: LGTM! Base configuration for Qwen3-32B is properly structured. This serves as the parent config for async variants, with appropriate parallelism strategy for the 4-node cluster.

examples/configs/recipes/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.yaml (1)

1-33: LGTM! Async variant properly enables async GRPO with trajectory management and importance sampling. Parallelism parameters are appropriately adjusted for the async configuration.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-async-1off.yaml (1)

1-33: LGTM! Large-scale async configuration for DeepSeek v3 is well-structured. Parallelism strategy (pure pipeline+expert with 32-way tensor parallelism for generation) is appropriate for the 64-node cluster.

tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh (2)

28-28: Consider quoting $@ to avoid word-splitting (SC2068).

Line 28 passes $@ unquoted, which shellcheck flags as an error: "Double quote array expansions to avoid re-splitting elements." While recent codebase guidance (zpqiu, PR 1324) indicates using $@ unquoted for consistency with existing test scripts, this is a potential source of subtle bugs if arguments contain spaces or special characters. For robustness, consider using "$@" instead.

Alternatively, if unquoted $@ is an established pattern in this codebase worth preserving, you can acknowledge this as an intentional trade-off for consistency.
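For illustration (a standalone toy, not the test script itself), the re-splitting behavior SC2068 warns about is easy to observe. The function and argument names below are hypothetical:

```shell
count_args() { echo "$#"; }

forward_unquoted() { count_args $@; }    # SC2068: arguments re-split on whitespace
forward_quoted()   { count_args "$@"; }  # each original argument preserved intact

forward_unquoted "policy.model_name=my model" cluster.num_nodes=2   # prints 3
forward_quoted   "policy.model_name=my model" cluster.num_nodes=2   # prints 2
```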


1-39: LGTM! Script follows established test infrastructure patterns: standard config variables, uv run usage, environment variable referencing, and conditional metrics logic are all correct. One minor shellcheck concern noted separately above.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml (1)

1-54: LGTM! Llama 3.1 8B Instruct configuration is well-structured for a 2-node cluster. The max_new_tokens=4096 setting correctly matches max_total_sequence_length per established patterns, with overflow handling delegated to vLLM.

tests/test_suites/performance.txt (1)

1-15: LGTM!

The performance test suite listing is well-organized, grouping standard and async variants separately. All listed paths correspond to test scripts included in this PR.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)

22-24: Verify log directory and experiment name match cluster configuration.

The log directory and WandB experiment name reference "1T1G" (suggesting 1 tensor parallel, 1 pipeline parallel) but the cluster configuration specifies 2 nodes with 8 GPUs per node. This naming inconsistency could cause confusion when analyzing results.

Consider aligning the naming with the cluster configuration or documenting the parallelism strategy if "1T1G" refers to a specific tensor/pipeline parallelism configuration.

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh (1)

1-39: LGTM!

The script follows the established test infrastructure patterns correctly, including standard configuration variables, early-exit guards, experiment execution via uv run, and conditional metrics validation.

tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh (1)

1-45: LGTM!

The script properly handles DeepSeek-specific requirements (NVLS disablement for OOM avoidance, configurable HF checkpoint path) while following the standard test infrastructure patterns.

tests/test_suites/llm/performance/common.env (1)

1-45: LGTM!

The common environment script provides robust infrastructure for performance tests, including proper error handling (set -eou pipefail), path validation, early-exit optimization, and dry-run support.
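As a sketch only: the real exit_if_max_steps_reached lives in common.env (lines 12-20) and is not shown here, so the log format and matching logic below are assumptions, not the actual implementation. The general shape of such an early-exit guard might be:

```shell
# Hypothetical early-exit guard: skip re-running an experiment whose log
# already shows the final training step. The "step: N" log format is an
# assumption for illustration.
exit_if_max_steps_reached() {
  local run_log="$1" max_steps="$2"
  if [ -f "$run_log" ] && grep -q "step: ${max_steps}" "$run_log"; then
    echo "Max steps (${max_steps}) already reached in ${run_log}; skipping."
    exit 0
  fi
}

# Demo in a subshell so the early exit does not terminate the calling shell.
demo_log=$(mktemp)
echo "step: 5" > "$demo_log"
( exit_if_max_steps_reached "$demo_log" 5 )
```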

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh (1)

1-39: LGTM!

The script follows the established test infrastructure patterns correctly for the LLaMA 3.1 8B instruct model performance testing.

tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1)

1-40: LGTM!

The script follows the established test infrastructure patterns correctly for the Qwen3 235B model, including NVLS disablement to avoid OOM issues on the large-scale 32-node deployment.

tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.sh (1)

1-39: LGTM! Script follows the established test infrastructure pattern.

The script correctly implements the standard GRPO performance test flow: environment setup, early-exit guard, experiment execution with proper logging and checkpointing, TensorBoard log conversion, and conditional metrics validation.

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh (2)

1-39: Script implementation follows the standard pattern correctly.

Assuming the filename is verified to be correct, the script logic properly implements the GRPO performance test flow with appropriate configuration, logging, and metrics validation.


1-1: The shell script filename is correct and not a typo. The verification shows that grpo-qwen3-30ba3b-4n8g.sh has a corresponding YAML config at grpo-qwen3-30ba3b-4n8g.yaml, and the naming is consistent across both files. The "30ba3b" is an intentional model variant identifier, not a naming error. The shell script filename properly mirrors the YAML base name as required by the coding guidelines.

Likely an incorrect or invalid review comment.

tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh (1)

1-45: LGTM! DeepSeek-specific configuration is appropriately handled.

The script correctly implements DeepSeek v3 requirements including NVLS disablement for OOM avoidance, configurable HF checkpoint path, and proper model/tokenizer overrides. The higher resource allocation (32 nodes, 240 minutes) aligns with the larger model scale.

tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (1)

1-6: LGTM! Environment setup is appropriate for large model testing.

The NVLS disablement is properly documented and necessary to avoid OOM issues with the 235B model.
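For context, NVLS disablement is typically done through NCCL's environment variable; whether this script uses exactly this mechanism is an assumption, since its body is not reproduced here:

```shell
# NCCL_NVLS_ENABLE=0 disables NVLink SHARP (NVLS) collectives, trading some
# collective bandwidth for lower GPU memory pressure. The variable is a real
# NCCL knob; its use in this particular script is assumed, not confirmed.
export NCCL_NVLS_ENABLE=0
```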

tests/unit/test_recipes_and_test_suites.py (4)

31-31: LGTM! Performance test suite path properly defined.

The constant follows the established naming pattern for test suite paths.


72-80: LGTM! Performance test suite fixture correctly implemented.

The fixture follows the same pattern as nightly_test_suite and release_test_suite, ensuring consistency across the test infrastructure.


83-89: LGTM! Performance suite properly integrated into test aggregation.

The all_test_suites fixture correctly includes the new performance_test_suite parameter and concatenates it with nightly and release suites.


102-113: LGTM! Test parametrization updated to include performance suite.

The test_test_suites_exist test now validates all three test suites (nightly, release, and performance) with appropriate test IDs.

examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n8g.yaml (1)

1-57: LGTM! Configuration is comprehensive and follows established patterns.

The YAML properly configures:

  • GRPO training parameters with appropriate batch sizes for the scale
  • Megatron parallelism settings for DeepSeek v3 (EP16, PP16)
  • vLLM generation with async engine (TP32)
  • Proper cluster resource allocation (32 nodes × 8 GPUs)

The configuration aligns with the corresponding test script and follows the project's recipe naming conventions.

tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh (1)

1-39: LGTM! Script correctly implements the GRPO performance test pattern.

The script properly configures and executes the GRPO Qwen3-32B experiment with appropriate logging, checkpointing, and metrics validation.

Comment on lines +30 to +32
  log_dir: logs/grpo-qwen3-235b-32n8g-16T16G-async-1off
  wandb:
    name: grpo-qwen3-235b-32n8g-16T16G-async-1off

⚠️ Potential issue | 🟠 Major

Fix inconsistent log directory and experiment name.

The log directory and WandB experiment name reference "16T16G" but this configuration is for a 32-node, 8-GPU-per-node cluster (32n8g). This appears to be a copy-paste error from the 16n8g variant.

Apply this diff:

 logger:
-  log_dir: logs/grpo-qwen3-235b-32n8g-16T16G-async-1off
+  log_dir: logs/grpo-qwen3-235b-32n8g-async-1off
   wandb:
-    name: grpo-qwen3-235b-32n8g-16T16G-async-1off
+    name: grpo-qwen3-235b-32n8g-async-1off
🤖 Prompt for AI Agents
In
examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g-async-1off.yaml
around lines 30-32, the log_dir and wandb.name still contain "16T16G" from the
16n8g copy; update both entries so they reference the correct 32n8g
configuration (e.g., replace "16T16G" with "32T8G" so both log_dir and
wandb.name become consistent with the 32n8g-async-1off run).

Comment on lines +18 to +30
cd $PROJECT_ROOT
uv run examples/run_grpo_math.py \
    --config $CONFIG_PATH \
    grpo.max_num_steps=$MAX_STEPS \
    logger.log_dir=$LOG_DIR \
    logger.wandb_enabled=True \
    logger.wandb.project=nemo-rl \
    logger.wandb.name=$EXP_NAME \
    logger.monitor_gpus=True \
    checkpointing.enabled=True \
    checkpointing.checkpoint_dir=$CKPT_DIR \
    $@ \
    2>&1 | tee $RUN_LOG

⚠️ Potential issue | 🟠 Major

Add missing tensorboard_enabled configuration.

The script converts TensorBoard logs at line 33 but doesn't enable TensorBoard logging in the experiment configuration. This is inconsistent with other performance test scripts and may cause the log conversion to fail or produce empty results.

Apply this diff to enable TensorBoard logging:

     logger.monitor_gpus=True \
+    logger.tensorboard_enabled=True \
     checkpointing.enabled=True \
🧰 Tools
🪛 Shellcheck (0.11.0)
  • [warning] 18-18: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)
  • [error] 29-29: Double quote array expansions to avoid re-splitting elements. (SC2068)

🤖 Prompt for AI Agents
In tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh around lines 18 to
30, the experiment CLI call enables WandB but does not enable TensorBoard, yet
the script later converts TensorBoard logs; add the missing
logger.tensorboard_enabled=True flag to the uv run arguments (alongside the
existing logger.wandb.* and logger.monitor_gpus entries) so TensorBoard logging
is actually enabled for the run.

@terrykong terrykong merged commit c7c6b1d into r0.4.0 Nov 10, 2025
66 of 69 checks passed
@terrykong terrykong deleted the cherry-pick-1322-r0.4.0 branch November 10, 2025 20:50
@coderabbitai coderabbitai Bot mentioned this pull request Jan 13, 2026
4 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Feb 18, 2026
4 tasks