Grouped Rubric Bugfixes and Demo Prep by AutumnAurelium · Pull Request #1198 · NVIDIA-NeMo/RL

AutumnAurelium · 2025-09-24T07:39:53Z

Changelog:

Verifiers Rubric#score_rollouts now receive rollouts guaranteed to be part of the same group.
Removed GroupedRubric in favor of the above pattern.
Migrated PairwiseJudgeRubric to use the new system.
Added two-GPU reverser config for demonstration.
Switched several example configs to not specifying a particular max_new_tokens sampling param, in favor of vLLM's configured max context length.
Added TT->HF checkpoint conversion script

Summary by CodeRabbit

New Features
- Added vLLM-over-HTTP generation backend and integrated tool-call parsing.
- Introduced DTensor v2 policy worker and configuration.
- Added Verifiers (VF) integration: environments, rollouts, and GRPO runner.
- Implemented Qwen3 and Qwen3-MoE custom models with parallelization and adapters.
- Enhanced rollout logs (tool calls, env metrics) and HTML rendering in WandB.
- Provided multiple GRPO config files and example environments (reverse-text, tool-test, math tools, alphabet-sort, env-group, judges).
Documentation
- New guides and READMEs, including Torch nightly usage and VF env docs.
Chores
- Added license file, dependency updates (vLLM, verifiers, flash-attn), workspace setup, and ignore rules.
- Utility scripts for nightly setup, vLLM build, checkpoint conversion.

coderabbitai · 2025-09-24T07:40:04Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds verifiers (VF) environment integration and a new vLLM-over-HTTP generation backend. Introduces multiple VF example envs, configs, and a GRPO runner. Expands GRPO to run VF rollouts, propagates tool-calls, enriches logging, and adds DTensor v2 policy workflow. Implements custom Qwen3/Qwen3MoE models, parallelization, and state-dict adapters. Updates dependencies and build scripts.

Changes

Cohort / File(s)	Summary
Repo meta `/.gitignore`, `/LICENSE_TORCHTITAN`, `/default_runtime_env.yaml`, `/pyproject.toml`	Ignore agent dir; add BSD license file; no-op yaml churn; broaden deps (verifiers, vf-exts, vllm, flash-attn), add optional vllm-http extra, workspace member for vf-exts, and conflict rules.
Docs `/docs/guides/torch-nightly.md`, `/examples/vf-envs/*/README.md`	New guide for nightly PyTorch and vLLM setup; READMEs for VF environments with usage and metrics.
VF extensions package `/env_api/vf_exts/pyproject.toml`, `/env_api/vf_exts/vf_exts/__init__.py`, `/env_api/vf_exts/vf_exts/env/mt_env_group.py`, `/env_api/vf_exts/vf_exts/rubric/__init__.py`, `/env_api/vf_exts/vf_exts/rubric/btrm_judge.py`	Add vf-exts package: versioned API, MultiTurnEnvGroup wrapper, export PairwiseJudgeRubric; implement pairwise judge rubric with OpenAI client.
VF example environments `/examples/vf-envs/vf_reverse_text/`, `/examples/vf-envs/vf_alphabet_sort/`, `/examples/vf-envs/vf_tool_test/`, `/examples/vf-envs/vf_smolagents_math_tools/`, `/examples/vf-envs/vf_pairwise_judge/`, `/examples/vf-envs/vf_policy_judge/`, `/examples/vf-envs/vf_multigroup_test/*`	Add multiple VF env loaders, rubrics, tools, datasets, and packaging configs; include multi-env group and pairwise/policy judge envs.
RL configs (GRPO) `/examples/configs/rl//.yaml`, `/examples/configs/rl//qwen3_4B.yaml`, `/examples/configs/rl/reverser/`	New end-to-end GRPO YAMLs for Qwen3 variants and tasks (env group, pairwise/policy judge, reverser, tool test), including vllm_http generation settings.
Examples and tests `/examples/run_grpo_vf.py`, `/examples/torchtitan_tests/*`	Add GRPO runner for VF integration and model-comparison scripts (Llama3, Qwen3, Qwen3-MoE) against HF baselines.
GRPO + rollout integration `/nemo_rl/algorithms/grpo.py`, `/nemo_rl/experience/rollouts.py`, `/nemo_rl/experience/vf_rollouts.py`, `/nemo_rl/environments/vf_environment.py`, `/nemo_rl/data/__init__.py`	Add vllm_http backend hooks, VF rollout path, per-sample rollout logs and tool_calls propagation, VF environment wrapper, and tokenizer_kwargs in DataConfig.
vLLM HTTP backend `/nemo_rl/models/generation/vllm_http/`, `/nemo_rl/models/generation/__init__.py`, `/nemo_rl/models/generation/vllm/worker*.py`, `/nemo_rl/distributed/ray_actor_environment_registry.py`, `/nemo_rl/distributed/virtual_cluster.py`	Introduce Ray Serve vLLM HTTP app, config, and generation stub; parse tool calls; register executables and actor mappings; add VLLM_HTTP executable.
Policy + DTensor v2 `/nemo_rl/models/policy/__init__.py`, `/nemo_rl/models/policy/lm_policy.py`, `/nemo_rl/models/policy/dtensor_policy_worker.py`, `/nemo_rl/models/policy/dtensor_v2/v2_config.py`, `/nemo_rl/models/policy/dtensor_v2/v2_policy_worker.py`	Add DTensor v2 config, worker, and LM policy path; adjust input_ids dtype; compute device mesh and training/inference flows for DTensor v2.
Custom models core `/nemo_rl/models/custom/{model.py,attention.py,expert_parallel.py,parallelize.py,utils.py}`, `/nemo_rl/models/custom/kernels/moe_indices.py`	Add base model/args, attention (flex/SDPA), MoE kernels/utilities, parallelization (TP/EP/FS/AC), DTensor helpers, and Triton kernel for MoE index permutation.
Qwen3 models `/nemo_rl/models/custom/qwen3/{args.py,model.py,parallelize.py,state_dict_adapter.py}`	Implement Qwen3 model, args, per-layer TP plan, buffer replication, HF state-dict adapter.
Qwen3-MoE models `/nemo_rl/models/custom/qwen3moe/{args.py,model.py,parallelize.py,state_dict_adapter.py}`	Implement Qwen3MoE model and args, parallelize wrapper, and HF state-dict adapter.
Generation workers changes `/nemo_rl/models/generation/vllm/vllm_worker.py`, `/nemo_rl/models/generation/vllm/vllm_worker_async.py`	Decode generated texts and attach parsed tool_calls to outputs.
Distributed + utils `/nemo_rl/distributed/batched_data_dict.py`, `/nemo_rl/utils/logger.py`	Tighten tensor type checks in batching; render rollout logs to HTML in WandB with helpers.
Tools `/tools/setup-nightly-torch.sh`, `/tools/build-vllm-with-nightly.sh`, `/tools/convert_checkpoint.py`, `/tools/eliminate_torch_deps.py`	Scripts to set up nightly torch and vLLM, convert DCP to HF checkpoints, and strip torch deps from files.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Trainer as GRPO Trainer
  participant Policy as VllmHttpGeneration
  participant Serve as Ray Serve (VLLMOpenAIServe)
  participant Env as VF Environment(s)
  Note over Trainer,Policy: Setup phase
  Trainer->>Policy: start Serve app + finish_generation()
  Policy->>Serve: Deploy vLLM HTTP
  Serve-->>Policy: Ready

  Note over Trainer,Env: Rollout (VF path)
  Trainer->>Env: a_generate(prompts, sampling_args)
  Env->>Serve: OpenAI-style completions (HTTP)
  Serve-->>Env: Completions (+tool_calls)
  Env-->>Trainer: Messages, rewards, metrics

  Note over Trainer: Train step and logging
  Trainer->>Trainer: Compute losses, update policy
  Trainer->>Serve: Optional admin ops (prefix cache reset)

sequenceDiagram
  autonumber
  participant Runner as run_grpo_vf.py
  participant Loader as vf.load_environment
  participant Group as MultiTurnEnvGroup
  participant GRPO as Trainer
  Runner->>Loader: load VF env(s)
  Loader-->>Runner: Env, datasets, rubric
  Runner->>Group: Aggregate envs
  Runner->>GRPO: setup + train(config, env_group)
  GRPO->>Group: rollout per task
  Group->>Env: delegate rollout
  Env-->>GRPO: results + rewards

sequenceDiagram
  autonumber
  participant Scorer as PairwiseJudgeRubric
  participant Judge as OpenAI Judge Client
  Scorer->>Judge: Prompt(A vs B)
  Judge-->>Scorer: <answer>1-7</answer>
  Scorer->>Scorer: Parse opinion → rewards
  Scorer-->>Caller: RolloutScores (+malformed flags)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

feat: Support Reward Model based Environments #1026 — Also modifies GRPO to integrate a new environment path (RewardModelEnvironment); overlaps conceptually with this PR’s GRPO env/backend extensions.
feat: Expose async vLLM engine as HTTP server #1110 — Adds/adjusts vLLM HTTP and async server integration; directly related to the new vllm_http backend introduced here.

Suggested labels

documentation, CI:L1

Suggested reviewers

terrykong
joyang-nv
yuki-97
jgerh

Poem

In burrows of code I hop with delight,
New envs to judge, new servers to light.
Qwen3 and friends, all nicely aligned—
GRPO munches carrots, metrics refined.
Tool-calls parsed, rewards in a heap—
Ship it, and let the clusters leap! 🥕✨

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63439ac and 28279fc.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (83)

.gitignore (1 hunks)
LICENSE_TORCHTITAN (1 hunks)
default_runtime_env.yaml (1 hunks)
docs/guides/torch-nightly.md (1 hunks)
env_api/vf_exts/pyproject.toml (1 hunks)
env_api/vf_exts/vf_exts/__init__.py (1 hunks)
env_api/vf_exts/vf_exts/env/mt_env_group.py (1 hunks)
env_api/vf_exts/vf_exts/rubric/__init__.py (1 hunks)
env_api/vf_exts/vf_exts/rubric/btrm_judge.py (1 hunks)
examples/configs/rl/alphabet_sort/qwen3_4B.yaml (1 hunks)
examples/configs/rl/env_group/qwen3_4B.yaml (1 hunks)
examples/configs/rl/pairwise_judge/qwen3_4B.yaml (1 hunks)
examples/configs/rl/policy_judge/qwen3_4B.yaml (1 hunks)
examples/configs/rl/reverser/qwen3_0.6B_2gpu.yaml (1 hunks)
examples/configs/rl/reverser/qwen3_0.6B_dtv1.yaml (1 hunks)
examples/configs/rl/reverser/qwen3_0.6B_dtv2.yaml (1 hunks)
examples/configs/rl/reverser/qwen3_30B_A3B.yaml (1 hunks)
examples/configs/rl/tooltest/qwen3_4B.yaml (1 hunks)
examples/run_grpo_vf.py (1 hunks)
examples/torchtitan_tests/llama3_compare.py (1 hunks)
examples/torchtitan_tests/qwen3_compare.py (1 hunks)
examples/torchtitan_tests/qwen3moe_compare.py (1 hunks)
examples/vf-envs/vf_alphabet_sort/README.md (1 hunks)
examples/vf-envs/vf_alphabet_sort/pyproject.toml (1 hunks)
examples/vf-envs/vf_alphabet_sort/vf_alphabet_sort.py (1 hunks)
examples/vf-envs/vf_multigroup_test/pyproject.toml (1 hunks)
examples/vf-envs/vf_multigroup_test/vf_multigroup_test.py (1 hunks)
examples/vf-envs/vf_pairwise_judge/pyproject.toml (1 hunks)
examples/vf-envs/vf_pairwise_judge/vf_pairwise_judge.py (1 hunks)
examples/vf-envs/vf_policy_judge/pyproject.toml (1 hunks)
examples/vf-envs/vf_policy_judge/vf_policy_judge.py (1 hunks)
examples/vf-envs/vf_reverse_text/README.md (1 hunks)
examples/vf-envs/vf_reverse_text/pyproject.toml (1 hunks)
examples/vf-envs/vf_reverse_text/vf_reverse_text.py (1 hunks)
examples/vf-envs/vf_smolagents_math_tools/README.md (1 hunks)
examples/vf-envs/vf_smolagents_math_tools/pyproject.toml (1 hunks)
examples/vf-envs/vf_smolagents_math_tools/vf_smolagents_math_tools.py (1 hunks)
examples/vf-envs/vf_tool_test/README.md (1 hunks)
examples/vf-envs/vf_tool_test/pyproject.toml (1 hunks)
examples/vf-envs/vf_tool_test/vf_tool_test.py (1 hunks)
nemo_rl/algorithms/grpo.py (15 hunks)
nemo_rl/data/__init__.py (2 hunks)
nemo_rl/distributed/batched_data_dict.py (1 hunks)
nemo_rl/distributed/ray_actor_environment_registry.py (1 hunks)
nemo_rl/distributed/virtual_cluster.py (1 hunks)
nemo_rl/environments/vf_environment.py (1 hunks)
nemo_rl/experience/rollouts.py (13 hunks)
nemo_rl/experience/vf_rollouts.py (1 hunks)
nemo_rl/models/custom/attention.py (1 hunks)
nemo_rl/models/custom/convert.py (1 hunks)
nemo_rl/models/custom/expert_parallel.py (1 hunks)
nemo_rl/models/custom/kernels/moe_indices.py (1 hunks)
nemo_rl/models/custom/model.py (1 hunks)
nemo_rl/models/custom/moe.py (1 hunks)
nemo_rl/models/custom/parallelize.py (1 hunks)
nemo_rl/models/custom/qwen3/args.py (1 hunks)
nemo_rl/models/custom/qwen3/model.py (1 hunks)
nemo_rl/models/custom/qwen3/parallelize.py (1 hunks)
nemo_rl/models/custom/qwen3/state_dict_adapter.py (1 hunks)
nemo_rl/models/custom/qwen3moe/args.py (1 hunks)
nemo_rl/models/custom/qwen3moe/model.py (1 hunks)
nemo_rl/models/custom/qwen3moe/parallelize.py (1 hunks)
nemo_rl/models/custom/qwen3moe/state_dict_adapter.py (1 hunks)
nemo_rl/models/custom/state_dict_adapter.py (1 hunks)
nemo_rl/models/custom/utils.py (1 hunks)
nemo_rl/models/generation/__init__.py (2 hunks)
nemo_rl/models/generation/vllm/vllm_worker.py (4 hunks)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1 hunks)
nemo_rl/models/generation/vllm_http/config.py (1 hunks)
nemo_rl/models/generation/vllm_http/vllm_http.py (1 hunks)
nemo_rl/models/generation/vllm_http/vllm_http_generation.py (1 hunks)
nemo_rl/models/generation/vllm_http/worker_ext.py (1 hunks)
nemo_rl/models/policy/__init__.py (2 hunks)
nemo_rl/models/policy/dtensor_policy_worker.py (3 hunks)
nemo_rl/models/policy/dtensor_v2/v2_config.py (1 hunks)
nemo_rl/models/policy/dtensor_v2/v2_policy_worker.py (1 hunks)
nemo_rl/models/policy/lm_policy.py (5 hunks)
nemo_rl/utils/logger.py (2 hunks)
pyproject.toml (5 hunks)
tools/build-vllm-with-nightly.sh (1 hunks)
tools/convert_checkpoint.py (1 hunks)
tools/eliminate_torch_deps.py (1 hunks)
tools/setup-nightly-torch.sh (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

AutumnAurelium · 2025-09-24T07:40:28Z

Apologies, meant to submit this to my own fork and fat-fingered the dropdown.

AutumnAurelium added 30 commits August 13, 2025 17:08

initial verifiers test

2f6b29c

fix a few bugs

903f9c4

fix incorrect type hint and bizarre tokenization behavior

b7fa6b6

use answers instead of discarding

a63d782

use verifiers format_dataset to support question-only datasets

fbb4bc1

try to prevent KeyErrors with weird states

f5587fb

move sample VF env to nemo_rl package

a9f16c5

forgot to pass dataset

bb318bd

specify python env for vf ray worker

437fa3b

dependency hack to prevent OOM in vllm worker

f9b40a0

back to normal deps

8d4fd27

back to normal deps v2

338f838

crank context limit

bd8d5d8

clarify error somewhat

2a98ff4

keyerror

eb001ef

fix state init to better mimic verifiers rollouts

4dff124

un-crank context length

5ad96b5

try disabling colocation for this run

0075c9b

explicitly disable colo

a8812d5

actually specify 2-gpu system

cee97d5

try commented again

63ff5f9

don't dedicate node

56ef70d

use simpler and faster example env

8dd22df

disable validation since we don't have one

bc6ff96

enable wandb

5f54abc

use existing option

ee6a7de

add multiturn alphabet env test

222498d

make script/run names more specific

6a4e95d

different checkpoint dirs

0f33ee9

fix wandb naming

1217ea2

AutumnAurelium added 16 commits September 22, 2025 12:31

remove reference to grouped rubric

0bede27

qualify names

6157f99

qualify rolloutscores

cd46513

add 2gpu config

346c51b

1 GPU for vllm actors

4d4accf

disable custom client

df150b7

experiment: checkpoint converter

79ab1ae

save with full HF model

cf149c6

log progress

8367d97

do not eat exceptions

df67de4

blindly init process group to avoid crash

4014679

manually set rank

92f46a7

use non-dist-dependent load system

3e67a22

add suffix for proper location

be9844e

use step-based checkpointing

c0b8bad

-1 gen batch size to avoid confusion

28279fc

AutumnAurelium requested review from a team as code owners September 24, 2025 07:39

AutumnAurelium closed this Sep 24, 2025

github-actions Bot added the Documentation Improvements or additions to documentation label Sep 24, 2025

AutumnAurelium deleted the aria/grouped-fix branch September 28, 2025 01:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grouped Rubric Bugfixes and Demo Prep#1198

Grouped Rubric Bugfixes and Demo Prep#1198
AutumnAurelium wants to merge 222 commits intoNVIDIA-NeMo:mainfrom
arcee-ai:aria/grouped-fix

AutumnAurelium commented Sep 24, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Sep 24, 2025 •

edited

Loading

Review failed

Uh oh!

AutumnAurelium commented Sep 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AutumnAurelium commented Sep 24, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

AutumnAurelium commented Sep 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AutumnAurelium commented Sep 24, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Sep 24, 2025 •

edited

Loading