Grouped Rubric Bugfixes and Demo Prep #1198

Closed
AutumnAurelium wants to merge 222 commits into NVIDIA-NeMo:main from arcee-ai:aria/grouped-fix

@AutumnAurelium commented Sep 24, 2025

Changelog:

  • Verifiers Rubric#score_rollouts now receives rollouts that are guaranteed to be part of the same group.
  • Removed GroupedRubric in favor of the above pattern.
  • Migrated PairwiseJudgeRubric to use the new system.
  • Added two-GPU reverser config for demonstration.
  • Switched several example configs to omit the max_new_tokens sampling parameter, deferring to vLLM's configured maximum context length.
  • Added TT->HF checkpoint conversion script.
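
A minimal sketch of the grouped-scoring pattern from the changelog, assuming each rollout carries a group identifier. The `Rollout` dataclass, its `group_id` field, and the `score_by_group` helper are hypothetical names for illustration and are not the verifiers API. The only point is that the scorer is called once per group and never sees rollouts from different groups mixed together.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    # Hypothetical record; the real rollout type in verifiers differs.
    group_id: str
    completion: str


def score_by_group(
    rollouts: list[Rollout],
    score_group: Callable[[list[Rollout]], list[float]],
) -> list[float]:
    """Call score_group once per group and return rewards in input order."""
    # Remember each rollout's original position so rewards can be put back.
    indices_by_group: dict[str, list[int]] = defaultdict(list)
    for i, rollout in enumerate(rollouts):
        indices_by_group[rollout.group_id].append(i)

    rewards = [0.0] * len(rollouts)
    for indices in indices_by_group.values():
        group = [rollouts[i] for i in indices]
        group_rewards = score_group(group)
        if len(group_rewards) != len(group):
            raise ValueError("scorer must return one reward per rollout")
        for i, reward in zip(indices, group_rewards):
            rewards[i] = reward
    return rewards


def longest_wins(group: list[Rollout]) -> list[float]:
    # Toy group-relative scorer: reward 1.0 for the longest completion(s).
    longest = max(len(r.completion) for r in group)
    return [1.0 if len(r.completion) == longest else 0.0 for r in group]
```

A group-relative scorer such as `longest_wins` only makes sense when every rollout it compares answers the same prompt, which is why the same-group guarantee matters.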

Summary by CodeRabbit

  • New Features
    • Added vLLM-over-HTTP generation backend and integrated tool-call parsing.
    • Introduced DTensor v2 policy worker and configuration.
    • Added Verifiers (VF) integration: environments, rollouts, and GRPO runner.
    • Implemented Qwen3 and Qwen3-MoE custom models with parallelization and adapters.
    • Enhanced rollout logs (tool calls, env metrics) and HTML rendering in WandB.
    • Provided multiple GRPO config files and example environments (reverse-text, tool-test, math tools, alphabet-sort, env-group, judges).
  • Documentation
    • New guides and READMEs, including Torch nightly usage and VF env docs.
  • Chores
    • Added license file, dependency updates (vLLM, verifiers, flash-attn), workspace setup, and ignore rules.
    • Utility scripts for nightly setup, vLLM build, checkpoint conversion.

@AutumnAurelium requested review from a team as code owners on September 24, 2025 07:39
@coderabbitai
Contributor

coderabbitai Bot commented Sep 24, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds verifiers (VF) environment integration and a new vLLM-over-HTTP generation backend. Introduces multiple VF example envs, configs, and a GRPO runner. Expands GRPO to run VF rollouts, propagates tool-calls, enriches logging, and adds DTensor v2 policy workflow. Implements custom Qwen3/Qwen3MoE models, parallelization, and state-dict adapters. Updates dependencies and build scripts.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Repo meta | `/.gitignore`, `/LICENSE_TORCHTITAN`, `/default_runtime_env.yaml`, `/pyproject.toml` | Ignore agent dir; add BSD license file; no-op yaml churn; broaden deps (verifiers, vf-exts, vllm, flash-attn), add optional vllm-http extra, workspace member for vf-exts, and conflict rules. |
| Docs | `/docs/guides/torch-nightly.md`, `/examples/vf-envs/*/README.md` | New guide for nightly PyTorch and vLLM setup; READMEs for VF environments with usage and metrics. |
| VF extensions package | `/env_api/vf_exts/pyproject.toml`, `/env_api/vf_exts/vf_exts/__init__.py`, `/env_api/vf_exts/vf_exts/env/mt_env_group.py`, `/env_api/vf_exts/vf_exts/rubric/__init__.py`, `/env_api/vf_exts/vf_exts/rubric/btrm_judge.py` | Add vf-exts package: versioned API, MultiTurnEnvGroup wrapper, export PairwiseJudgeRubric; implement pairwise judge rubric with OpenAI client. |
| VF example environments | `/examples/vf-envs/vf_reverse_text/*`, `/examples/vf-envs/vf_alphabet_sort/*`, `/examples/vf-envs/vf_tool_test/*`, `/examples/vf-envs/vf_smolagents_math_tools/*`, `/examples/vf-envs/vf_pairwise_judge/*`, `/examples/vf-envs/vf_policy_judge/*`, `/examples/vf-envs/vf_multigroup_test/*` | Add multiple VF env loaders, rubrics, tools, datasets, and packaging configs; include multi-env group and pairwise/policy judge envs. |
| RL configs (GRPO) | `/examples/configs/rl/*/*.yaml`, `/examples/configs/rl/*/qwen3_4B.yaml`, `/examples/configs/rl/reverser/*` | New end-to-end GRPO YAMLs for Qwen3 variants and tasks (env group, pairwise/policy judge, reverser, tool test), including vllm_http generation settings. |
| Examples and tests | `/examples/run_grpo_vf.py`, `/examples/torchtitan_tests/*` | Add GRPO runner for VF integration and model-comparison scripts (Llama3, Qwen3, Qwen3-MoE) against HF baselines. |
| GRPO + rollout integration | `/nemo_rl/algorithms/grpo.py`, `/nemo_rl/experience/rollouts.py`, `/nemo_rl/experience/vf_rollouts.py`, `/nemo_rl/environments/vf_environment.py`, `/nemo_rl/data/__init__.py` | Add vllm_http backend hooks, VF rollout path, per-sample rollout logs and tool_calls propagation, VF environment wrapper, and tokenizer_kwargs in DataConfig. |
| vLLM HTTP backend | `/nemo_rl/models/generation/vllm_http/*`, `/nemo_rl/models/generation/__init__.py`, `/nemo_rl/models/generation/vllm/*worker*.py`, `/nemo_rl/distributed/ray_actor_environment_registry.py`, `/nemo_rl/distributed/virtual_cluster.py` | Introduce Ray Serve vLLM HTTP app, config, and generation stub; parse tool calls; register executables and actor mappings; add VLLM_HTTP executable. |
| Policy + DTensor v2 | `/nemo_rl/models/policy/__init__.py`, `/nemo_rl/models/policy/lm_policy.py`, `/nemo_rl/models/policy/dtensor_policy_worker.py`, `/nemo_rl/models/policy/dtensor_v2/v2_config.py`, `/nemo_rl/models/policy/dtensor_v2/v2_policy_worker.py` | Add DTensor v2 config, worker, and LM policy path; adjust input_ids dtype; compute device mesh and training/inference flows for DTensor v2. |
| Custom models core | `/nemo_rl/models/custom/{model.py,attention.py,expert_parallel.py,parallelize.py,utils.py}`, `/nemo_rl/models/custom/kernels/moe_indices.py` | Add base model/args, attention (flex/SDPA), MoE kernels/utilities, parallelization (TP/EP/FS/AC), DTensor helpers, and Triton kernel for MoE index permutation. |
| Qwen3 models | `/nemo_rl/models/custom/qwen3/{args.py,model.py,parallelize.py,state_dict_adapter.py}` | Implement Qwen3 model, args, per-layer TP plan, buffer replication, HF state-dict adapter. |
| Qwen3-MoE models | `/nemo_rl/models/custom/qwen3moe/{args.py,model.py,parallelize.py,state_dict_adapter.py}` | Implement Qwen3MoE model and args, parallelize wrapper, and HF state-dict adapter. |
| Generation workers changes | `/nemo_rl/models/generation/vllm/vllm_worker.py`, `/nemo_rl/models/generation/vllm/vllm_worker_async.py` | Decode generated texts and attach parsed tool_calls to outputs. |
| Distributed + utils | `/nemo_rl/distributed/batched_data_dict.py`, `/nemo_rl/utils/logger.py` | Tighten tensor type checks in batching; render rollout logs to HTML in WandB with helpers. |
| Tools | `/tools/setup-nightly-torch.sh`, `/tools/build-vllm-with-nightly.sh`, `/tools/convert_checkpoint.py`, `/tools/eliminate_torch_deps.py` | Scripts to set up nightly torch and vLLM, convert DCP to HF checkpoints, and strip torch deps from files. |
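
The "decode generated texts and attach parsed tool_calls" step could look roughly like the sketch below. It assumes tool calls arrive as JSON objects wrapped in `<tool_call>...</tool_call>` tags; that wire format, the `parse_tool_calls` and `attach_tool_calls` helpers, and the `text` / `tool_calls` keys are assumptions for illustration, not the parser this PR actually ships.

```python
import json
import re

# Assumed wire format: JSON objects wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls from decoded text, skipping bad ones."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Malformed JSON: ignore rather than fail the rollout.
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


def attach_tool_calls(output: dict) -> dict:
    """Return a copy of a generation output with a parsed 'tool_calls' field."""
    return {**output, "tool_calls": parse_tool_calls(output["text"])}
```

Skipping malformed calls instead of raising keeps one bad generation from aborting a whole batch; whether that is the right trade-off depends on how the environment wants to penalize invalid calls.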

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Trainer as GRPO Trainer
  participant Policy as VllmHttpGeneration
  participant Serve as Ray Serve (VLLMOpenAIServe)
  participant Env as VF Environment(s)
  Note over Trainer,Policy: Setup phase
  Trainer->>Policy: start Serve app + finish_generation()
  Policy->>Serve: Deploy vLLM HTTP
  Serve-->>Policy: Ready

  Note over Trainer,Env: Rollout (VF path)
  Trainer->>Env: a_generate(prompts, sampling_args)
  Env->>Serve: OpenAI-style completions (HTTP)
  Serve-->>Env: Completions (+tool_calls)
  Env-->>Trainer: Messages, rewards, metrics

  Note over Trainer: Train step and logging
  Trainer->>Trainer: Compute losses, update policy
  Trainer->>Serve: Optional admin ops (prefix cache reset)
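
To make the "OpenAI-style completions (HTTP)" step concrete, here is a small sketch of building a request body for a chat-completions endpoint. The field names (`model`, `messages`, `temperature`, `max_tokens`, `tools`) follow the OpenAI chat-completions schema; the `build_chat_request` helper itself is hypothetical, and the choice of the chat endpoint over plain completions is an assumption. It also shows one way to leave `max_tokens` unset so that the server's own limit applies, in line with the config change in the changelog.

```python
from typing import Any, Optional


def build_chat_request(
    model: str,
    messages: list[dict[str, Any]],
    temperature: float = 1.0,
    max_tokens: Optional[int] = None,
    tools: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Build a JSON-serializable body for POST /v1/chat/completions."""
    body: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }
    # Leave max_tokens out entirely when unset so the server-side limit applies.
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if tools:
        body["tools"] = tools
    return body
```
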
sequenceDiagram
  autonumber
  participant Runner as run_grpo_vf.py
  participant Loader as vf.load_environment
  participant Group as MultiTurnEnvGroup
  participant GRPO as Trainer
  Runner->>Loader: load VF env(s)
  Loader-->>Runner: Env, datasets, rubric
  Runner->>Group: Aggregate envs
  Runner->>GRPO: setup + train(config, env_group)
  GRPO->>Group: rollout per task
  Group->>Env: delegate rollout
  Env-->>GRPO: results + rewards
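
The delegation in this diagram can be sketched as a thin router keyed by task name. Everything here is hypothetical: `EnvGroup`, the `Env` callable type, and `reverse_env` are stand-ins for illustration and do not mirror the `MultiTurnEnvGroup` implementation in this PR.

```python
from typing import Callable

# A hypothetical environment is just a function from prompt to (completion, reward).
Env = Callable[[str], tuple[str, float]]


class EnvGroup:
    """Route each rollout to the sub-environment registered for its task."""

    def __init__(self, envs: dict[str, Env]) -> None:
        if not envs:
            raise ValueError("at least one environment is required")
        self._envs = dict(envs)

    def rollout(self, task: str, prompt: str) -> dict:
        try:
            env = self._envs[task]
        except KeyError:
            raise KeyError(f"no environment registered for task {task!r}") from None
        completion, reward = env(prompt)
        # Tag the result with its task so metrics can be split per environment.
        return {"task": task, "completion": completion, "reward": reward}


def reverse_env(prompt: str) -> tuple[str, float]:
    # Toy stand-in for a reverse-text environment: always answers correctly.
    return prompt[::-1], 1.0
```
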
sequenceDiagram
  autonumber
  participant Scorer as PairwiseJudgeRubric
  participant Judge as OpenAI Judge Client
  Scorer->>Judge: Prompt(A vs B)
  Judge-->>Scorer: <answer>1-7</answer>
  Scorer->>Scorer: Parse opinion → rewards
  Scorer-->>Caller: RolloutScores (+malformed flags)
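
The "parse opinion → rewards" step might be sketched as follows. The `<answer>1-7</answer>` format comes from the diagram; the direction of the scale (1 favors A, 7 favors B), the linear mapping to rewards, and the neutral 0.5 fallback for malformed responses are all assumptions made for this example, as is the `parse_opinion` name.

```python
import re

ANSWER_RE = re.compile(r"<answer>\s*([1-7])\s*</answer>")


def parse_opinion(judge_text: str) -> tuple[float, float, bool]:
    """Return (reward_a, reward_b, malformed) for one judge response."""
    match = ANSWER_RE.search(judge_text)
    if match is None:
        # No usable verdict: give both sides a neutral reward and flag it.
        return 0.5, 0.5, True
    opinion = int(match.group(1))
    # Assumed scale: 1 = A much better, 4 = tie, 7 = B much better.
    reward_b = (opinion - 1) / 6
    return 1.0 - reward_b, reward_b, False
```

Returning the malformed flag alongside the rewards lets the caller track how often the judge fails to follow the output format without discarding the rollout pair.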

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

Suggested labels

documentation, CI:L1

Suggested reviewers

  • terrykong
  • joyang-nv
  • yuki-97
  • jgerh

Poem

In burrows of code I hop with delight,
New envs to judge, new servers to light.
Qwen3 and friends, all nicely aligned—
GRPO munches carrots, metrics refined.
Tool-calls parsed, rewards in a heap—
Ship it, and let the clusters leap! 🥕✨

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63439ac and 28279fc.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (83)
  • .gitignore (1 hunks)
  • LICENSE_TORCHTITAN (1 hunks)
  • default_runtime_env.yaml (1 hunks)
  • docs/guides/torch-nightly.md (1 hunks)
  • env_api/vf_exts/pyproject.toml (1 hunks)
  • env_api/vf_exts/vf_exts/__init__.py (1 hunks)
  • env_api/vf_exts/vf_exts/env/mt_env_group.py (1 hunks)
  • env_api/vf_exts/vf_exts/rubric/__init__.py (1 hunks)
  • env_api/vf_exts/vf_exts/rubric/btrm_judge.py (1 hunks)
  • examples/configs/rl/alphabet_sort/qwen3_4B.yaml (1 hunks)
  • examples/configs/rl/env_group/qwen3_4B.yaml (1 hunks)
  • examples/configs/rl/pairwise_judge/qwen3_4B.yaml (1 hunks)
  • examples/configs/rl/policy_judge/qwen3_4B.yaml (1 hunks)
  • examples/configs/rl/reverser/qwen3_0.6B_2gpu.yaml (1 hunks)
  • examples/configs/rl/reverser/qwen3_0.6B_dtv1.yaml (1 hunks)
  • examples/configs/rl/reverser/qwen3_0.6B_dtv2.yaml (1 hunks)
  • examples/configs/rl/reverser/qwen3_30B_A3B.yaml (1 hunks)
  • examples/configs/rl/tooltest/qwen3_4B.yaml (1 hunks)
  • examples/run_grpo_vf.py (1 hunks)
  • examples/torchtitan_tests/llama3_compare.py (1 hunks)
  • examples/torchtitan_tests/qwen3_compare.py (1 hunks)
  • examples/torchtitan_tests/qwen3moe_compare.py (1 hunks)
  • examples/vf-envs/vf_alphabet_sort/README.md (1 hunks)
  • examples/vf-envs/vf_alphabet_sort/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_alphabet_sort/vf_alphabet_sort.py (1 hunks)
  • examples/vf-envs/vf_multigroup_test/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_multigroup_test/vf_multigroup_test.py (1 hunks)
  • examples/vf-envs/vf_pairwise_judge/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_pairwise_judge/vf_pairwise_judge.py (1 hunks)
  • examples/vf-envs/vf_policy_judge/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_policy_judge/vf_policy_judge.py (1 hunks)
  • examples/vf-envs/vf_reverse_text/README.md (1 hunks)
  • examples/vf-envs/vf_reverse_text/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_reverse_text/vf_reverse_text.py (1 hunks)
  • examples/vf-envs/vf_smolagents_math_tools/README.md (1 hunks)
  • examples/vf-envs/vf_smolagents_math_tools/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_smolagents_math_tools/vf_smolagents_math_tools.py (1 hunks)
  • examples/vf-envs/vf_tool_test/README.md (1 hunks)
  • examples/vf-envs/vf_tool_test/pyproject.toml (1 hunks)
  • examples/vf-envs/vf_tool_test/vf_tool_test.py (1 hunks)
  • nemo_rl/algorithms/grpo.py (15 hunks)
  • nemo_rl/data/__init__.py (2 hunks)
  • nemo_rl/distributed/batched_data_dict.py (1 hunks)
  • nemo_rl/distributed/ray_actor_environment_registry.py (1 hunks)
  • nemo_rl/distributed/virtual_cluster.py (1 hunks)
  • nemo_rl/environments/vf_environment.py (1 hunks)
  • nemo_rl/experience/rollouts.py (13 hunks)
  • nemo_rl/experience/vf_rollouts.py (1 hunks)
  • nemo_rl/models/custom/attention.py (1 hunks)
  • nemo_rl/models/custom/convert.py (1 hunks)
  • nemo_rl/models/custom/expert_parallel.py (1 hunks)
  • nemo_rl/models/custom/kernels/moe_indices.py (1 hunks)
  • nemo_rl/models/custom/model.py (1 hunks)
  • nemo_rl/models/custom/moe.py (1 hunks)
  • nemo_rl/models/custom/parallelize.py (1 hunks)
  • nemo_rl/models/custom/qwen3/args.py (1 hunks)
  • nemo_rl/models/custom/qwen3/model.py (1 hunks)
  • nemo_rl/models/custom/qwen3/parallelize.py (1 hunks)
  • nemo_rl/models/custom/qwen3/state_dict_adapter.py (1 hunks)
  • nemo_rl/models/custom/qwen3moe/args.py (1 hunks)
  • nemo_rl/models/custom/qwen3moe/model.py (1 hunks)
  • nemo_rl/models/custom/qwen3moe/parallelize.py (1 hunks)
  • nemo_rl/models/custom/qwen3moe/state_dict_adapter.py (1 hunks)
  • nemo_rl/models/custom/state_dict_adapter.py (1 hunks)
  • nemo_rl/models/custom/utils.py (1 hunks)
  • nemo_rl/models/generation/__init__.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker.py (4 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (1 hunks)
  • nemo_rl/models/generation/vllm_http/config.py (1 hunks)
  • nemo_rl/models/generation/vllm_http/vllm_http.py (1 hunks)
  • nemo_rl/models/generation/vllm_http/vllm_http_generation.py (1 hunks)
  • nemo_rl/models/generation/vllm_http/worker_ext.py (1 hunks)
  • nemo_rl/models/policy/__init__.py (2 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker.py (3 hunks)
  • nemo_rl/models/policy/dtensor_v2/v2_config.py (1 hunks)
  • nemo_rl/models/policy/dtensor_v2/v2_policy_worker.py (1 hunks)
  • nemo_rl/models/policy/lm_policy.py (5 hunks)
  • nemo_rl/utils/logger.py (2 hunks)
  • pyproject.toml (5 hunks)
  • tools/build-vllm-with-nightly.sh (1 hunks)
  • tools/convert_checkpoint.py (1 hunks)
  • tools/eliminate_torch_deps.py (1 hunks)
  • tools/setup-nightly-torch.sh (1 hunks)


@github-actions (bot) added the Documentation label (Improvements or additions to documentation) on Sep 24, 2025
@AutumnAurelium
Author

Apologies, meant to submit this to my own fork and fat-fingered the dropdown.

@AutumnAurelium deleted the aria/grouped-fix branch on September 28, 2025 01:56