Grouped Rubric Bugfixes and Demo Prep#1198
Grouped Rubric Bugfixes and Demo Prep#1198AutumnAurelium wants to merge 222 commits intoNVIDIA-NeMo:mainfrom
Conversation
|
Caution Review failedThe pull request is closed. WalkthroughAdds verifiers (VF) environment integration and a new vLLM-over-HTTP generation backend. Introduces multiple VF example envs, configs, and a GRPO runner. Expands GRPO to run VF rollouts, propagates tool-calls, enriches logging, and adds DTensor v2 policy workflow. Implements custom Qwen3/Qwen3MoE models, parallelization, and state-dict adapters. Updates dependencies and build scripts. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant Trainer as GRPO Trainer
participant Policy as VllmHttpGeneration
participant Serve as Ray Serve (VLLMOpenAIServe)
participant Env as VF Environment(s)
Note over Trainer,Policy: Setup phase
Trainer->>Policy: start Serve app + finish_generation()
Policy->>Serve: Deploy vLLM HTTP
Serve-->>Policy: Ready
Note over Trainer,Env: Rollout (VF path)
Trainer->>Env: a_generate(prompts, sampling_args)
Env->>Serve: OpenAI-style completions (HTTP)
Serve-->>Env: Completions (+tool_calls)
Env-->>Trainer: Messages, rewards, metrics
Note over Trainer: Train step and logging
Trainer->>Trainer: Compute losses, update policy
Trainer->>Serve: Optional admin ops (prefix cache reset)
sequenceDiagram
autonumber
participant Runner as run_grpo_vf.py
participant Loader as vf.load_environment
participant Group as MultiTurnEnvGroup
participant GRPO as Trainer
Runner->>Loader: load VF env(s)
Loader-->>Runner: Env, datasets, rubric
Runner->>Group: Aggregate envs
Runner->>GRPO: setup + train(config, env_group)
GRPO->>Group: rollout per task
Group->>Env: delegate rollout
Env-->>GRPO: results + rewards
sequenceDiagram
autonumber
participant Scorer as PairwiseJudgeRubric
participant Judge as OpenAI Judge Client
Scorer->>Judge: Prompt(A vs B)
Judge-->>Scorer: <answer>1-7</answer>
Scorer->>Scorer: Parse opinion → rewards
Scorer-->>Caller: RolloutScores (+malformed flags)
Estimated code review effort🎯 5 (Critical) | ⏱️ ~150 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing touches
🧪 Generate unit tests
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (83)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Apologies, meant to submit this to my own fork and fat-fingered the dropdown. |
Changelog:
Rubric#score_rolloutsnow receive rollouts guaranteed to be part of the same group.GroupedRubricin favor of the above pattern.PairwiseJudgeRubricto use the new system.max_new_tokenssampling param, in favor of vLLM's configured max context length.Summary by CodeRabbit