
feat: Support Reward Model based Environments #1026

Merged
terrykong merged 45 commits into main from ruit/reward_model on Sep 20, 2025

Conversation

@RayenTian (Contributor) commented Aug 30, 2025

What does this PR do?

Adds support for a reward model environment.

Issues

Closes #670.

Usage

uv run examples/run_grpo_rm.py --config examples/configs/grpo_rm_1B.yaml 
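
For reference, here is a rough sketch of what the reward-model environment block in grpo_rm_1B.yaml might look like, assembled from the override keys listed under Training Results below; treat the exact structure and defaults as assumptions rather than the file's verbatim contents:

```yaml
# Hypothetical shape of the env.reward_model section. Key names come from
# the overrides shown in this PR description; the enabled/num_nodes fields
# and all default values are assumptions.
env:
  reward_model:
    enabled: true                                    # assumed toggle
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
    precision: "bfloat16"
    batch_size: 32
    resources:
      gpus_per_node: 4
      num_nodes: 1                                   # assumed
    dtensor_cfg:
      tensor_parallel_size: 4
```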

Experiment Results

Reward Environment Correctness

The correctness of the reward model environment was validated with reward-bench: we evaluated "Skywork/Skywork-Reward-V2-Qwen3-8B" on the allenai/reward-bench dataset and compared the scores from the reference implementation against ours. The comparison is shown below:

| Subset | reward-bench (reference) | Our implementation |
| --- | --- | --- |
| alpacaeval-easy | 0.99 | 0.99 |
| alpacaeval-hard | 0.968421052631579 | 0.968421052631579 |
| alpacaeval-length | 0.9789473684210527 | 0.9789473684210527 |
| donotanswer | 0.8161764705882353 | 0.7941176470588235 |
| hep-cpp | 0.9817073170731707 | 0.975609756097561 |
| hep-go | 0.9817073170731707 | 0.9939024390243902 |
| hep-java | 1.0 | 0.9939024390243902 |
| hep-js | 0.9878048780487805 | 0.9939024390243902 |
| hep-python | 1.0 | 0.9939024390243902 |
| hep-rust | 0.9817073170731707 | 0.9695121951219512 |
| llmbar-adver-GPTInst | 0.8586956521739131 | 0.8586956521739131 |
| llmbar-adver-GPTOut | 0.7659574468085106 | 0.8085106382978723 |
| llmbar-adver-manual | 0.782608695652174 | 0.782608695652174 |
| llmbar-adver-neighbor | 0.7910447761194029 | 0.7910447761194029 |
| llmbar-natural | 0.93 | 0.94 |
| math-prm | 0.9798657718120806 | 0.9798657718120806 |
| mt-bench-easy | 1.0 | 1.0 |
| mt-bench-hard | 0.8648648648648649 | 0.8648648648648649 |
| mt-bench-med | 1.0 | 1.0 |
| refusals-dangerous | 0.98 | 0.98 |
| refusals-offensive | 0.98 | 0.98 |
| xstest-should-refuse | 0.9675324675324676 | 0.9675324675324676 |
| xstest-should-respond | 0.952 | 0.956 |

Training Results

DTensor Path, Colocated

Config:

  • policy.max_total_sequence_length=2048
  • cluster.gpus_per_node=8
  • data.dataset_name=OpenMathInstruct-2
  • env.reward_model.model_name=Skywork/Skywork-Reward-V2-Qwen3-0.6B
  • env.reward_model.precision="bfloat16"
  • env.reward_model.batch_size=32
  • env.reward_model.resources.gpus_per_node=4
  • env.reward_model.dtensor_cfg.tensor_parallel_size=4
  • policy.dtensor_cfg.tensor_parallel_size=2
  • policy.precision="bfloat16"
*(training curve image)*

DTensor Path, Non-colocated

Config:

  • policy.max_total_sequence_length=2048
  • cluster.gpus_per_node=8
  • data.dataset_name=OpenMathInstruct-2
  • env.reward_model.model_name=Skywork/Skywork-Reward-V2-Qwen3-0.6B
  • env.reward_model.precision="bfloat16"
  • env.reward_model.batch_size=32
  • env.reward_model.resources.gpus_per_node=8
  • env.reward_model.dtensor_cfg.tensor_parallel_size=4
  • policy.dtensor_cfg.tensor_parallel_size=2
  • policy.precision="bfloat16"
  • policy.generation.colocated.enabled=False
  • policy.generation.colocated.resources.gpus_per_node=4
  • policy.generation.colocated.resources.num_nodes=1
*(training curve image)*
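
The non-colocated overrides map onto the generation block roughly as follows (a minimal sketch, assuming the dotted override paths mirror the YAML nesting):

```yaml
# Non-colocated inference: generation workers get dedicated GPUs instead of
# sharing the policy's. Keys mirror the dotted overrides listed above.
policy:
  generation:
    colocated:
      enabled: false
      resources:
        gpus_per_node: 4
        num_nodes: 1
```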

Summary by CodeRabbit

  • New Features

    • Reward-model environment for GRPO: scoring, metrics, graceful shutdown, and resource-aware deployment (colocated/separated inference).
    • Distributed policy scoring API to run model scoring across workers (new Policy.score interface).
    • New HF-format math data processor and updated math example usage.
    • New GRPO reward-model example script and sample config for reward-model runs.
  • Documentation

    • New Environments guide; updated GRPO and Reward Model guides; docs navigation updated; minor formatting fix.
  • Tests

    • Functional E2E test for GRPO+reward-model; unit tests for Reward Model Environment and HF math processor.

@github-actions Bot added the Documentation (Improvements or additions to documentation) label on Aug 30, 2025
@github-actions

ℹ️ File Consistency Check

Check based on commit: a316042 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: 81d92b6 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian added the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: eddebe3 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yuki-97 added the CI:docs (Run doctest) label and removed the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: c212d70 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian requested a review from yuki-97 on September 1, 2025 06:48
@RayenTian added the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: df87157 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian removed and re-added the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

ℹ️ File Consistency Check

Check based on commit: b0890d4 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@afennelly-mitre

@RayenTian how difficult would it be to extend the current reward model environment to use the megatron backend instead? Is there any reason you chose to implement with the dtensor backend first?

For context, I'm working on a training setup where we need a reward model for the reward signal, but also need the megatron backend so we can train up to 70B-parameter models (the currently supported upper bound in NeMo RL).

Any insights on your end would be greatly appreciated - I'm currently reviewing your implementation and seeing how I can translate it to use the megatron backend instead.

I know the Reward Model Training on Megatron backend issue is currently open, but there was no response to my question on that issue from two weeks ago. I'm just trying to gauge whether this is an active development effort for the NeMo RL maintainers.

@RayenTian (Contributor, Author)

> @RayenTian how difficult would it be to extend the current reward model environment to use the megatron backend instead? Is there any reason you chose to implement with the dtensor backend first?
>
> For context, I'm working on a training setup where we need a reward model for the reward signal, but also need the megatron backend so we can train up to 70B-parameter models (the currently supported upper bound in NeMo RL).
>
> I know the Reward Model Training on Megatron backend issue is currently open, but there was no response to my question on that issue from two weeks ago. I'm just trying to gauge whether this is an active development effort for the NeMo RL maintainers.

Hi @afennelly-mitre, thanks a lot for your support and interest in NeMo RL!

Using the mcore backend for the reward model (RM) is not ready yet, which is why we implemented the RM with the dtensor backend first, as you may have seen in the discussion on issue #720.

For your setup, if you only need the RM as the environment, it is fully supported to train your policy model with the mcore backend while keeping the RM on dtensor. You can achieve this simply by changing the backend field for the policy in your configuration file, as sketched below.
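
A rough sketch of such a mixed-backend config (the megatron_cfg/dtensor_cfg toggles follow NeMo RL's usual conventions, but treat the exact field names as assumptions and check the shipped example configs):

```yaml
# Sketch only: the policy trains on the mcore (Megatron) backend while the
# reward-model environment stays on dtensor. Field names are assumptions.
policy:
  megatron_cfg:
    enabled: true        # policy training on the Megatron backend
  dtensor_cfg:
    enabled: false
env:
  reward_model:
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-8B"
    dtensor_cfg:
      tensor_parallel_size: 4
```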


Labels

  • CI:L1 (Run doctests, unit tests, and functional tests)
  • Documentation (Improvements or additions to documentation)
  • r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues:

  • Reward Model based Environments (#670)

8 participants