
feat: Support Reward Model based Environments #1026

Merged
terrykong merged 45 commits into main from ruit/reward_model on Sep 20, 2025

Conversation

@RayenTian (Contributor) commented Aug 30, 2025

What does this PR do?

Adds support for a reward model environment.

Issues

Closes #670.

Usage

uv run examples/run_grpo_rm.py --config examples/configs/grpo_rm_1B.yaml 
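
For reference, here is a rough sketch of what the reward-model environment block in grpo_rm_1B.yaml might look like, assembled from the override keys listed under Training Results below; treat the exact structure and defaults as assumptions rather than the file's verbatim contents:

```yaml
# Hypothetical shape of the env.reward_model section. Key names come from
# the overrides shown in this PR description; the enabled/num_nodes fields
# and all default values are assumptions.
env:
  reward_model:
    enabled: true                                    # assumed toggle
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
    precision: "bfloat16"
    batch_size: 32
    resources:
      gpus_per_node: 4
      num_nodes: 1                                   # assumed
    dtensor_cfg:
      tensor_parallel_size: 4
```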

Experiment Results

Reward Environment Correctness

The correctness of the reward model environment was validated with reward-bench: we evaluated "Skywork/Skywork-Reward-V2-Qwen3-8B" on the allenai/reward-bench dataset and compared the scores from the reference implementation against ours. The comparison is shown below:

| Subset | reward-bench (reference) | Our implementation |
| --- | --- | --- |
| alpacaeval-easy | 0.99 | 0.99 |
| alpacaeval-hard | 0.968421052631579 | 0.968421052631579 |
| alpacaeval-length | 0.9789473684210527 | 0.9789473684210527 |
| donotanswer | 0.8161764705882353 | 0.7941176470588235 |
| hep-cpp | 0.9817073170731707 | 0.975609756097561 |
| hep-go | 0.9817073170731707 | 0.9939024390243902 |
| hep-java | 1.0 | 0.9939024390243902 |
| hep-js | 0.9878048780487805 | 0.9939024390243902 |
| hep-python | 1.0 | 0.9939024390243902 |
| hep-rust | 0.9817073170731707 | 0.9695121951219512 |
| llmbar-adver-GPTInst | 0.8586956521739131 | 0.8586956521739131 |
| llmbar-adver-GPTOut | 0.7659574468085106 | 0.8085106382978723 |
| llmbar-adver-manual | 0.782608695652174 | 0.782608695652174 |
| llmbar-adver-neighbor | 0.7910447761194029 | 0.7910447761194029 |
| llmbar-natural | 0.93 | 0.94 |
| math-prm | 0.9798657718120806 | 0.9798657718120806 |
| mt-bench-easy | 1.0 | 1.0 |
| mt-bench-hard | 0.8648648648648649 | 0.8648648648648649 |
| mt-bench-med | 1.0 | 1.0 |
| refusals-dangerous | 0.98 | 0.98 |
| refusals-offensive | 0.98 | 0.98 |
| xstest-should-refuse | 0.9675324675324676 | 0.9675324675324676 |
| xstest-should-respond | 0.952 | 0.956 |

Training Results

DTensor Path, Colocated

Config:

  • policy.max_total_sequence_length=2048
  • cluster.gpus_per_node=8
  • data.dataset_name=OpenMathInstruct-2
  • env.reward_model.model_name=Skywork/Skywork-Reward-V2-Qwen3-0.6B
  • env.reward_model.precision="bfloat16"
  • env.reward_model.batch_size=32
  • env.reward_model.resources.gpus_per_node=4
  • env.reward_model.dtensor_cfg.tensor_parallel_size=4
  • policy.dtensor_cfg.tensor_parallel_size=2
  • policy.precision="bfloat16"
*(training curve image)*

DTensor Path, Non-colocated

Config:

  • policy.max_total_sequence_length=2048
  • cluster.gpus_per_node=8
  • data.dataset_name=OpenMathInstruct-2
  • env.reward_model.model_name=Skywork/Skywork-Reward-V2-Qwen3-0.6B
  • env.reward_model.precision="bfloat16"
  • env.reward_model.batch_size=32
  • env.reward_model.resources.gpus_per_node=8
  • env.reward_model.dtensor_cfg.tensor_parallel_size=4
  • policy.dtensor_cfg.tensor_parallel_size=2
  • policy.precision="bfloat16"
  • policy.generation.colocated.enabled=False
  • policy.generation.colocated.resources.gpus_per_node=4
  • policy.generation.colocated.resources.num_nodes=1
*(training curve image)*
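
The non-colocated overrides map onto the generation block roughly as follows (a minimal sketch, assuming the dotted override paths mirror the YAML nesting):

```yaml
# Non-colocated inference: generation workers get dedicated GPUs instead of
# sharing the policy's. Keys mirror the dotted overrides listed above.
policy:
  generation:
    colocated:
      enabled: false
      resources:
        gpus_per_node: 4
        num_nodes: 1
```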

Summary by CodeRabbit

  • New Features

    • Reward-model environment for GRPO: scoring, metrics, graceful shutdown, and resource-aware deployment (colocated/separated inference).
    • Distributed policy scoring API to run model scoring across workers (new Policy.score interface).
    • New HF-format math data processor and updated math example usage.
    • New GRPO reward-model example script and sample config for reward-model runs.
  • Documentation

    • New Environments guide; updated GRPO and Reward Model guides; docs navigation updated; minor formatting fix.
  • Tests

    • Functional E2E test for GRPO+reward-model; unit tests for Reward Model Environment and HF math processor.

@github-actions Bot added the Documentation (Improvements or additions to documentation) label on Aug 30, 2025
@github-actions

ℹ️ File Consistency Check

Check based on commit: a316042 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: 81d92b6 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian added the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: eddebe3 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yuki-97 added the CI:docs (Run doctest) label and removed the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: c212d70 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian requested a review from yuki-97 on September 1, 2025 06:48
@RayenTian added the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

github-actions Bot commented Sep 1, 2025

ℹ️ File Consistency Check

Check based on commit: df87157 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian removed and re-added the CI:L0 (Run doctests and unit tests) label on Sep 1, 2025
@github-actions

ℹ️ File Consistency Check

Check based on commit: b0890d4 (PR #1026 from ruit/reward_model)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@afennelly-mitre

@RayenTian how difficult would it be to extend the current reward model environment to use the megatron backend instead? Is there any reason you chose to implement with the dtensor backend first?

For context, I'm working on a training setup where we need a reward model for the reward signal, but also need the megatron backend so we can train up to 70B-parameter models (the currently supported upper bound in NeMo RL).

Any insights on your end would be greatly appreciated - I'm currently reviewing your implementation and seeing how I can translate it to use the megatron backend instead.

I know the Reward Model Training on Megatron backend issue is currently open, but there was no response to my question on that issue from two weeks ago. I'm just trying to gauge whether this is an active development effort for the NeMo RL maintainers.

@RayenTian (Contributor, Author)

> @RayenTian how difficult would it be to extend the current reward model environment to use the megatron backend instead? Is there any reason you chose to implement with the dtensor backend first?
>
> For context, I'm working on a training setup where we need a reward model for the reward signal, but also need the megatron backend so we can train up to 70B-parameter models (the currently supported upper bound in NeMo RL).
>
> I know the Reward Model Training on Megatron backend issue is currently open, but there was no response to my question on that issue from two weeks ago. I'm just trying to gauge whether this is an active development effort for the NeMo RL maintainers.

Hi @afennelly-mitre, thanks a lot for your support and interest in NeMo RL!

Using the mcore backend for the reward model (RM) is not ready yet, which is why we implemented the RM with the dtensor backend first, as you may have seen in the discussion on issue #720.

For your setup, if you only need the RM as the environment, it is fully supported to train your policy model with the mcore backend while keeping the RM on dtensor. You can achieve this simply by changing the backend field for the policy in your configuration file, as sketched below.
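
A rough sketch of such a mixed-backend config (the megatron_cfg/dtensor_cfg toggles follow NeMo RL's usual conventions, but treat the exact field names as assumptions and check the shipped example configs):

```yaml
# Sketch only: the policy trains on the mcore (Megatron) backend while the
# reward-model environment stays on dtensor. Field names are assumptions.
policy:
  megatron_cfg:
    enabled: true        # policy training on the Megatron backend
  dtensor_cfg:
    enabled: false
env:
  reward_model:
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-8B"
    dtensor_cfg:
      tensor_parallel_size: 4
```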


Labels

  • CI:L1 (Run doctests, unit tests, and functional tests)
  • Documentation (Improvements or additions to documentation)
  • r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues:

  • Reward Model based Environments (#670)

8 participants