feat: Support Reward Model based Environments #1026
Conversation
ℹ️ File Consistency Check (based on commit a316042, PR #1026) — DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR.
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
Force-pushed from a316042 to 81d92b6.
Signed-off-by: ruit <ruit@nvidia.com>
…worker itself set them. Signed-off-by: ruit <ruit@nvidia.com>
…esolve in load_checkpoint Signed-off-by: ruit <ruit@nvidia.com>
…del strategy. Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian how difficult would it be to extend the current reward model environment to use the Megatron backend instead? Is there any reason you chose to implement with the DTensor backend first? For context, I'm working on a training setup where we need a reward model for the reward signal, but also need the Megatron backend so we can train models up to 70B parameters (the currently supported upper bound in NeMo RL). Any insights on your end would be greatly appreciated; I'm currently reviewing your implementation and seeing how I can translate it to the Megatron backend.
I know the Reward Model Training on Megatron backend issue is currently open, but my question on that issue from two weeks ago got no response. I'm just trying to gauge whether this is an active development effort for the NeMo RL maintainers.
Hi, @afennelly-mitre. Thanks a lot for your support and interest in NeMo RL! Currently, using the mcore backend for training the reward model (RM) is not ready yet, which is why we implemented the RM with the DTensor backend first, as discussed on issue #720. For your setup, if you only need the RM as the environment, it is fully supported to train your policy model with the mcore backend while keeping the RM on DTensor. You can achieve this simply by modifying the backend field for the policy in your configuration file.
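A minimal sketch of what such a mixed-backend configuration might look like. The field names below are assumptions inferred from the override keys shown later in this PR, not a verified NeMo RL schema; check them against your actual config file:

```yaml
# Hypothetical sketch: policy trained on the Megatron (mcore) backend,
# reward-model environment kept on the DTensor backend.
# Exact keys and backend identifiers are assumptions; verify locally.
policy:
  backend: "megatron"          # assumed field for selecting the mcore backend
  precision: "bfloat16"
env:
  reward_model:
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-8B"
    precision: "bfloat16"
    dtensor_cfg:
      tensor_parallel_size: 4  # RM stays on DTensor, per the reply above
```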
What does this PR do ?
Support reward model environment.
Issues
Closes #670
Usage
Experiment Results
Reward Environment Correctness
The correctness of the reward model (Skywork/Skywork-Reward-V2-Qwen3-8B) was validated with reward-bench: the model was evaluated on the allenai/reward-bench dataset, and the results of the reference implementation and of our implementation were compared. The comparison is shown below:
Results from reward-bench:
Results from our implementation:
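For context on what the comparison above measures: reward-bench-style correctness checks boil down to pairwise accuracy, the fraction of prompt pairs where the reward model scores the chosen response above the rejected one. A minimal sketch of that metric (the function name is ours for illustration, not from this PR):

```python
def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of (chosen, rejected) pairs where the reward model
    ranks the chosen response strictly above the rejected one."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must be the same length")
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return wins / len(chosen_scores)

# Toy example: the model ranks chosen above rejected in 2 of 3 pairs.
print(pairwise_accuracy([2.1, 0.5, 1.7], [1.0, 0.9, -0.3]))  # → 0.6666666666666666
```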
Training Results
DTensor Path, Colocated
Config:
```
policy.max_total_sequence_length=2048
cluster.gpus_per_node=8
data.dataset_name=OpenMathInstruct-2
env.reward_model.model_name=Skywork/Skywork-Reward-V2-Qwen3-0.6B
env.reward_model.precision="bfloat16"
env.reward_model.batch_size=32
env.reward_model.resources.gpus_per_node=4
env.reward_model.dtensor_cfg.tensor_parallel_size=4
policy.dtensor_cfg.tensor_parallel_size=2
policy.precision="bfloat16"
```
DTensor Path, Non-colocated
Config:
```
policy.max_total_sequence_length=2048
cluster.gpus_per_node=8
data.dataset_name=OpenMathInstruct-2
env.reward_model.model_name=Skywork/Skywork-Reward-V2-Qwen3-0.6B
env.reward_model.precision="bfloat16"
env.reward_model.batch_size=32
env.reward_model.resources.gpus_per_node=8
env.reward_model.dtensor_cfg.tensor_parallel_size=4
policy.dtensor_cfg.tensor_parallel_size=2
policy.precision="bfloat16"
policy.generation.colocated.enabled=False
policy.generation.colocated.resources.gpus_per_node=4
policy.generation.colocated.resources.num_nodes=1
```
Summary by CodeRabbit
New Features
Documentation
Tests