fix: fix and re-enable rm env functional test (#1905)
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Aniket Singh Yadav <singhyadavaniket43@gmail.com>
PR: Enhance GRPO setup to support reward model environment configuration
Summary

This PR fixes how GRPO determines whether the reward model environment is used, so that cluster resource setup (GPU/node allocation for policy, inference, and reward model) is correct when train or validation data uses the `reward_model` env. Previously, reward-model usage was inferred inside `grpo.setup()` from `data_config["env_name"] == "reward_model"`, which only reflects the default env and can miss cases where the reward model env is used only in validation or in a non-default task. This change moves the detection to the entrypoint, using the actual env names required by the data config, and passes a single flag into `setup()`.

Motivation

- With `data.default.env_name: "reward_model"`, or with any train/validation task that uses the reward model env, the cluster must reserve GPUs/nodes for the reward model. Relying only on `data_config["env_name"]` can be wrong when the reward model env is needed outside the default task.
- The entrypoint already computes the required env names via `extract_necessary_env_names(config["data"])`. Using that list to decide whether the reward model env is needed avoids duplicating logic and keeps behavior aligned with what the data pipeline actually uses.

Changes
1. `examples/run_grpo.py`
   - Import `extract_necessary_env_names` from `nemo_rl.data.datasets`.
   - Before calling `setup_data_with_envs()`, compute:
     - `env_name_list = extract_necessary_env_names(config["data"])`
     - `rm_env_enabled = "reward_model" in env_name_list`
   - Pass `rm_env_enabled=rm_env_enabled` into `setup()`.
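A minimal sketch of the entrypoint-side detection described above. The helper body below is a hypothetical stand-in (the real `extract_necessary_env_names` lives in `nemo_rl.data.datasets`; its exact logic and the data-config schema are assumptions here), but it illustrates the case the old default-only check missed:

```python
# Illustrative sketch, NOT the real nemo_rl implementation: derive
# rm_env_enabled from every env name the data config requires, rather
# than from the default task's env_name alone.

def extract_necessary_env_names(data_config: dict) -> list[str]:
    # Hypothetical stand-in: collect env_name from the default task and
    # any train/validation task overrides.
    names = set()
    for section in ("default", "train", "validation"):
        task = data_config.get(section)
        if isinstance(task, dict) and "env_name" in task:
            names.add(task["env_name"])
    return sorted(names)

config = {
    "data": {
        "default": {"env_name": "math"},
        # Reward model env appears only in validation -- exactly the case
        # the old check data_config["env_name"] == "reward_model" missed.
        "validation": {"env_name": "reward_model"},
    }
}

env_name_list = extract_necessary_env_names(config["data"])
rm_env_enabled = "reward_model" in env_name_list
print(rm_env_enabled)  # → True
```

Because the flag is computed once at the entrypoint and passed down, `setup()` no longer needs to know anything about the data-config schema.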
2. `nemo_rl/algorithms/grpo.py`
   - `setup()` gains a new parameter `rm_env_enabled: bool = False`.
   - Removes the derivation of `reward_model_enabled` from `data_config["env_name"]`.
   - Uses `rm_env_enabled` everywhere the previous `reward_model_enabled` was used: when `total_nodes == 1` and when asserting `train_gpus_per_node > 0`.
   - Behavior of cluster setup (colocated vs non-colocated, single vs multi-node, reward model reservation) is unchanged, except that the "reward model env in use" flag is now derived from the data config's required env names instead of the default env name only.
3. `tests/functional/L1_Functional_Tests_GPU.sh`
   - Adds `time uv run --no-sync bash ./tests/functional/grpo_rm.sh` so that the reward model env path is exercised in the L1 GPU tests.
Testing

- `bash tests/functional/grpo_rm_env.sh` (or `grpo_rm.sh` as wired into L1).