Skip to content

refactor: reuse setup data#1808

Merged
terrykong merged 3 commits intomainfrom
yukih/reuse-setup-data
Jan 23, 2026
Merged

refactor: reuse setup data#1808
terrykong merged 3 commits intomainfrom
yukih/reuse-setup-data

Conversation

@yuki-97
Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 commented Jan 22, 2026

We use a same setup_data function in different run_xxx.py, this PR move it to nemo_rl/data/utils.py and reuse it.

Summary by CodeRabbit

Release Notes

  • Refactor
    • Consolidated data setup logic across example training scripts into centralized utilities, reducing code duplication and improving maintainability.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 requested review from a team as code owners January 22, 2026 07:46
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jan 22, 2026
@yuki-97 yuki-97 changed the title feat: reuse setup data refactor: reuse setup data Jan 22, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

Consolidates duplicated data setup logic from five example scripts (run_distillation_math.py, run_grpo.py, run_grpo_math.py, run_grpo_rm.py, run_vlm_grpo.py) into a centralized setup_data_with_envs utility function in nemo_rl/data/utils.py. Each example script is updated to import and call the new centralized function instead of maintaining local implementations.

Changes

Cohort / File(s) Summary
Example scripts refactoring
examples/run_distillation_math.py, examples/run_grpo.py, examples/run_grpo_math.py, examples/run_grpo_rm.py, examples/run_vlm_grpo.py
Removed local setup_data function implementations (~100 lines each) and associated dataset/environment utility imports. Updated main() to call setup_data_with_envs(tokenizer, config["data"], config["env"]) imported from nemo_rl.data.utils.
New centralized utility
nemo_rl/data/utils.py
Added setup_data_with_envs() function that orchestrates environment creation and dataset preparation. Validates train split presence, builds environments, loads/processes datasets, optionally merges validation data, and returns training/validation datasets with task-to-environment mappings.
Minor documentation update
examples/run_sft.py
Added single-line TODO comment above setup_data function noting future refactor to move implementation to nemo_rl/data/utils.py.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor: reuse setup data' accurately summarizes the main change: consolidating duplicated setup_data functions across multiple example files into a single reusable utility function in nemo_rl/data/utils.py.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR consolidates duplicate setup_data functions from multiple example files into a centralized utility, preserving existing behavior and logic without introducing new functionality or affecting numerics, convergence, or performance.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/run_vlm_grpo.py (1)

93-100: Update stale comment about a “local” helper.

Line 94 says the setup function is local, but it now comes from nemo_rl.data.utils.

✏️ Suggested rewording
-    # this function is local to this script, and can be extended to other VLM datasets
+    # shared helper for VLM dataset/env setup (extend in nemo_rl.data.utils if needed)
🤖 Fix all issues with AI agents
In `@nemo_rl/data/utils.py`:
- Line 1: Update the file header year from 2025 to 2026: modify the top-of-file
copyright comment in nemo_rl/data/utils.py (the file header line starting with
"# Copyright (c)") so it reads 2026 to match repo header requirements.
- Around line 31-41: Add a Google-style docstring for the public function
setup_data_with_envs describing its purpose, parameters, return values, and
exceptions: include a short summary line, an Args section listing tokenizer
(AutoProcessor|AutoTokenizer), data_config (DataConfig), env_configs (dict[str,
Any]), and is_vlm (bool) with brief types and meanings, a Returns section
documenting the tuple (AllTaskProcessedDataset,
Optional[AllTaskProcessedDataset], dict[str, EnvironmentInterface], dict[str,
EnvironmentInterface]) and what each element represents, and an optional Raises
section for any exceptions that can be thrown; place this docstring immediately
under the def setup_data_with_envs(...) signature to satisfy the public API doc
guidelines.

Comment thread nemo_rl/data/utils.py
Comment thread nemo_rl/data/utils.py
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 22, 2026
@terrykong terrykong merged commit 5d7bc27 into main Jan 23, 2026
41 of 42 checks passed
@terrykong terrykong deleted the yukih/reuse-setup-data branch January 23, 2026 06:06
yfw pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
xavier-owkin pushed a commit to owkin/Owkin-NeMo-RL that referenced this pull request Feb 10, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants