refactor: refactor dataset module by yuki-97 · Pull Request #977 · NVIDIA-NeMo/RL

yuki-97 · 2025-08-25T15:49:30Z

What does this PR do?

Refactor the dataset module and add the load_response_dataset and load_preference_dataset APIs.
Support loading datasets from a local path or from Hugging Face:
1. ResponseDataset for SFT and RL.
2. BinaryPreferenceDataset and PreferenceDataset for DPO and RM.

Reorganize the module structure to:

├── __init__.py
├── processed_dataset.py
├── utils.py
├── eval_datasets
│   ├── __init__.py
│   ├── local_math_dataset.py
│   └── xxx_dataset.py
├── preference_datasets
│   ├── __init__.py
│   ├── binary_preference_dataset.py
│   ├── preference_dataset.py
│   └── xxx_dataset.py
└── response_datasets
    ├── __init__.py
    ├── response_dataset.py
    └── xxx_dataset.py

Remove data_cls, which duplicates dataset_name. After this PR, the dataset is selected as follows:
1. If dataset_name is a built-in dataset, we'll use that dataset class as before.
2. If dataset_name is in ["ResponseDataset", "BinaryPreferenceDataset", "PreferenceDataset"], we'll load data from local or Hugging Face through train_data_path and val_data_path.
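
The two cases above can be sketched in Python as follows. This is a minimal illustration, not the code in this PR: `resolve_dataset`, `GENERIC_DATASETS`, and `BUILTIN_DATASETS` are hypothetical names, and the built-in set lists only the datasets named in this description.

```python
# Hypothetical sketch of the dataset_name dispatch; not the NeMo RL implementation.

# Generic classes that read from train_data_path / val_data_path.
GENERIC_DATASETS = {"ResponseDataset", "BinaryPreferenceDataset", "PreferenceDataset"}

# Assumed subset of built-in names (only those mentioned in this PR description).
BUILTIN_DATASETS = {"squad", "open_assistant", "HelpSteer3"}


def resolve_dataset(data_config: dict) -> tuple[str, str]:
    """Return (kind, name), where kind is 'builtin' or 'generic'."""
    name = data_config["dataset_name"]
    if name in BUILTIN_DATASETS:
        # Case 1: a built-in dataset keeps using its own class.
        return ("builtin", name)
    if name in GENERIC_DATASETS:
        # Case 2: a generic class needs a local path or Hugging Face ID.
        if "train_data_path" not in data_config:
            raise ValueError(f"{name} requires train_data_path")
        return ("generic", name)
    raise ValueError(f"Unknown dataset_name: {name!r}")
```

With this shape, no separate data_cls field is needed because dataset_name alone determines which branch is taken.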

Issues

Related #909.

Usage

For datasets supported by default (e.g. squad, open_assistant, HelpSteer3, etc.), the usage is the same as before:
```
data:
  dataset_name: "squad"
  ...
```

For prompt-response data for SFT and RL, we'll use ResponseDataset:

data:
  dataset_name: ResponseDataset
  train_data_path: <PathToTrainingDataset>
  val_data_path: <PathToValidationDataset>
  input_key: <QuestionKey>  # default is "input"
  output_key: <AnswerKey>  # default is "output"
  train_split: <TrainSplit>  # used for Hugging Face datasets, default is None
  val_split: <ValSplit>  # used for Hugging Face datasets, default is None
  ...
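
To illustrate what input_key and output_key do, here is a minimal sketch that reads a local JSONL file and renames the configured keys to a fixed pair. It assumes one JSON object per line; the function name `load_response_records` and the normalized key names are hypothetical and are not taken from this PR.

```python
import json
from pathlib import Path


def load_response_records(path, input_key="input", output_key="output"):
    """Read a JSONL file and normalize each row to {'input': ..., 'output': ...}.

    Hypothetical helper for illustration; the real ResponseDataset may differ.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            missing = [k for k in (input_key, output_key) if k not in row]
            if missing:
                raise KeyError(f"{path}:{line_no} is missing keys {missing}")
            records.append({"input": row[input_key], "output": row[output_key]})
    return records
```

For a file whose rows look like `{"question": ..., "answer": ...}`, the call would be `load_response_records(path, input_key="question", output_key="answer")`.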

For preference data for DPO and RM, we'll use PreferenceDataset:

data:
  dataset_name: PreferenceDataset
  train_data_path: <PathToTrainingDataset>
  val_data_paths:
    <NameOfValidationDataset1>: <PathToValidationDataset1>
    ...
  train_split: <TrainSplit>  # used for Hugging Face datasets, default is None
  val_split: <ValSplit>  # used for Hugging Face datasets, default is None
  ...
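
Note that this config uses a val_data_paths mapping (name to path) rather than a single val_data_path. One way a consumer might handle both forms is sketched below. `collect_validation_paths` and the fallback name "validation" are assumptions made for illustration and are not part of this PR.

```python
def collect_validation_paths(data_config: dict) -> dict[str, str]:
    """Return {validation_set_name: path} from a data config.

    Hypothetical helper: accepts the val_data_paths mapping and, as an
    assumed convenience, a single val_data_path under the name 'validation'.
    """
    # Copy so the caller's config is not mutated.
    paths = dict(data_config.get("val_data_paths") or {})
    single = data_config.get("val_data_path")
    if single is not None:
        # 'validation' is a made-up default name for the single-path form.
        paths.setdefault("validation", single)
    return paths
```

Each entry in the returned mapping can then be loaded and evaluated separately under its own name.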

For binary preference data for DPO and RM, we'll use BinaryPreferenceDataset:

data:
  dataset_name: BinaryPreferenceDataset
  train_data_path: <PathToTrainingDataset>
  val_data_path: <PathToValidationDataset>
  prompt_key: <PromptKey>  # default is "prompt"
  chosen_key: <ChosenKey>  # default is "chosen"
  rejected_key: <RejectedKey>  # default is "rejected"
  train_split: <TrainSplit>  # used for Hugging Face datasets, default is None
  val_split: <ValSplit>  # used for Hugging Face datasets, default is None
  ...
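
As a rough illustration of the three key options, the sketch below maps one raw row onto a prompt/chosen/rejected triple. The helper `to_binary_preference` and its output layout are hypothetical; the actual BinaryPreferenceDataset may produce a different structure.

```python
def to_binary_preference(
    row: dict,
    prompt_key: str = "prompt",
    chosen_key: str = "chosen",
    rejected_key: str = "rejected",
) -> dict:
    """Map a raw row to {'prompt', 'chosen', 'rejected'} using configurable keys.

    Hypothetical helper for illustration only.
    """
    try:
        return {
            "prompt": row[prompt_key],
            "chosen": row[chosen_key],
            "rejected": row[rejected_key],
        }
    except KeyError as err:
        # Report which configured key was absent from the row.
        raise KeyError(f"row is missing required key {err.args[0]!r}") from err
```

A dataset whose columns are named differently would pass its own names, for example `to_binary_preference(row, prompt_key="question", chosen_key="good", rejected_key="bad")`.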

Test Result

Ran successfully with the configs below, as well as some others:

"examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml"
"examples/configs/recipes/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.yaml"
"examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml"
"examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml"

[Figure: convergence test with llama8b]

[Figure: sft-llama3.2-1b-1n8g-fsdp2tp1.v3]

[Figure: dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2]

[Figure: grpo-deepscaler-1.5b-8K]

[Figure: vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1]

Summary by CodeRabbit

New Features
- Unified dataset loaders (response, preference, eval); new ResponseDataset and BinaryPreferenceDataset; HelpSteer3 and Tulu3Preference added as built-in defaults with on‑the‑fly HuggingFace downloads and multi‑validation support.
Documentation
- Expanded guides and example configs with concrete dataset blocks, keys (paths/input/output/chosen/rejected) and split options; remote CSV eval guidance.
Refactor
- Reorganized data package, centralized dataset utilities/exports, consolidated AIME variants into a single AIMEDataset; moved dataset helpers to a shared utils module.
Tests
- Added loader tests; removed/skipped flaky remote-download tests.

parthchadha

LGTM! @terrykong can you review as well

yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 25, 2025

github-actions Bot added the Documentation Improvements or additions to documentation label Aug 25, 2025

yuki-97 force-pushed the yukih/refactor-dataset branch from c8aad08 to a1ef1d2 Compare August 26, 2025 03:00

yuki-97 marked this pull request as ready for review August 26, 2025 09:55

yuki-97 requested review from ashors1, parthchadha and terrykong August 26, 2025 09:55

parthchadha previously approved these changes Aug 26, 2025

yuki-97 dismissed parthchadha’s stale review via eea8706 September 2, 2025 13:14

yuki-97 force-pushed the yukih/refactor-dataset branch from 7ea9787 to eea8706 Compare September 2, 2025 13:14

yuki-97 requested a review from jveronvialard September 2, 2025 13:15

jveronvialard requested changes Sep 2, 2025

yuki-97 force-pushed the yukih/refactor-dataset branch 2 times, most recently from 2a13777 to a495eaa Compare September 3, 2025 07:11

terrykong added the r0.4.0 label Sep 17, 2025

auto-merge was automatically disabled September 17, 2025 16:45
Pull Request is not mergeable

chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025