Skip to content

refactor: refactor dataset module#977

Merged
chtruong814 merged 34 commits intomainfrom
yukih/refactor-dataset
Sep 18, 2025
Merged

refactor: refactor dataset module#977
chtruong814 merged 34 commits intomainfrom
yukih/refactor-dataset

Conversation

@yuki-97
Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 commented Aug 25, 2025

What does this PR do ?

  1. Refactor and add load_response_dataset, load_preference_dataset APIs.
  2. Support load dataset from local or Hugging Face:
    1. ResponseDataset for SFT and RL.
    2. BinaryPreferenceDataset and PreferenceDataset for DPO and RM.
  3. Refactor structure to:
    ├── __init__.py
    ├── processed_dataset.py
    ├── utils.py
    ├── eval_datasets
    │   ├── __init__.py
    │   ├── local_math_dataset.py
    │   └── xxx_dataset.py
    ├── preference_datasets
    │   ├── __init__.py
    │   ├── binary_preference_dataset.py
    │   ├── preference_dataset.py
    │   └── xxx_dataset.py
    └── response_datasets
        ├── __init__.py
        ├── response_dataset.py
        └── xxx_dataset.py
    
  4. Remove data_cls which is duplicated with dataset_name. The logic will become below after this PR:
    1. If dataset_name is a built-in dataset, we'll use that dataset class as before.
    2. If dataset_name is in ["ResponseDataset", "BinaryPreferenceDataset", "PreferenceDataset"], we'll load data from local or Hugging Face through train_data_path and val_data_path.

Issues

Related #909.

Usage

  1. For datasets supported by default (e.g. squad, open_assistant, HelpSteer3, etc.), the usage is the same as before:
    data:
      dataset_name: "squad"
      ...
  2. For prompt-response data for SFT and RL, we'll use ResponseDataset:
    data:
      dataset_name: ResponseDataset
      train_data_path: <PathToTrainingDataset>
      val_data_path: <PathToValidationDataset>
      input_key: <QuestionKey>, default is "input"
      output_key: <AnswerKey>, default is "output"
      train_split: <TrainSplit>, used for HuggingFace datasets, default is None
      val_split: <ValSplit>, used for HuggingFace datasets, default is None
      ...
  3. For preference data for DPO and RM, we'll use PreferenceDataset:
    data:
      dataset_name: PreferenceDataset
      train_data_path: <PathToTrainingDataset>
      val_data_paths:
        <NameOfValidationDataset1>: <PathToValidationDataset1>
        ...
      train_split: <TrainSplit>, used for HuggingFace datasets, default is None
      val_split: <ValSplit>, used for HuggingFace datasets, default is None
      ...
  4. For binary preference data for DPO and RM, we'll use BinaryPreferenceDataset:
    data:
      dataset_name: BinaryPreferenceDataset
      train_data_path: <PathToTrainingDataset>
      val_data_path: <PathToValidationDataset>
      prompt_key: <PromptKey>, default is "prompt"
      chosen_key: <ChosenKey>, default is "chosen"
      rejected_key: <RejectedKey>, default is "rejected"
      train_split: <TrainSplit>, used for HuggingFace datasets, default is None
      val_split: <ValSplit>, used for HuggingFace datasets, default is None
      ...

Test Result

Run well on below configs and some other configs.

"examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml"
"examples/configs/recipes/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.yaml"
"examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml"
"examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml"
convergence test with llama8b image
sft-llama3.2-1b-1n8g-fsdp2tp1.v3 image
dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2 image
grpo-deepscaler-1.5b-8K image
vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1 image

Summary by CodeRabbit

  • New Features

    • Unified dataset loaders (response, preference, eval); new ResponseDataset and BinaryPreferenceDataset; HelpSteer3 and Tulu3Preference added as built-in defaults with on‑the‑fly HuggingFace downloads and multi‑validation support.
  • Documentation

    • Expanded guides and example configs with concrete dataset blocks, keys (paths/input/output/chosen/rejected) and split options; remote CSV eval guidance.
  • Refactor

    • Reorganized data package, centralized dataset utilities/exports, consolidated AIME variants into a single AIMEDataset; moved dataset helpers to a shared utils module.
  • Tests

    • Added loader tests; removed/skipped flaky remote-download tests.

@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 25, 2025
@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label Aug 25, 2025
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 25, 2025
@yuki-97 yuki-97 force-pushed the yukih/refactor-dataset branch from c8aad08 to a1ef1d2 Compare August 26, 2025 03:00
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 26, 2025
@yuki-97 yuki-97 marked this pull request as ready for review August 26, 2025 09:55
parthchadha
parthchadha previously approved these changes Aug 26, 2025
Copy link
Copy Markdown
Contributor

@parthchadha parthchadha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @terrykong can you review as well

@yuki-97 yuki-97 force-pushed the yukih/refactor-dataset branch from 7ea9787 to eea8706 Compare September 2, 2025 13:14
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 2, 2025
Comment thread docs/guides/dpo.md Outdated
Comment thread docs/guides/dpo.md
Comment thread docs/guides/eval.md Outdated
Comment thread docs/guides/dpo.md
Comment thread docs/guides/rm.md Outdated
Comment thread examples/run_dpo.py
Comment thread examples/run_rm.py
Comment thread nemo_rl/data/datasets/preference_datasets/__init__.py Outdated
Comment thread nemo_rl/data/datasets/preference_datasets/preference_dataset.py Outdated
Comment thread nemo_rl/data/datasets/preference_datasets/preference_dataset.py
@yuki-97 yuki-97 force-pushed the yukih/refactor-dataset branch 2 times, most recently from 2a13777 to a495eaa Compare September 3, 2025 07:11
auto-merge was automatically disabled September 17, 2025 16:45

Pull Request is not mergeable

@chtruong814 chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to a manual request Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 18, 2025
@terrykong terrykong enabled auto-merge (squash) September 18, 2025 18:14
This was referenced Feb 4, 2026
@coderabbitai coderabbitai Bot mentioned this pull request Feb 18, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests Documentation Improvements or additions to documentation r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants