
Fix FSDP2 strategy related bugs #15

Merged
akoumpa merged 5 commits into main from boxiangw/fsdp2-change on May 29, 2025

Conversation

@BoxiangW (Contributor) commented:

No description provided.

@BoxiangW BoxiangW requested review from akoumpa and Copilot May 29, 2025 04:25
@BoxiangW BoxiangW self-assigned this May 29, 2025
copy-pr-bot (Bot) commented May 29, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot AI left a comment:

Pull Request Overview

This PR fixes known FSDP2 strategy issues by updating distributed configuration settings, refining checkpoint scheduling, and updating dataset loading parameters. Key changes include:

  • Adding new distributed configuration options (dp_size, tp_size, cp_size, sequence_parallel) and dataset parameter (num_samples_limit) in the YAML recipe.
  • Updating checkpoint scheduling logic to skip checkpointing on step 0 (see the sketch after this list).
  • Refactoring FSDP2 manager imports and type annotations, and adding a trust_remote_code parameter to the HellaSwag dataset.
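
A minimal sketch of the step-0 guard described in the second bullet, assuming a hypothetical `StepScheduler` with `step` and `ckpt_every` attributes (names are illustrative, not the repository's actual API):

```python
class StepScheduler:
    """Illustrative scheduler; only the checkpoint gating logic is shown."""

    def __init__(self, ckpt_every: int):
        self.ckpt_every = ckpt_every  # checkpoint frequency, in steps
        self.step = 0                 # current global step

    @property
    def is_ckpt_step(self) -> bool:
        # 0 % ckpt_every == 0, so without the `self.step > 0` guard a
        # checkpoint would fire on step 0, before any training has happened.
        return self.step > 0 and self.step % self.ckpt_every == 0
```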

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| recipes/llama_3_2_1b_hellaswag.yaml | Updated distributed and dataset configurations to support FSDP2 fixes. |
| automodel/training/step_scheduler.py | Modified checkpoint condition to avoid triggering on step 0. |
| automodel/training/base_recipe.py | Added directory creation for checkpoint saving. |
| automodel/distributed/parallelizer.py | Removed safe_import_from wrappers in favor of direct imports. |
| automodel/distributed/fsdp2.py | Updated type annotations and default tensor parallel sharding plan. |
| automodel/datasets/utils.py | Minor punctuation fixes in comments. |
| automodel/datasets/llm/hellaswag.py | Updated dataset initialization to include trust_remote_code. |
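
Two of the smaller fixes in the table can be illustrated in a few lines. This is a hedged sketch, not the repository's exact code; the dataset split and checkpoint path are assumptions:

```python
from pathlib import Path

from datasets import load_dataset

# HellaSwag is distributed with a loading script, so recent `datasets`
# releases require an explicit opt-in before running it -- hence the
# trust_remote_code change in hellaswag.py.
dataset = load_dataset("hellaswag", split="train", trust_remote_code=True)

# Create the checkpoint directory up front so the first save cannot fail
# on a missing path -- the base_recipe.py change.
ckpt_dir = Path("checkpoints/llama_3_2_1b_hellaswag")  # illustrative path
ckpt_dir.mkdir(parents=True, exist_ok=True)
```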

BoxiangW and others added 5 commits May 29, 2025 05:00
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the boxiangw/fsdp2-change branch from 07e47d1 to a43ebde on May 29, 2025 12:00
@akoumpa (Contributor) left a comment:

Thanks a lot @BoxiangW,

I fixed one minor issue with the linter and DCO'd the commits.

One thing we may want to do in the future is have a function provide the default TP plan(s); for now, though, I think this is mergeable and we should proceed -- a note for us to address in upcoming PRs.

Thank you.
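
A hedged sketch of the follow-up suggested above: a helper that returns a default tensor-parallel plan. The function name is illustrative and the submodule names assume a Llama-style Hugging Face decoder block; neither is confirmed by this PR:

```python
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def default_tp_plan() -> dict:
    """Return a default TP sharding plan for one Llama-style decoder block.

    Keys are submodule names as consumed by
    torch.distributed.tensor.parallel.parallelize_module.
    """
    return {
        # Attention: shard input projections column-wise, output row-wise.
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        # MLP: gate/up column-wise, down row-wise.
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
```

Applying it per layer with `parallelize_module(layer, tp_mesh, default_tp_plan())` would keep the plan in one place instead of inlining it in fsdp2.py.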

@akoumpa akoumpa merged commit d283fe3 into main May 29, 2025
3 checks passed
@ko3n1g ko3n1g deleted the boxiangw/fsdp2-change branch June 16, 2025 15:24
