
Fix FSDP2 strategy related bugs #15

Merged
akoumpa merged 5 commits into main from boxiangw/fsdp2-change on May 29, 2025

Conversation

@BoxiangW (Contributor) commented:

No description provided.

@BoxiangW BoxiangW requested review from akoumpa and Copilot May 29, 2025 04:25
@BoxiangW BoxiangW self-assigned this May 29, 2025
copy-pr-bot (Bot) commented May 29, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot AI left a comment:

Pull Request Overview

This PR fixes known FSDP2 strategy issues by updating distributed configuration settings, refining checkpoint scheduling, and updating dataset loading parameters. Key changes include:

  • Adding new distributed configuration options (dp_size, tp_size, cp_size, sequence_parallel) and dataset parameter (num_samples_limit) in the YAML recipe.
  • Updating checkpoint scheduling logic to skip checkpointing on step 0 (see the sketch after this list).
  • Refactoring FSDP2 manager imports and type annotations, and adding a trust_remote_code parameter to the HellaSwag dataset.
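
A minimal sketch of the step-0 guard described in the second bullet, assuming a hypothetical `StepScheduler` with `step` and `ckpt_every` attributes (names are illustrative, not the repository's actual API):

```python
class StepScheduler:
    """Illustrative scheduler; only the checkpoint gating logic is shown."""

    def __init__(self, ckpt_every: int):
        self.ckpt_every = ckpt_every  # checkpoint frequency, in steps
        self.step = 0                 # current global step

    @property
    def is_ckpt_step(self) -> bool:
        # 0 % ckpt_every == 0, so without the `self.step > 0` guard a
        # checkpoint would fire on step 0, before any training has happened.
        return self.step > 0 and self.step % self.ckpt_every == 0
```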

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| recipes/llama_3_2_1b_hellaswag.yaml | Updated distributed and dataset configurations to support FSDP2 fixes. |
| automodel/training/step_scheduler.py | Modified checkpoint condition to avoid triggering on step 0. |
| automodel/training/base_recipe.py | Added directory creation for checkpoint saving. |
| automodel/distributed/parallelizer.py | Removed safe_import_from wrappers in favor of direct imports. |
| automodel/distributed/fsdp2.py | Updated type annotations and default tensor parallel sharding plan. |
| automodel/datasets/utils.py | Minor punctuation fixes in comments. |
| automodel/datasets/llm/hellaswag.py | Updated dataset initialization to include trust_remote_code. |
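
Two of the smaller fixes in the table can be illustrated in a few lines. This is a hedged sketch, not the repository's exact code; the dataset split and checkpoint path are assumptions:

```python
from pathlib import Path

from datasets import load_dataset

# HellaSwag is distributed with a loading script, so recent `datasets`
# releases require an explicit opt-in before running it -- hence the
# trust_remote_code change in hellaswag.py.
dataset = load_dataset("hellaswag", split="train", trust_remote_code=True)

# Create the checkpoint directory up front so the first save cannot fail
# on a missing path -- the base_recipe.py change.
ckpt_dir = Path("checkpoints/llama_3_2_1b_hellaswag")  # illustrative path
ckpt_dir.mkdir(parents=True, exist_ok=True)
```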

BoxiangW and others added 5 commits May 29, 2025 05:00
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the boxiangw/fsdp2-change branch from 07e47d1 to a43ebde on May 29, 2025 12:00
@akoumpa (Contributor) left a comment:

Thanks a lot @BoxiangW,

I fixed one minor issue with the linter and DCO'd the commits.

One thing we may want to do in the future is have a function provide the default TP plan(s); for now, though, I think this is mergeable and we should proceed -- a note for us to address in upcoming PRs.

Thank you.
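
A hedged sketch of the follow-up suggested above: a helper that returns a default tensor-parallel plan. The function name is illustrative and the submodule names assume a Llama-style Hugging Face decoder block; neither is confirmed by this PR:

```python
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def default_tp_plan() -> dict:
    """Return a default TP sharding plan for one Llama-style decoder block.

    Keys are submodule names as consumed by
    torch.distributed.tensor.parallel.parallelize_module.
    """
    return {
        # Attention: shard input projections column-wise, output row-wise.
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        # MLP: gate/up column-wise, down row-wise.
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
```

Applying it per layer with `parallelize_module(layer, tp_mesh, default_tp_plan())` would keep the plan in one place instead of inlining it in fsdp2.py.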

@akoumpa akoumpa merged commit d283fe3 into main May 29, 2025
3 checks passed
@ko3n1g ko3n1g deleted the boxiangw/fsdp2-change branch June 16, 2025 15:24
