
Fix FSDP v1 bug: trainer incorrectly uses an unwrapped model#39617

Open
YanjunChen329 wants to merge 1 commit intohuggingface:mainfrom
YanjunChen329:patch-1

Conversation


@YanjunChen329 YanjunChen329 commented Jul 23, 2025

What does this PR do?

Fixes #39619

Currently, the HF `Trainer` is not backward compatible with FSDP v1.

The bug occurs at line 2378, on the code path where both `delay_optimizer_creation` and `use_accelerator_prepare` are true. This appears to happen only when FSDP v1 is used.

At line 2378, we set `self.model` to the FSDP-v1-wrapped model. However, at line 2403, we set `self.model` again to `model`, which is the unwrapped model instance initialized at line 2361.

As a result, the trainer uses the unwrapped model in the subsequent forward calls, which causes weight shape mismatches and uninitialized-parameter errors because FSDP's all-gather calls are never triggered.

This can be fixed by replacing `self.model` with `model`, which makes this code path consistent with the other code paths.

Example error:

```
[rank0]: RuntimeError: 'weight' must be 2-D
```
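To make the failure mode concrete, here is a minimal dependency-free sketch (no real FSDP or `Trainer`; all class and function names are hypothetical): a wrapper flattens the inner module's 2-D weight into a 1-D buffer, the way FSDP v1 keeps flattened parameters, and only restores the 2-D shape inside its own forward (mimicking the all-gather). Calling the unwrapped module directly, as the buggy trainer code path does, hits the 1-D weight and fails.

```python
class Linear:
    """Toy stand-in for a module whose forward requires a 2-D weight."""
    def __init__(self, rows, cols):
        self.weight = [[0.0] * cols for _ in range(rows)]  # 2-D weight

    def forward(self, x):
        # Same failure mode as in the PR when the weight has been flattened.
        if not (self.weight and isinstance(self.weight[0], list)):
            raise RuntimeError("'weight' must be 2-D")
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]


class FSDPLikeWrapper:
    """Hypothetical stand-in for FSDP v1: flattens the wrapped module's
    weight, and only restores the 2-D view inside its own forward
    (mimicking the all-gather)."""
    def __init__(self, module, rows, cols):
        self.module, self.rows, self.cols = module, rows, cols
        module.weight = [w for row in module.weight for w in row]  # now 1-D

    def forward(self, x):
        flat = self.module.weight
        # "All-gather": temporarily restore the 2-D view for the inner forward.
        self.module.weight = [
            flat[i * self.cols:(i + 1) * self.cols] for i in range(self.rows)
        ]
        try:
            return self.module.forward(x)
        finally:
            self.module.weight = flat  # re-flatten, as FSDP re-shards


inner = Linear(2, 3)
wrapped = FSDPLikeWrapper(inner, 2, 3)

out = wrapped.forward([1.0, 2.0, 3.0])  # OK: wrapper restores the 2-D weight
try:
    inner.forward([1.0, 2.0, 3.0])      # the trainer bug: unwrapped model
except RuntimeError as e:
    print(e)                             # → 'weight' must be 2-D
```

This is why the trainer must keep using the wrapped handle: only the wrapper's forward sets up the parameters before the inner computation runs.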

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zach-huggingface, @SunMarc and @qgallouedec

