Fix FSDP v1 bug: trainer incorrectly uses an unwrapped model #39617
Open
YanjunChen329 wants to merge 1 commit into huggingface:main from
What does this PR do?
Fixes #39619
Currently, the HF Trainer is not backward compatible with FSDP v1.

The bug is at line 2378, on a code path where both `delay_optimizer_creation` and `use_accelerator_prepare` are true. This seems to be the case only when FSDP v1 is used.

At line 2378, we set `self.model` to the FSDP v1-wrapped model. However, at line 2403, we set `self.model` again to `model`, which is the unwrapped model instance initialized at line 2361.

As a result, the trainer uses the unwrapped model in the subsequent forward calls, which causes weight shape mismatch and uninitialization errors because the all-gather calls in FSDP are not properly triggered.

This error can be fixed by replacing `self.model` with `model`, which makes it consistent with the other code paths.

Example error:
[rank0]: RuntimeError: 'weight' must be 2-D
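
The reassignment pattern described above can be illustrated with a small toy sketch. It is not the actual Trainer code: `ToyModel`, `ToyWrapper`, `ToyTrainer`, `setup_buggy`, and `setup_fixed` are hypothetical stand-ins, and `ToyWrapper` only mimics the idea of a wrapped model, not real FSDP behavior. It also assumes the fix is applied to the assignment that stores the wrapped model, which is one reading of the description.

```python
class ToyModel:
    """Hypothetical stand-in for the unwrapped model."""


class ToyWrapper:
    """Hypothetical stand-in for an FSDP-style wrapper (not the real FSDP class)."""

    def __init__(self, module):
        self.module = module


class ToyTrainer:
    """Hypothetical trainer that reproduces only the variable flow, nothing else."""

    def __init__(self, model):
        self.model = model

    def setup_buggy(self):
        model = self.model                   # local reference to the unwrapped model
        self.model = ToyWrapper(self.model)  # wrapper stored on the attribute only
        # ... later in the same function ...
        self.model = model                   # overwrites the wrapper with the unwrapped model
        return self.model

    def setup_fixed(self):
        model = self.model
        model = ToyWrapper(self.model)       # wrapper assigned to the local variable instead
        # ... later in the same function ...
        self.model = model                   # the attribute now holds the wrapper
        return self.model


buggy_result = ToyTrainer(ToyModel()).setup_buggy()
fixed_result = ToyTrainer(ToyModel()).setup_fixed()
print(type(buggy_result).__name__)
print(type(fixed_result).__name__)
```

In the buggy flow the later `self.model = model` silently discards the wrapper, which matches the symptom of forward calls running on the unwrapped model.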
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@zach-huggingface, @SunMarc and @qgallouedec