
fix bug using FSDP V1 will lead to model device not properly set #39177

Merged
ArthurZucker merged 5 commits into huggingface:main from kaixuanliu:fsdp-bugfix on Jul 7, 2025
Conversation

@kaixuanliu (Contributor)
Since PR #36132, when we use FSDP, the trainer no longer uses the accelerator to prepare the model, which leads to the model weights not being loaded onto the right GPU device. One way to reproduce the bug is the peft library's SFT example: running a command like

accelerate launch --config_file "fsdp_config.yaml" train.py --seed 100 --model_name_or_path "meta-llama/Llama-2-7b-chat-hf" --dataset_name "smangrul/ultrachat-10k-chatml" --chat_template_format "chatml" --add_special_tokens False --append_concat_token False --splits "train,test" --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level "info" --logging_strategy "steps" --eval_strategy "epoch" --save_strategy "epoch" --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type "cosine" --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir "llama-sft-lora-fsdp" --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field "content" --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" --use_4bit_quantization False

crashes with the following error:

[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py", line 943, in wrapper
[rank1]:     output = func(self, *args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
[rank1]:     inputs_embeds = self.embed_tokens(input_ids)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/sparse.py", line 192, in forward
[rank1]:     return F.embedding(
[rank1]:            ^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 2546, in embedding
[rank1]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.

Using transformers 4.52.4 does not run into this issue.
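
To make the failure mode concrete, here is a small, hypothetical sanity check (not part of the repro script above and not transformers code) that can be run on the wrapped model before training: every parameter should have real storage on the local CUDA device. With the bug, the weights are never prepared by accelerate, so a check like this fails in the same spirit as the embedding forward in the trace.

```python
import torch

# Hypothetical helper for illustration only: verify that every parameter of the
# (already wrapped) model has real storage on the expected CUDA device.
def assert_params_materialized(model: torch.nn.Module, device_index: int) -> None:
    expected = torch.device("cuda", device_index)
    for name, param in model.named_parameters():
        if param.is_meta:
            raise RuntimeError(f"{name} is still on the meta device")
        if param.device != expected:
            raise RuntimeError(f"{name} is on {param.device}, expected {expected}")
```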

@kaixuanliu (Contributor, Author)

@SunMarc @ArthurZucker please help review, thanks!

@kaixuanliu (Contributor, Author)

As @kmehant mentioned in #39152, this approach seems better because it does not disrupt the design in #36132. @SunMarc @ArthurZucker WDYT?

Comment on lines -2359 to +2360
if delay_optimizer_creation:
    model = self.accelerator.prepare(self.model)
    if self.is_tp_enabled:
        self.optimizer = self.accelerator.prepare(self.optimizer)
Member
Thanks! Can you add a comment explaining why only the optimizer is prepared here?

@kaixuanliu (Contributor, Author)
You can refer to this: #36132

Member
I mean as a comment in the code 😅 so that no one would change it mistakenly later

@kmehant (Contributor) commented on Jul 4, 2025
My two cents. We can add this comment. Thanks

We should avoid having accelerate prepare the model in the TP case: it is not needed because placement is already handled by transformers' from_pretrained, and preparing it would also route the model into DDP-based preparation.
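
To make that intent concrete, here is a minimal, self-contained sketch (hypothetical function name, not the actual Trainer source) of the control flow being discussed: under tensor parallelism only the optimizer goes through accelerator.prepare(), because from_pretrained has already placed and sharded the model and preparing it again would push it into DDP-style wrapping; in every other case, including FSDP, the model itself must still be prepared so its weights land on the right device.

```python
# Rough sketch only -- a hypothetical helper, not the transformers Trainer code.
def prepare_model_and_optimizer(accelerator, model, optimizer, is_tp_enabled: bool):
    if is_tp_enabled:
        # TP: model placement/sharding was already done by from_pretrained,
        # so only the optimizer is prepared here.
        optimizer = accelerator.prepare(optimizer)
    else:
        # FSDP and other setups: the model must still be prepared so that it is
        # wrapped and its weights end up on the correct device.
        model = accelerator.prepare(model)
    return model, optimizer
```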

@SunMarc (Member) left a comment
Thanks! Please add a comment to it as suggested by @IlyasMoutawwakil! Also, could you update delay_optimizer_creation to remove is_tp_enabled?

@kmehant (Contributor) commented on Jul 4, 2025

#39177 (review)

+1

@kaixuanliu (Contributor, Author)

Thanks for the advice from @IlyasMoutawwakil and @kmehant; I have updated the code. @SunMarc, please help review again.

@kmehant (Contributor) left a comment
Works as expected on my end for TP trainings. Thanks

@ArthurZucker (Collaborator) left a comment

Thanks, this particular piece of code seems to be the source of many issues 😅

@ArthurZucker merged commit 4243bb8 into huggingface:main on Jul 7, 2025 (23 checks passed).
@SunMarc (Member) left a comment
Thanks!

@pengzhenghao-nv

Thanks! I spent a whole day debugging why the FSDP root model is never called, and it turns out that this PR solves the bug.

I wonder why this hasn't been noticed before.

SwiftAkira pushed a commit to SwiftAkira/transformers that referenced this pull request on Jul 11, 2025.
rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request on Jul 17, 2025.
zaristei pushed a commit to zaristei/transformers that referenced this pull request on Sep 9, 2025.
@kaixuanliu deleted the fsdp-bugfix branch on April 13, 2026 at 02:41.