
fix bug using FSDP V1 will lead to model device not properly set #39177

Merged
ArthurZucker merged 5 commits into huggingface:main from kaixuanliu:fsdp-bugfix on Jul 7, 2025
Conversation

@kaixuanliu (Contributor)
Since PR #36132, when we use FSDP, the trainer no longer uses the accelerator to prepare the model, which leads to the model weights not being loaded onto the right GPU device. One way to reproduce the bug is the peft library's SFT example: running a command like

accelerate launch --config_file "fsdp_config.yaml" train.py --seed 100 --model_name_or_path "meta-llama/Llama-2-7b-chat-hf" --dataset_name "smangrul/ultrachat-10k-chatml" --chat_template_format "chatml" --add_special_tokens False --append_concat_token False --splits "train,test" --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level "info" --logging_strategy "steps" --eval_strategy "epoch" --save_strategy "epoch" --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type "cosine" --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir "llama-sft-lora-fsdp" --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field "content" --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" --use_4bit_quantization False

crashes with the following error:

[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py", line 943, in wrapper
[rank1]:     output = func(self, *args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
[rank1]:     inputs_embeds = self.embed_tokens(input_ids)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/sparse.py", line 192, in forward
[rank1]:     return F.embedding(
[rank1]:            ^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 2546, in embedding
[rank1]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.

Using transformers 4.52.4 does not run into this issue.
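
To make the failure mode concrete, here is a small, hypothetical sanity check (not part of the repro script above and not transformers code) that can be run on the wrapped model before training: every parameter should have real storage on the local CUDA device. With the bug, the weights are never prepared by accelerate, so a check like this fails in the same spirit as the embedding forward in the trace.

```python
import torch

# Hypothetical helper for illustration only: verify that every parameter of the
# (already wrapped) model has real storage on the expected CUDA device.
def assert_params_materialized(model: torch.nn.Module, device_index: int) -> None:
    expected = torch.device("cuda", device_index)
    for name, param in model.named_parameters():
        if param.is_meta:
            raise RuntimeError(f"{name} is still on the meta device")
        if param.device != expected:
            raise RuntimeError(f"{name} is on {param.device}, expected {expected}")
```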

@kaixuanliu (Contributor, Author)

@SunMarc @ArthurZucker please help review, thanks!

@kaixuanliu (Contributor, Author)

As @kmehant mentioned in #39152, this approach seems better because it does not disrupt the design in #36132. @SunMarc @ArthurZucker WDYT?

Comment on lines -2359 to +2360
if delay_optimizer_creation:
    model = self.accelerator.prepare(self.model)
    if self.is_tp_enabled:
        self.optimizer = self.accelerator.prepare(self.optimizer)
Member
Thanks! Can you add a comment explaining why only the optimizer is prepared here?

@kaixuanliu (Contributor, Author)
You can refer to this: #36132

Member
I mean as a comment in the code 😅 so that no one would change it mistakenly later

@kmehant (Contributor) commented on Jul 4, 2025
My two cents. We can add this comment. Thanks

We should avoid having accelerate prepare the model in the TP case: it is not needed because placement is already handled by transformers' from_pretrained, and preparing it would also route the model into DDP-based preparation.
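
To make that intent concrete, here is a minimal, self-contained sketch (hypothetical function name, not the actual Trainer source) of the control flow being discussed: under tensor parallelism only the optimizer goes through accelerator.prepare(), because from_pretrained has already placed and sharded the model and preparing it again would push it into DDP-style wrapping; in every other case, including FSDP, the model itself must still be prepared so its weights land on the right device.

```python
# Rough sketch only -- a hypothetical helper, not the transformers Trainer code.
def prepare_model_and_optimizer(accelerator, model, optimizer, is_tp_enabled: bool):
    if is_tp_enabled:
        # TP: model placement/sharding was already done by from_pretrained,
        # so only the optimizer is prepared here.
        optimizer = accelerator.prepare(optimizer)
    else:
        # FSDP and other setups: the model must still be prepared so that it is
        # wrapped and its weights end up on the correct device.
        model = accelerator.prepare(model)
    return model, optimizer
```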

@SunMarc (Member) left a comment
Thanks! Please add a comment to it as suggested by @IlyasMoutawwakil! Also, could you update delay_optimizer_creation to remove is_tp_enabled?

@kmehant (Contributor) commented on Jul 4, 2025

#39177 (review)

+1

@kaixuanliu (Contributor, Author)

Thanks for the advice from @IlyasMoutawwakil and @kmehant; I have updated the code. @SunMarc, please help review again.

@kmehant (Contributor) left a comment
Works as expected on my end for TP trainings. Thanks

@ArthurZucker (Collaborator) left a comment

Thanks, this particular piece of code seems to be the source of many issues 😅

@ArthurZucker merged commit 4243bb8 into huggingface:main on Jul 7, 2025 (23 checks passed).
@SunMarc (Member) left a comment
Thanks!

@pengzhenghao-nv

Thanks! I spent a whole day debugging why the FSDP root model is never called, and it turns out that this PR solves the bug.

I wonder why this hasn't been noticed before.

SwiftAkira pushed a commit to SwiftAkira/transformers that referenced this pull request on Jul 11, 2025.
rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request on Jul 17, 2025.
zaristei pushed a commit to zaristei/transformers that referenced this pull request on Sep 9, 2025.
@kaixuanliu deleted the fsdp-bugfix branch on April 13, 2026 at 02:41.