
fix train deepseek V4 with fsdp2: AttributeError: 'Tensor' object has no attribute 'device_mesh' #4023

Open
frozenleaves wants to merge 1 commit into huggingface:main from frozenleaves:main

Conversation

@frozenleaves

What does this PR do?

Fix a bug that occurs when training DeepSeek-V4 with FSDP2:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/frozen/LlamaFactory/src/train.py", line 28, in <module>
[rank0]:     main()
[rank0]:   File "/home/frozen/LlamaFactory/src/train.py", line 19, in main
[rank0]:     run_exp()
[rank0]:   File "/home/frozen/LlamaFactory/src/llamafactory/train/tuner.py", line 139, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/home/frozen/LlamaFactory/src/llamafactory/train/tuner.py", line 107, in _training_function
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/home/frozen/LlamaFactory/src/llamafactory/train/sft/workflow.py", line 140, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/frozen/transformers/src/transformers/trainer.py", line 1427, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/frozen/transformers/src/transformers/trainer.py", line 1466, in _inner_training_loop
[rank0]:     model, train_dataloader = self._prepare_for_training(max_steps, train_dataloader, resume_from_checkpoint)
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/frozen/transformers/src/transformers/trainer.py", line 1602, in _prepare_for_training
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/frozen/accelerate/src/accelerate/accelerator.py", line 1553, in prepare
[rank0]:     result = self._prepare_fsdp2(*args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/frozen/accelerate/src/accelerate/accelerator.py", line 1727, in _prepare_fsdp2
[rank0]:     model = fsdp2_prepare_model(self, model)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/frozen/accelerate/src/accelerate/utils/fsdp_utils.py", line 782, in fsdp2_prepare_model
[rank0]:     fsdp2_load_full_state_dict(
[rank0]:   File "/home/frozen/accelerate/src/accelerate/utils/fsdp_utils.py", line 521, in fsdp2_load_full_state_dict
[rank0]:     device_mesh = sharded_param.device_mesh
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'Tensor' object has no attribute 'device_mesh'

The fsdp2_load_full_state_dict function iterates over model.state_dict() and assumes that every item is a DTensor, directly accessing the .device_mesh attribute.

However, when using FSDP2 (via fully_shard), only model parameters are converted to DTensors, while persistent buffers remain standard torch.Tensors.

In the DeepSeek-V4 model, the MoE router registers persistent buffers (specifically bias and tid2eid). When fsdp2_load_full_state_dict iterates over these buffers, it raises AttributeError: 'Tensor' object has no attribute 'device_mesh' because they lack the DTensor-specific attributes, which aborts model loading.
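For illustration, this is roughly how such buffers arise; the Router module below is a hypothetical stand-in, not the actual DeepSeek-V4 code. After fully_shard(model), only the nn.Parameter weights become DTensors, while the registered buffers still appear in model.state_dict() as plain torch.Tensors:

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Hypothetical stand-in for the DeepSeek-V4 MoE router."""

    def __init__(self, hidden_size: int, n_experts: int):
        super().__init__()
        # nn.Parameter weights are converted to DTensors by fully_shard().
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        # Persistent buffers appear in state_dict() but are left as plain
        # torch.Tensors by fully_shard(), so they have no .device_mesh.
        self.register_buffer("bias", torch.zeros(n_experts), persistent=True)
        self.register_buffer("tid2eid", torch.arange(n_experts), persistent=True)
```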

This PR modifies the fsdp2_load_full_state_dict function in accelerate/utils/fsdp_utils.py.

In both the chief (primary) and non-chief process branches, an explicit type check has been added for items in the state_dict:

  1. If an item is a DTensor, the original loading and attribute-access logic is retained.
  2. If an item is not a DTensor (i.e., a regular Tensor such as a persistent buffer), the DTensor-specific attribute accesses are bypassed; the tensor is broadcast directly and kept as a standard Tensor (see the sketch after this list).
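A minimal sketch of the added guard, not the exact patch: the helper name _load_param and the explicit torch.distributed broadcast are assumptions made here for illustration.

```python
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor, distribute_tensor

def _load_param(sharded_param: torch.Tensor, full_param: torch.Tensor) -> torch.Tensor:
    """Load one state_dict entry; `sharded_param` comes from the FSDP2-wrapped
    model, `full_param` from the full (rank-0) state dict."""
    if isinstance(sharded_param, DTensor):
        # Original logic: shard the full tensor onto the parameter's mesh.
        mesh = sharded_param.device_mesh
        full_param = full_param.detach().to(mesh.device_type)
        return distribute_tensor(full_param, mesh, sharded_param.placements)
    else:
        # New branch: a regular tensor (e.g. a persistent buffer) has no
        # device_mesh; broadcast it from rank 0 and keep it unsharded.
        full_param = full_param.detach().to(sharded_param.device)
        dist.broadcast(full_param, src=0)
        return full_param
```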

The test is based on huggingface/transformers#45643. The issue reproduces with both the main branch and the latest release of accelerate.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
