fix bug using FSDP V1 will lead to model device not properly set #39177
ArthurZucker merged 5 commits into huggingface:main
Conversation
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@SunMarc @ArthurZucker please help review, thanks!
As @kmehant mentioned in #39152, this approach seems better because it does not disrupt the design in #36132. @SunMarc @ArthurZucker WDYT?
```python
if delay_optimizer_creation:
    model = self.accelerator.prepare(self.model)
if self.is_tp_enabled:
    self.optimizer = self.accelerator.prepare(self.optimizer)
```
Thanks! Can you add a comment explaining why only the optimizer is prepared here?
I mean as a comment in the code 😅 so that no one changes it by mistake later.
My two cents: we can add this comment. Thanks.
We should avoid having accelerate prepare the model in the TP case, since we don't need it; the model is already handled by transformers' from_pretrained, and it would otherwise go into DDP-based preparation.
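To illustrate the rule being discussed, here is a minimal, self-contained sketch. This is not the actual Trainer code — `DummyAccelerator`, `Obj`, and `prepare_for_training` are hypothetical names used only for illustration; the real logic lives in `transformers` and uses `accelerate.Accelerator.prepare`:

```python
class DummyAccelerator:
    """Hypothetical stand-in for accelerate.Accelerator; prepare() just tags objects."""

    def prepare(self, obj):
        obj.prepared = True
        return obj


class Obj:
    """Hypothetical stand-in for a model or optimizer."""

    prepared = False


def prepare_for_training(accelerator, model, optimizer, is_tp_enabled):
    """Sketch of the preparation rule from the discussion above."""
    if is_tp_enabled:
        # TP case: the model was already sharded by transformers' from_pretrained,
        # and preparing it again would route it into DDP-based preparation.
        # Only the optimizer goes through accelerator.prepare().
        optimizer = accelerator.prepare(optimizer)
    else:
        # FSDP (and other cases): the model itself must be prepared so that
        # its weights are wrapped and moved to the right device.
        model = accelerator.prepare(model)
    return model, optimizer
```

Under TP only the optimizer comes back prepared; otherwise only the model does — which is exactly the invariant the requested in-code comment is meant to protect.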
SunMarc
left a comment
Thanks! Please add a comment to it as suggested by @IlyasMoutawwakil! Also, could you update delay_optimizer_creation to remove is_tp_enabled?
Thanks for the advice, @IlyasMoutawwakil and @kmehant; I have updated the code. @SunMarc, please help review again.
kmehant
left a comment
Works as expected on my end for TP trainings. Thanks
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
ArthurZucker
left a comment
Thanks, this particular piece of code seems to be the source of many issues 😅
Thanks! I spent a whole day debugging why the FSDP root model is never called, and it turns out that this PR solves the bug... I wonder why this hasn't been noticed before.
…gingface#39177)
* fix bug using FSDP V1 will lead to model device not properly set
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* update the code
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
---------
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
In PR #36132, when we use FSDP, the accelerator is no longer used to prepare the model, which leads to the model weights not being loaded onto the right GPU device. One example to reproduce the bug, using the peft library's SFT example, is to run a command like:

```shell
accelerate launch --config_file "fsdp_config.yaml" train.py \
    --seed 100 \
    --model_name_or_path "meta-llama/Llama-2-7b-chat-hf" \
    --dataset_name "smangrul/ultrachat-10k-chatml" \
    --chat_template_format "chatml" \
    --add_special_tokens False \
    --append_concat_token False \
    --splits "train,test" \
    --max_seq_len 2048 \
    --num_train_epochs 1 \
    --logging_steps 5 \
    --log_level "info" \
    --logging_strategy "steps" \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --bf16 True \
    --packing True \
    --learning_rate 1e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --max_grad_norm 1.0 \
    --output_dir "llama-sft-lora-fsdp" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing True \
    --use_reentrant False \
    --dataset_text_field "content" \
    --use_flash_attn False \
    --use_peft_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" \
    --use_4bit_quantization False
```

It crashes with an error, while transformers 4.52.4 does not run into this issue.
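A small diagnostic sketch for the symptom described above: after preparation, every model parameter should sit on the process's local GPU (e.g. `cuda:3` for local rank 3), whereas with the bug some weights are left on the wrong device. The helper below is hypothetical (not part of transformers or accelerate); it operates on a plain name-to-device mapping such as `{n: str(p.device) for n, p in model.named_parameters()}` so it can be checked without a GPU:

```python
def misplaced_params(param_devices, expected_device):
    """Return names of parameters that are not on the expected device.

    param_devices: mapping of parameter name -> device string, e.g. built
    from model.named_parameters() after Trainer has prepared the model.
    expected_device: device string for this rank, e.g. "cuda:3".
    """
    return [name for name, dev in param_devices.items() if dev != expected_device]
```

With the bug present, a call like `misplaced_params(devices, "cuda:3")` would report the weights still stuck on `cpu` (or on another rank's GPU); after this PR the list should be empty.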