🐛 Describe the bug
When I run this script:
torchrun --standalone --nproc_per_node=1 train_sft.py \
--pretrain "/root/zl/download/pretrained/llama_7b_hf/" \
--model 'llama' \
--strategy colossalai_zero2 \
--log_interval 10 \
--save_path trained_models/Coati-7B \
--dataset /root/zl/code/InstructionWild/data/instinwild_en.json \
--batch_size 4 \
--accimulation_steps 1 \
--lr 2e-5 \
--max_epochs 3 \
--lora_rank 8 \
--max_datasets_size 2
I get the following error during training (the failure is raised in the optimizer step):
Traceback (most recent call last):
File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 184, in <module>
train(args)
File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 155, in train
trainer.fit(logger=logger, log_interval=args.log_interval)
File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/sft.py", line 110, in fit
self.strategy.optimizer_step(self.optimizer)
File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
optimizer.step()
File "/opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([20277248]) vs torch.Size([288768])
However, when I change --strategy to ddp, it trains normally.
So is there a bug in the colossalai_zero2 implementation, or is it incompatible with LoRA?
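In case it helps, here is a minimal plain-PyTorch sketch of what I suspect is going on. This is not the Coati code; the LoRALinear class and the layer sizes are made up for illustration. With LoRA only the adapter weights require gradients while the base weights stay frozen, so the flattened gradient covers far fewer elements than the flattened fp32 parameter group, which looks like the same kind of shape mismatch the assertion reports.

```python
# Hypothetical illustration only -- not the Coati/ColossalAI implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: base weight frozen, only A/B adapters trainable."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # frozen base weight
        self.lora_a = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # base output plus low-rank update
        return self.base(x) + x @ self.lora_a.t() @ self.lora_b.t()

layer = LoRALinear(4096, 4096, rank=8)
total_numel = sum(p.numel() for p in layer.parameters())
trainable_numel = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# 16842752 vs 65536 -- the flattened fp32 params and the flattened grads
# would differ in size by roughly this ratio, similar to the assertion above.
print(total_numel, trainable_numel)
```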
Environment
I am using the Docker image hpcaitech/colossalai:0.2.7.