[BUG]: Incompatible between colossalai_zero2 and LoRA tuning #3419

@zhangliang-04

Description

🐛 Describe the bug

When I run this script:

torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain "/root/zl/download/pretrained/llama_7b_hf/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path trained_models/Coati-7B \
    --dataset /root/zl/code/InstructionWild/data/instinwild_en.json \
    --batch_size 4 \
    --accimulation_steps 1 \
    --lr 2e-5 \
    --max_epochs 3 \
    --lora_rank 8 \
    --max_datasets_size 2

I got an error during backward:

Traceback (most recent call last):
  File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 184, in <module>
    train(args)
  File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 155, in train
    trainer.fit(logger=logger, log_interval=args.log_interval)
  File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/sft.py", line 110, in fit
    self.strategy.optimizer_step(self.optimizer)
  File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
    optimizer.step()
  File "/opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
    assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([20277248]) vs torch.Size([288768])

However, when I change the --strategy to ddp, it trains normally.
So is there a bug in the colossalai_zero2 implementation, or is it incompatible with LoRA tuning?
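For context on the size mismatch in the assertion: with LoRA enabled, only the small low-rank adapter matrices require gradients, so the flattened gradient buffer is far smaller than a flat fp32 master copy built over all parameters. A minimal sketch (hypothetical layer shapes, not the actual llama-7B model, and not ColossalAI code) of how the flat sizes diverge:

```python
def flat_numel(shapes):
    """Total element count of a list of tensor shapes, flattened into one buffer."""
    total = 0
    for shape in shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total

# Hypothetical frozen full weight of one linear layer: (4096, 4096)
full_params = [(4096, 4096)]

# Rank-8 LoRA adapters for the same layer: A is (8, 4096), B is (4096, 8);
# only these are trainable, so only these produce gradients.
lora_params = [(8, 4096), (4096, 8)]

print(flat_numel(full_params))  # 16777216 elements in the frozen weight
print(flat_numel(lora_params))  # 65536 trainable elements with rank-8 LoRA
```

If the optimizer builds its flat fp32 parameter copy over all parameters but only receives gradients for the LoRA adapters (or vice versa), the two flat buffers will have different sizes, which matches the shape of the assertion failure above.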

Environment

I am using the Docker image hpcaitech/colossalai:0.2.7.
