
[BUG]: llama2 raises NotImplementedError when saving the learning-rate scheduler #4782

@tomyoung903

Description


🐛 Describe the bug

Training with ColossalAI/examples/language/llama2/pretrain.py runs fine, but saving the lr_scheduler state while saving a checkpoint raises an error:

Traceback (most recent call last):
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/examples/language/llama2/pretrain.py", line 320, in <module>
    main()
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/examples/language/llama2/pretrain.py", line 309, in main
    save(booster, model, optimizer, lr_scheduler, epoch, step + 1, args.batch_size, coordinator,
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/examples/language/llama2/pretrain.py", line 89, in save
    booster.save_lr_scheduler(lr_scheduler, os.path.join(save_dir, 'lr_scheduler'))
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/colossalai/booster/booster.py", line 293, in save_lr_scheduler
    self.checkpoint_io.save_lr_scheduler(lr_scheduler, checkpoint)
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/colossalai/booster/plugin/torch_ddp_plugin.py", line 50, in save_lr_scheduler
    super().save_lr_scheduler(lr_scheduler, checkpoint)
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/colossalai/checkpoint_io/checkpoint_io_base.py", line 321, in save_lr_scheduler
    torch.save(lr_scheduler.state_dict(), checkpoint)
  File "/scratch/users/nus/tomyoung/Colossal-llama2/ColossalAI/colossalai/nn/lr_scheduler/delayed.py", line 94, in state_dict
    raise NotImplementedError()
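The traceback shows that `booster.save_lr_scheduler` ends up calling `lr_scheduler.state_dict()`, and the scheduler in `colossalai/nn/lr_scheduler/delayed.py` raises `NotImplementedError` there. A minimal, self-contained sketch of the failure and one possible workaround follows; `DelayedScheduler` and `SavableDelayedScheduler` are hypothetical stand-ins written for illustration, not ColossalAI's actual classes:

```python
import io

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import _LRScheduler


class DelayedScheduler(_LRScheduler):
    """Hypothetical scheduler whose state_dict raises, reproducing the bug."""

    def __init__(self, optimizer, delay_epochs=2):
        self.delay_epochs = delay_epochs
        super().__init__(optimizer)

    def get_lr(self):
        # Hold the LR at zero for the first `delay_epochs` epochs.
        if self.last_epoch < self.delay_epochs:
            return [0.0 for _ in self.base_lrs]
        return list(self.base_lrs)

    def state_dict(self):
        raise NotImplementedError()  # same behavior as delayed.py line 94


class SavableDelayedScheduler(DelayedScheduler):
    """Workaround sketch: restore torch's default _LRScheduler serialization,
    which saves every attribute except the optimizer reference."""

    def state_dict(self):
        return {k: v for k, v in self.__dict__.items() if k != "optimizer"}


model = torch.nn.Linear(2, 2)
opt = SGD(model.parameters(), lr=0.1)

broken = DelayedScheduler(opt)
try:
    torch.save(broken.state_dict(), io.BytesIO())
except NotImplementedError:
    print("state_dict raised NotImplementedError")  # what the traceback shows

fixed = SavableDelayedScheduler(opt)
torch.save(fixed.state_dict(), io.BytesIO())  # now saves without error
```

The subclass simply mirrors what `torch.optim.lr_scheduler._LRScheduler.state_dict` does by default, so a proper fix inside ColossalAI could likely just delete the `NotImplementedError` override or implement `state_dict`/`load_state_dict` the same way.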


Environment

CUDA 11.6

torch.__version__
'2.0.0+cu117'
