🐛 Describe the bug

When I run pretraining for 1000 steps with the hybrid parallel plugin and the run tries to save the model, checkpoint saving fails with the traceback below. I didn't change any code in pretrain.py.
```
Saving checkpoint
Traceback (most recent call last):
  File "/mnt/hwfile/songyixin/ColossalAI/examples/language/llama2/pretrain.py", line 316, in <module>
    main()
  File "/mnt/hwfile/songyixin/ColossalAI/examples/language/llama2/pretrain.py", line 305, in main
    save(booster, model, optimizer, lr_scheduler, epoch, step + 1, args.batch_size, coordinator,
  File "/mnt/hwfile/songyixin/ColossalAI/examples/language/llama2/pretrain.py", line 87, in save
    booster.save_model(model, os.path.join(save_dir, 'model'), shard=True)
  File "/mnt/hwfile/songyixin/ColossalAI/colossalai/booster/booster.py", line 242, in save_model
    self.checkpoint_io.save_model(model,
  File "/mnt/hwfile/songyixin/ColossalAI/colossalai/checkpoint_io/checkpoint_io_base.py", line 140, in save_model
    self.save_sharded_model(model, checkpoint, gather_dtensor, prefix, size_per_shard, use_safetensors)
  File "/mnt/hwfile/songyixin/ColossalAI/colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py", line 231, in save_sharded_model
    total_size = save_state_dict_shards(sharded_state_dict=state_dict_shard,
  File "/mnt/hwfile/songyixin/ColossalAI/colossalai/checkpoint_io/utils.py", line 256, in save_state_dict_shards
    save_state_dict(shard, checkpoint_file_path, use_safetensors=use_safetensors)
  File "/mnt/hwfile/songyixin/ColossalAI/colossalai/checkpoint_io/utils.py", line 324, in save_state_dict
    torch.save(state_dict, checkpoint_file_path)
  File "/mnt/petrelfs/songyixin/miniconda3/envs/cai/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/mnt/petrelfs/songyixin/miniconda3/envs/cai/lib/python3.9/site-packages/torch/serialization.py", line 653, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
```
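For what it's worth, the error itself reproduces outside ColossalAI with a minimal script (my own sketch, distilled from the traceback, not code from the repo): torch.save pickles every value in the dict it is given, and a ProcessGroup handle is not picklable, so any state dict shard that still carries one fails exactly this way.

```python
import torch
import torch.distributed as dist

# Single-process setup just so we can create a real ProcessGroup object.
dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
pg = dist.new_group(ranks=[0])  # any ProcessGroup handle triggers the failure

# torch.save pickles every value; the tensor is fine, the ProcessGroup is not.
state_dict = {"weight": torch.zeros(2, 2), "process_group": pg}
torch.save(state_dict, "repro.pt")
# TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
```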
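As a temporary local workaround (a hedged sketch of my own, not an upstream fix; the helper name drop_unpicklable is hypothetical), I can get checkpoints to write by filtering entries that pickle refuses to serialize before the torch.save call in colossalai/checkpoint_io/utils.py, though this silently drops whatever object leaked into the shard:

```python
import pickle
import torch

def drop_unpicklable(state_dict: dict) -> dict:
    """Return a copy of state_dict without entries pickle cannot serialize.

    Diagnostic sketch only: pickling every value twice is wasteful, but it
    pinpoints which key carries the ProcessGroup handle.
    """
    clean = {}
    for key, value in state_dict.items():
        try:
            pickle.dumps(value)
            clean[key] = value
        except TypeError:
            print(f"skipping unpicklable entry: {key} ({type(value).__name__})")
    return clean

# e.g. replacing the failing call in save_state_dict:
# torch.save(drop_unpicklable(state_dict), checkpoint_file_path)
```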
Environment
No response