[2023-11-20 18:48:14] [I][unknown_node][s3.py:63:save_to_s3][s3] - rank-0 -> start to export ckpt, save_dir=/data/checkpoints/job_0_ckpt_100, file prefix=m_c-100
[2023-11-20 18:48:14] [I][unknown_node][s3.py:63:save_to_s3][s3] - rank-1 -> start to export ckpt, save_dir=/data/checkpoints/job_0_ckpt_100, file prefix=m_c-100
[2023-11-20 18:48:14] [I][unknown_node][s3.py:75:save_to_s3][s3] - rank-0 -> [start]saving model into: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:14] [I][unknown_node][s3.py:75:save_to_s3][s3] - rank-1 -> [start]saving model into: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:54] [I][unknown_node][s3.py:78:save_to_s3][s3] - rank-1 -> [done]model has been saved in: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:54] [I][unknown_node][s3.py:82:save_to_s3][s3] - rank-1 -> [start]saving optimizer into: /data/checkpoints/job_0_ckpt_100/optimizer
[2023-11-20 18:48:58] [I][unknown_node][s3.py:78:save_to_s3][s3] - rank-0 -> [done]model has been saved in: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:58] [I][unknown_node][s3.py:82:save_to_s3][s3] - rank-0 -> [start]saving optimizer into: /data/checkpoints/job_0_ckpt_100/optimizer
Traceback (most recent call last):
  File "pretrain.py", line 349, in <module>
    main(args, cfg)
  File "pretrain.py", line 286, in main
    save_to_s3(
  File "util/s3.py", line 84, in save_to_s3
    booster.save_optimizer(optimizer, p_optimizer, shard=True, size_per_shard=optimizer_shard_size)
  File "/home/.local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/home/.local/lib/python3.10/site-packages/colossalai/checkpoint_io/checkpoint_io_base.py", line 189, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/home/.local/lib/python3.10/site-packages/colossalai/booster/plugin/gemini_plugin.py", line 186, in save_sharded_optimizer
    total_size = save_state_dict_shards(
  File "/home/.local/lib/python3.10/site-packages/colossalai/checkpoint_io/utils.py", line 234, in save_state_dict_shards
    for idx, shard_pair in enumerate(sharded_state_dict):
  File "/home/.local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 792, in state_shard
    state = self.collect_states(param_id=param_id, only_rank_0=only_rank_0)
  File "/home/.local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 516, in collect_states
    dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size])
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2072, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size)
IndexError: list assignment index out of range
🐛 Describe the bug
With the Gemini plugin and tp=2, training runs fine, but saving the optimizer checkpoint fails with "IndexError: list assignment index out of range". The full traceback is shown above.
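For context, torch.distributed.all_gather_object writes one entry into object_list per rank in the process group, so this IndexError fires whenever object_list is shorter than the group's world size. Below is a minimal stand-in in plain Python (no distributed runtime; all_gather_object_sim is a hypothetical helper that only mimics the write loop in distributed_c10d.py), not the actual ColossalAI code path:

```python
def all_gather_object_sim(object_list, obj, group_size):
    """Mimic the write loop in torch.distributed.all_gather_object:
    object_list[i] = _tensor_to_object(...) for each rank i in the group."""
    gathered = [obj] * group_size  # stand-in for the real collective
    for i in range(group_size):
        object_list[i] = gathered[i]  # raises IndexError if object_list is too short

# Output list correctly sized for the group: one slot per rank.
ok = [None, None]
all_gather_object_sim(ok, "state", group_size=2)
print(ok)  # ['state', 'state']

# Output list sized for a smaller group (e.g. sized by the wrong process
# group's world size) reproduces the exact error message in the traceback.
too_small = [None]
try:
    all_gather_object_sim(too_small, "state", group_size=2)
except IndexError as e:
    print(e)  # list assignment index out of range
```

If collect_states sizes gathered_state_shards using a different process group's world size than the one all_gather_object runs over under tp=2, this mismatch would produce exactly the failure above.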
Environment
ColossalAI: main branch
PyTorch: 2.0.1