[2023-11-20 18:48:14] [I][unknown_node][s3.py:63:save_to_s3][s3] - rank-0 -> start to export ckpt, save_dir=/data/checkpoints/job_0_ckpt_100, file prefix=m_c-100
[2023-11-20 18:48:14] [I][unknown_node][s3.py:63:save_to_s3][s3] - rank-1 -> start to export ckpt, save_dir=/data/checkpoints/job_0_ckpt_100, file prefix=m_c-100
[2023-11-20 18:48:14] [I][unknown_node][s3.py:75:save_to_s3][s3] - rank-0 -> [start]saving model into: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:14] [I][unknown_node][s3.py:75:save_to_s3][s3] - rank-1 -> [start]saving model into: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:54] [I][unknown_node][s3.py:78:save_to_s3][s3] - rank-1 -> [done]model has been saved in: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:54] [I][unknown_node][s3.py:82:save_to_s3][s3] - rank-1 -> [start]saving optimizer into: /data/checkpoints/job_0_ckpt_100/optimizer
[2023-11-20 18:48:58] [I][unknown_node][s3.py:78:save_to_s3][s3] - rank-0 -> [done]model has been saved in: /data/checkpoints/job_0_ckpt_100/model
[2023-11-20 18:48:58] [I][unknown_node][s3.py:82:save_to_s3][s3] - rank-0 -> [start]saving optimizer into: /data/checkpoints/job_0_ckpt_100/optimizer
Traceback (most recent call last):
  File "pretrain.py", line 349, in <module>
    main(args, cfg)
  File "pretrain.py", line 286, in main
    save_to_s3(
  File "util/s3.py", line 84, in save_to_s3
    booster.save_optimizer(optimizer, p_optimizer, shard=True, size_per_shard=optimizer_shard_size)
  File "/home/.local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/home/.local/lib/python3.10/site-packages/colossalai/checkpoint_io/checkpoint_io_base.py", line 189, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/home/.local/lib/python3.10/site-packages/colossalai/booster/plugin/gemini_plugin.py", line 186, in save_sharded_optimizer
    total_size = save_state_dict_shards(
  File "/home/.local/lib/python3.10/site-packages/colossalai/checkpoint_io/utils.py", line 234, in save_state_dict_shards
    for idx, shard_pair in enumerate(sharded_state_dict):
  File "/home/.local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 792, in state_shard
    state = self.collect_states(param_id=param_id, only_rank_0=only_rank_0)
  File "/home/.local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 516, in collect_states
    dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size])
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2072, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size)
IndexError: list assignment index out of range
🐛 Describe the bug
With the Gemini plugin and tp=2, training runs fine, but saving the optimizer checkpoint fails with "IndexError: list assignment index out of range". The full traceback is shown above.
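For context, torch.distributed.all_gather_object writes one entry into object_list per rank in the process group, so this IndexError fires whenever object_list is shorter than the group's world size. Below is a minimal stand-in in plain Python (no distributed runtime; all_gather_object_sim is a hypothetical helper that only mimics the write loop in distributed_c10d.py), not the actual ColossalAI code path:

```python
def all_gather_object_sim(object_list, obj, group_size):
    """Mimic the write loop in torch.distributed.all_gather_object:
    object_list[i] = _tensor_to_object(...) for each rank i in the group."""
    gathered = [obj] * group_size  # stand-in for the real collective
    for i in range(group_size):
        object_list[i] = gathered[i]  # raises IndexError if object_list is too short

# Output list correctly sized for the group: one slot per rank.
ok = [None, None]
all_gather_object_sim(ok, "state", group_size=2)
print(ok)  # ['state', 'state']

# Output list sized for a smaller group (e.g. sized by the wrong process
# group's world size) reproduces the exact error message in the traceback.
too_small = [None]
try:
    all_gather_object_sim(too_small, "state", group_size=2)
except IndexError as e:
    print(e)  # list assignment index out of range
```

If collect_states sizes gathered_state_shards using a different process group's world size than the one all_gather_object runs over under tp=2, this mismatch would produce exactly the failure above.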
Environment
ColossalAI: main branch
PyTorch: 2.0.1