
[BUG]: NaN loss while using Gemini + tensor_parallel #5110

@airlsyn

Description


🐛 Describe the bug

Running the llama2 benchmark.py with the following change:

if args.plugin == "gemini":
    plugin = GeminiPlugin(
        precision="bf16",
        shard_param_frac=args.shard_param_frac,
        offload_optim_frac=args.offload_optim_frac,
        offload_param_frac=args.offload_param_frac,
        tp_size=args.tp,
    )
....
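For context, the rest of the script is unchanged; the plugin is consumed through the usual Booster flow, roughly like this (a sketch, assuming model, optimizer, and dataloader are already built as in the example; variable names may differ from the actual script):

from colossalai.booster import Booster

# Wrap model/optimizer/dataloader with the chosen plugin (Gemini here).
booster = Booster(plugin=plugin)
model, optimizer, _, dataloader, _ = booster.boost(
    model, optimizer, dataloader=dataloader
)

for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    # Backward goes through the booster so Gemini can manage the sharded grads.
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()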
  1. plugin = '3d' + tp=2
     loss is normal
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
  ColossalAI/examples/language/llama2/benchmark.py \
  -c 7b -g -b 1 --max_length 4096 -p 3d --tp 2 --pp 1 --zero 2 -s 100

[screenshot: training log, loss stays normal]

  2. plugin = 'gemini' (tp=1)
     loss is normal
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
  ColossalAI/examples/language/llama2/benchmark.py \
  -c 7b -g -b 1 --max_length 4096 -p gemini --tp 1 --pp 1 --zero 2 -s 100

[screenshot: training log, loss stays normal]

  3. plugin = 'gemini' + tp=2
     loss is normal at first, then becomes NaN
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
  ColossalAI/examples/language/llama2/benchmark.py \
  -c 7b -g -b 1 --max_length 4096 -p gemini --tp 2 --pp 1 --zero 2 -s 100

[screenshot: training log, loss turning to NaN]

Why does the loss become NaN when using 'gemini' together with tensor parallelism, and how can this be fixed?

Thanks a lot.
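In case it helps to localize the problem, this is the kind of generic NaN check I can add around the training step (plain PyTorch, nothing ColossalAI-specific; note that Gemini shards the parameters, so .grad may not be directly inspectable on the wrapped model):

import math
import torch

# Optional: makes autograd raise on the op that first produces NaN/Inf (slow).
torch.autograd.set_detect_anomaly(True)

def check_finite(model: torch.nn.Module, step: int, loss: torch.Tensor) -> None:
    """Log the step where the loss or any reachable gradient first becomes non-finite."""
    if not math.isfinite(loss.item()):
        print(f"step {step}: non-finite loss {loss.item()}")
    for name, param in model.named_parameters():
        grad = param.grad
        if grad is not None and not torch.isfinite(grad).all():
            print(f"step {step}: non-finite grad in {name}")
            break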

Environment

PyTorch: 2.0.1
ColossalAI: main branch (CUDA_EXT=1 pip install .)
