🐛 Describe the bug
I'm hitting `AssertionError: you are calculating the l2 norm twice`, which looks similar to another issue.
I suspect it is related to `set_l2_norm`. Strangely, the error is not raised at the very beginning, but appears after several training steps. Once it happens, the loss becomes NaN.
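For context, a minimal sketch of the kind of guard that can produce this assertion (the class and method names here are hypothetical, not ColossalAI's actual implementation): a per-step flag marks the L2 norm as computed, the norm computation asserts the flag is clear, and the flag is reset at the end of the step. If a step exits early, e.g. because the loss went NaN and the update is skipped, the reset may never run and the next step trips the assertion.

```python
class GradStore:
    """Hypothetical sketch of a once-per-step L2 norm guard."""

    def __init__(self):
        self._l2_norm_computed = False
        self._l2_norm = None

    def compute_l2_norm(self, grads):
        # Mirrors the "calculating the l2 norm twice" assertion:
        # computing the norm again before reset() trips the guard.
        assert not self._l2_norm_computed, \
            "you are calculating the l2 norm twice"
        self._l2_norm = sum(g ** 2 for g in grads) ** 0.5
        self._l2_norm_computed = True
        return self._l2_norm

    def reset(self):
        # Must run at the end of every step; if an early exit
        # skips it, the next compute_l2_norm() call asserts.
        self._l2_norm_computed = False
```

This would be consistent with the error appearing only after several steps: the first step that produces a NaN loss skips the reset, and the assertion fires on the following step.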
To Reproduce
python -m torch.distributed.run --nproc_per_node=$GPU_NUM --nnodes=$WORLD_SIZE \
--node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT ./train_dreambooth_colossalai.py
I'm working with this DreamBooth example. Training fails in both single-machine and multi-machine setups.
Environment
I installed the latest ColossalAI from source, as instructed by the new README.