🐛 Describe the bug
I'm hitting `AssertionError: you are calculating the l2 norm twice`, which looks similar to another issue.
I suspect it is related to `set_l2_norm`. Strangely, the error is not raised at the very beginning, but appears after several training steps. Once it happens, the loss becomes NaN.
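For context, a minimal sketch of the kind of guard that can produce this assertion (the class and method names here are hypothetical, not ColossalAI's actual implementation): a per-step flag marks the L2 norm as computed, the norm computation asserts the flag is clear, and the flag is reset at the end of the step. If a step exits early, e.g. because the loss went NaN and the update is skipped, the reset may never run and the next step trips the assertion.

```python
class GradStore:
    """Hypothetical sketch of a once-per-step L2 norm guard."""

    def __init__(self):
        self._l2_norm_computed = False
        self._l2_norm = None

    def compute_l2_norm(self, grads):
        # Mirrors the "calculating the l2 norm twice" assertion:
        # computing the norm again before reset() trips the guard.
        assert not self._l2_norm_computed, \
            "you are calculating the l2 norm twice"
        self._l2_norm = sum(g ** 2 for g in grads) ** 0.5
        self._l2_norm_computed = True
        return self._l2_norm

    def reset(self):
        # Must run at the end of every step; if an early exit
        # skips it, the next compute_l2_norm() call asserts.
        self._l2_norm_computed = False
```

This would be consistent with the error appearing only after several steps: the first step that produces a NaN loss skips the reset, and the assertion fires on the following step.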
To Reproduce
python -m torch.distributed.run --nproc_per_node=$GPU_NUM --nnodes=$WORLD_SIZE \
--node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT ./train_dreambooth_colossalai.py
I'm working with this DreamBooth example. Training fails in both single-machine and multi-machine setups.
Environment
I installed the latest ColossalAI from source, as instructed by the new README.