🐛 Describe the bug
I added gradient accumulation to https://github.com/hpcaitech/ColossalAI/blob/e327e95144f4db8875531699e5b048f77cb80eba/examples/language/gpt/gemini/train_gpt_demo.py
and found that all output logits are 0 (the loss is 10.828) after one optimizer step.
If I use "zero2" instead, it runs fine.
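For reference, a minimal sketch of the gradient-accumulation change (not my exact diff: `ACCUM_STEPS` and the `data_iter` loop are placeholders, while `model`, `criterion`, and `optimizer` are the demo's existing objects, with `optimizer.backward(loss)` used on the Gemini path):

```python
ACCUM_STEPS = 4  # placeholder accumulation factor added for this test

for step, (input_ids, attn_mask) in enumerate(data_iter):
    outputs = model(input_ids, attn_mask)
    # Scale the loss so the accumulated gradient matches one large batch.
    loss = criterion(outputs, input_ids) / ACCUM_STEPS
    optimizer.backward(loss)  # accumulate gradients across micro-batches
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # one optimizer step per ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```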
Environment
Colossal-AI version: 0.1.12
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
System CUDA Version: 11.3
CUDA Version required by PyTorch: 11.3
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: ✓
CUDA Extension: ✓