🐛 Describe the bug
Using llama2 benchmark.py with the following change:

```python
if args.plugin == "gemini":
    plugin = GeminiPlugin(
        precision="bf16",
        shard_param_frac=args.shard_param_frac,
        offload_optim_frac=args.offload_optim_frac,
        offload_param_frac=args.offload_param_frac,
        tp_size=args.tp,
    )
...
```
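For context, the plugin is then handed to a Booster in the standard ColossalAI pattern. A minimal sketch of that step (variable names here are illustrative, not copied from benchmark.py):

```python
from colossalai.booster import Booster

# Wrap the model/optimizer/dataloader with the plugin's strategy
# (Gemini + TP in this case). Names below are assumptions.
booster = Booster(plugin=plugin)
model, optimizer, criterion, dataloader, _ = booster.boost(
    model, optimizer, criterion=criterion, dataloader=dataloader
)
```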
- plugin = '3d' + tp=2: loss is normal

```bash
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
    ColossalAI/examples/language/llama2/benchmark.py \
    -c 7b -g -b 1 --max_length 4096 -p 3d --tp 2 --pp 1 --zero 2 -s 100
```

- plugin = 'gemini': loss is normal

```bash
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
    ColossalAI/examples/language/llama2/benchmark.py \
    -c 7b -g -b 1 --max_length 4096 -p gemini --tp 1 --pp 1 --zero 2 -s 100
```

- plugin = 'gemini' + tp=2: loss is normal at first, but becomes NaN later

```bash
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
    ColossalAI/examples/language/llama2/benchmark.py \
    -c 7b -g -b 1 --max_length 4096 -p gemini --tp 2 --pp 1 --zero 2 -s 100
```

Why does the loss go to NaN when using 'gemini' together with tensor parallelism, and how can it be fixed?
Thanks a lot.
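To narrow it down, a guard like the one below can catch the first step where the loss becomes non-finite. This is only a sketch: the loop structure and the `dataloader`/`booster` names are assumptions about benchmark.py, not its actual code.

```python
import torch

for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    # Stop at the first non-finite loss so the offending step/rank
    # can be inspected.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"non-finite loss at step {step}: {loss}")
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()
```

Running once with `torch.autograd.set_detect_anomaly(True)` can additionally point to the first op that produces a NaN in backward, at the cost of much slower steps.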
Environment
PyTorch: 2.0.1
ColossalAI: main branch (`CUDA_EXT=1 pip install .`)