
[BUG]: NaN loss while using Gemini + tensor_parallel #5110

@airlsyn

Description


🐛 Describe the bug

Running the llama2 benchmark.py with the following change:

if args.plugin == "gemini":
    plugin = GeminiPlugin(
        precision="bf16",
        shard_param_frac=args.shard_param_frac,
        offload_optim_frac=args.offload_optim_frac,
        offload_param_frac=args.offload_param_frac,
        tp_size=args.tp,
    )
....
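For context, the rest of the script is unchanged; the plugin is consumed through the usual Booster flow, roughly like this (a sketch, assuming model, optimizer, and dataloader are already built as in the example; variable names may differ from the actual script):

from colossalai.booster import Booster

# Wrap model/optimizer/dataloader with the chosen plugin (Gemini here).
booster = Booster(plugin=plugin)
model, optimizer, _, dataloader, _ = booster.boost(
    model, optimizer, dataloader=dataloader
)

for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    # Backward goes through the booster so Gemini can manage the sharded grads.
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()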
  1. plugin = '3d' + tp=2
     loss is normal
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
  ColossalAI/examples/language/llama2/benchmark.py \
  -c 7b -g -b 1 --max_length 4096 -p 3d --tp 2 --pp 1 --zero 2 -s 100

[screenshot: training log, loss stays normal]

  2. plugin = 'gemini' (tp=1)
     loss is normal
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
  ColossalAI/examples/language/llama2/benchmark.py \
  -c 7b -g -b 1 --max_length 4096 -p gemini --tp 1 --pp 1 --zero 2 -s 100

[screenshot: training log, loss stays normal]

  3. plugin = 'gemini' + tp=2
     loss is normal at first, then becomes NaN
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
  ColossalAI/examples/language/llama2/benchmark.py \
  -c 7b -g -b 1 --max_length 4096 -p gemini --tp 2 --pp 1 --zero 2 -s 100

[screenshot: training log, loss turning to NaN]

Why does the loss become NaN when using 'gemini' together with tensor parallelism, and how can this be fixed?

Thanks a lot.
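In case it helps to localize the problem, this is the kind of generic NaN check I can add around the training step (plain PyTorch, nothing ColossalAI-specific; note that Gemini shards the parameters, so .grad may not be directly inspectable on the wrapped model):

import math
import torch

# Optional: makes autograd raise on the op that first produces NaN/Inf (slow).
torch.autograd.set_detect_anomaly(True)

def check_finite(model: torch.nn.Module, step: int, loss: torch.Tensor) -> None:
    """Log the step where the loss or any reachable gradient first becomes non-finite."""
    if not math.isfinite(loss.item()):
        print(f"step {step}: non-finite loss {loss.item()}")
    for name, param in model.named_parameters():
        grad = param.grad
        if grad is not None and not torch.isfinite(grad).all():
            print(f"step {step}: non-finite grad in {name}")
            break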

Environment

PyTorch: 2.0.1
ColossalAI: main branch (CUDA_EXT=1 pip install .)
