[BUG]: all output logits are 0 after one optimizer step (gemini & grad accumulate). #2496

@bestbzw


🐛 Describe the bug

I added gradient accumulation to https://github.com/hpcaitech/ColossalAI/blob/e327e95144f4db8875531699e5b048f77cb80eba/examples/language/gpt/gemini/train_gpt_demo.py
and found that all output logits are 0 (loss is 10.828) after a single optimizer step.

If I use "zero2" instead, it runs fine.

Environment

Colossal-AI version: 0.1.12

PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓

System CUDA Version: 11.3
CUDA Version required by PyTorch: 11.3
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: ✓

CUDA Extension: ✓
