[BUG]: TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode' #2590

@ivrschool


🐛 Describe the bug

I am able to run the training job with zero1 and zero2, but I am facing this issue with distplan 'colossalai':

```
Traceback (most recent call last):
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 373, in <module>
    main()
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 288, in main
    model, optimizer = build_gemini(model, tp_pg, args.placement, args.tp_degree == 1)
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 195, in build_gemini
    model = GeminiDDP(model,
TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode'
```

(The same traceback is printed by all four ranks.)

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2755) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
```
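The error suggests a version mismatch: the example script from `main` passes `strict_ddp_mode` to `GeminiDDP`, but the installed Colossal-AI release (0.1.12) predates that argument. As a temporary, hedged workaround, one can filter keyword arguments against the installed `__init__` signature before constructing the object. The helper below is hypothetical (not part of Colossal-AI), demonstrated here with a stand-in class rather than the real `GeminiDDP`:

```python
import inspect

def filter_supported_kwargs(cls, kwargs):
    """Drop kwargs that cls.__init__ does not accept (hypothetical helper)."""
    params = inspect.signature(cls.__init__).parameters
    # If __init__ takes **kwargs, everything is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

class OldGeminiDDP:
    """Stand-in for an older GeminiDDP that lacks strict_ddp_mode."""
    def __init__(self, module, device=None, placement_policy="cpu"):
        self.module = module
        self.device = device
        self.placement_policy = placement_policy

wanted = {"device": "cuda", "placement_policy": "auto", "strict_ddp_mode": True}
# strict_ddp_mode is silently dropped for the older signature.
model = OldGeminiDDP("module", **filter_supported_kwargs(OldGeminiDDP, wanted))
```

This only papers over the symptom; the proper fix is to install a Colossal-AI build that matches the example code being run.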

### Environment

I am using this branch: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt

I am using an AWS Ubuntu Deep Learning (PyTorch 1.12.1) EC2 g5.12xlarge instance (4 GPUs, 96 GB GPU memory, 48 vCPUs, 192 GB RAM):
 
```
Colossal-AI version: 0.1.12
----------------------------
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.6
CUDA Version required by PyTorch: 11.6
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: x
----------------------------
CUDA Extension: ✓
```
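Since the environment check above reports Colossal-AI 0.1.12 while the example comes from `main`, a quick guard in the training script can catch this mismatch before it surfaces as a confusing `TypeError`. The sketch below is a generic version comparison (the required version string is a placeholder assumption, not a value confirmed by Colossal-AI):

```python
def is_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically.

    Simple sketch: ignores pre-release tags and pads short versions
    with zeros so "0.2" compares like "0.2.0".
    """
    def parse(v: str) -> tuple:
        parts = []
        for tok in v.split("."):
            digits = "".join(ch for ch in tok if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)

    a, b = parse(installed), parse(required)
    n = max(len(a), len(b))
    return a + (0,) * (n - len(a)) >= b + (0,) * (n - len(b))

# Hypothetical usage in the example script (required version is an assumption):
# import colossalai
# if not is_at_least(colossalai.__version__, "0.2"):
#     raise RuntimeError("This example needs a newer colossalai; please upgrade.")
```

A guard like this fails fast with an actionable message instead of an opaque `TypeError` deep inside `build_gemini`.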
