I am able to run the training job with `zero1` and `zero2`, but I am facing the following issue with distplan `colossalai` (the same traceback is printed by each of the four ranks):
```
Traceback (most recent call last):
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 373, in <module>
    main()
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 288, in main
    model, optimizer = build_gemini(model, tp_pg, args.placement, args.tp_degree == 1)
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 195, in build_gemini
    model = GeminiDDP(model,
TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2755) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
```
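
The `TypeError` suggests that the installed Colossal-AI release predates the `strict_ddp_mode` argument that the `main`-branch example passes to `GeminiDDP`. To confirm this, the installed signature can be inspected; a minimal sketch (the import path follows the example script and is an assumption for 0.1.12):

```python
import inspect

# Import path as used by the main-branch example; this is an assumption
# for Colossal-AI 0.1.12 and may differ in other releases.
from colossalai.nn.parallel import GeminiDDP

# Print the parameters the installed GeminiDDP.__init__ actually accepts.
# On a release that raises the TypeError above, `strict_ddp_mode` should
# be missing from this list.
print(inspect.signature(GeminiDDP.__init__))
```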
### Environment
I am using the example from this directory on the `main` branch: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt
I am running on an AWS EC2 g5.12xlarge instance (4 GPUs, 96 GB GPU memory, 48 vCPUs, 192 GB RAM) with the Ubuntu Deep Learning AMI (PyTorch 1.12.1). The environment check reports:
```
Colossal-AI version: 0.1.12
----------------------------
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.6
CUDA Version required by PyTorch: 11.6
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: ✗
----------------------------
CUDA Extension: ✓
```
### 🐛 Describe the bug
Training runs fine with `zero1` and `zero2`, but with distplan `colossalai` it fails with the `TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode'` shown in the traceback above.
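
The likely root cause is a version mismatch between the pip-installed Colossal-AI 0.1.12 and the `main`-branch example script. Besides installing a matching build, a version-tolerant workaround would be to forward only the keyword arguments the installed `GeminiDDP.__init__` actually declares. A minimal sketch, assuming it is applied inside `build_gemini()` in `train_gpt_demo.py` (`filter_supported_kwargs` is a hypothetical helper; the other `GeminiDDP` arguments from the example are omitted here):

```python
import inspect

def filter_supported_kwargs(cls, kwargs):
    """Drop keyword arguments that cls.__init__ does not declare.

    Older Colossal-AI releases (e.g. 0.1.12) do not know newer
    arguments such as `strict_ddp_mode`, so only forward what the
    installed signature actually accepts.
    """
    params = inspect.signature(cls.__init__).parameters
    # If __init__ itself takes **kwargs, everything is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs
    return {k: v for k, v in kwargs.items() if k in params}

# Hypothetical use at the failing call site (train_gpt_demo.py, line 195);
# `strict_ddp` stands for the flag computed from `args.tp_degree == 1`:
# extra = filter_supported_kwargs(GeminiDDP, {"strict_ddp_mode": strict_ddp})
# model = GeminiDDP(model, **extra)
```

That said, the cleaner fix is presumably to install Colossal-AI from the same commit as the example (e.g. `pip install .` from the repository root) so the script and the library agree on the `GeminiDDP` signature.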