What happened?
I used Colossal-AI to train Stable Diffusion with the diffusers example script (https://github.com/huggingface/diffusers/blob/main/examples/research_projects/colossalai/train_dreambooth_colossalai.py) and got the following error:
Traceback (most recent call last):
  File "/workspace/mutilObject/diffusers/examples/research_projects/colossalai/train_dreambooth_colossalai.py", line 701, in <module>
    main(args)
  File "/workspace/mutilObject/diffusers/examples/research_projects/colossalai/train_dreambooth_colossalai.py", line 379, in main
    colossalai.launch_from_torch(config={})
  File "/root/miniconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/initialize.py", line 219, in launch_from_torch
    launch(config=config,
  File "/root/miniconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/initialize.py", line 99, in launch
    gpc.init_global_dist(rank, world_size, backend, host, port)
  File "/root/miniconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/context/parallel_context.py", line 374, in init_global_dist
    dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
  File "/root/miniconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 722, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
Reproduction
torchrun --nproc_per_node 2 train_dreambooth_colossalai.py \
  --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 \
  --instance_data_dir=./training_images \
  --class_data_dir=./class_images \
  --output_dir=./trained_models \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="coachchristo person" \
  --class_prompt="person" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --placement="cuda"
My environment configuration is as follows:
colossalai 0.2.0+torch1.13cu11.6
torch 1.13.1
diffusers 0.12.1
GPU 2*A100
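For anyone trying to compare environments, the package versions listed above can be collected with a short script. This is only a minimal sketch using the standard library's importlib.metadata; the helper names installed_version and environment_report are made up for illustration, and the script does not report GPU details.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or "not installed"."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("colossalai", "torch", "diffusers")) -> dict:
    """Map each package name to its installed version string."""
    # The default package list mirrors the environment listed in this report.
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name:<12}{version}")
```

Running it inside the same conda environment that launches torchrun should show whether the versions seen by the training script match the ones listed here.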