[hotfix] Fix typos for supporting distributed training #2415
haofanwang wants to merge 10 commits into hpcaitech:main from haofanwang:main
Conversation
feifeibear
left a comment
@haofanwang thanks for your warm-hearted contribution. However, I believe the original version should not be modified. The original intention is that if the backend is gloo, we don't need to use group_cpu; we can just use group directly.
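For context, the pattern under discussion presumably looks roughly like the sketch below (the function name and structure are assumptions for illustration, not the exact ColossalAI source): reuse the existing group when the default backend is already gloo, and only otherwise create a dedicated gloo group for CPU communication.

```python
# A minimal sketch of the discussed pattern; `init_groups` and its shape
# are assumptions for illustration, not the actual ColossalAI code.
import torch.distributed as dist

def init_groups(ranks):
    group = dist.new_group(ranks)  # created with the default backend
    # Only build a separate CPU (gloo) group when the default backend is
    # not gloo; if it already is gloo, `group` handles CPU tensors itself.
    group_cpu = dist.new_group(ranks, backend='gloo') if dist.get_backend() != 'gloo' else group
    return group, group_cpu
```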
I see, but is that expected? The gloo backend is usually meant for distributed CPU training. @feifeibear The default backend is nccl, which makes the check `dist.get_backend() != 'gloo'` evaluate to True, so group_cpu is always used in that case. To be more specific, gpc.init_global_dist() and gpc.init_parallel_groups() lead to a connection issue. With the changes above, however, distributed GPU training works fine.
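To illustrate the failure path described here, a hedged sketch under an assumed setup:

```python
# Hedged sketch of the reported failure mode; setup details are assumed.
import torch.distributed as dist

# Typical GPU setup; assumes MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE are provided via the environment.
dist.init_process_group(backend='nccl')

# With nccl as the default backend this check is always True, so the
# extra gloo group below is built on every initialization.
if dist.get_backend() != 'gloo':
    # The second rendezvous for this CPU group is where the connection
    # issue reportedly occurs in some environments.
    group_cpu = dist.new_group(backend='gloo')
```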
@feifeibear do we still need this CPU process group? Many users encounter environment issues when initializing it.
Your pre-commit check failed; follow the steps to run pre-commit on your files for code style consistency.
View your job at https://github.com/hpcaitech/ColossalAI/actions/runs/3890406692.
Hi @haofanwang. In our design,
@kurisusnowdeng @haofanwang Can we close this PR? It looks like the problem has been solved.
This PR fixes the distributed training problem mentioned in #2407.