
[hotfix] Fix typos for supporting distributed training #2415

Closed
haofanwang wants to merge 10 commits into hpcaitech:main from haofanwang:main

Conversation

@haofanwang
Contributor

This PR fixes the distributed training problem mentioned in #2407.

Contributor

@feifeibear left a comment


@haofanwang thanks for your warm-hearted contribution. However, I believe the original version should not be modified. The original intention is that if the backend is gloo, we don't need a separate group_cpu but can just use group directly.
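
For context, the intended logic reads roughly like the sketch below (illustrative variable names, not the exact ColossalAI source):

```python
import torch.distributed as dist

# Minimal sketch of the intent described above.
# Assumes dist.init_process_group() has already been called.
ranks = list(range(dist.get_world_size()))
group = dist.new_group(ranks)

if dist.get_backend() != 'gloo':
    # Default backend is e.g. nccl: create a separate gloo group for CPU comms.
    group_cpu = dist.new_group(ranks, backend='gloo')
else:
    # Default backend is already gloo: reuse the default group directly.
    group_cpu = group
```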

@haofanwang
Contributor Author

I see, but is that normal? The gloo backend is usually used for distributed CPU training. @feifeibear

The default backend is nccl, which makes the check `dist.get_backend() != 'gloo'` evaluate to True, so group_cpu is always created in that case. To be more specific, gpc.init_global_dist() and gpc.init_parallel_groups() then lead to a connection issue. With the changes above, however, distributed GPU training works fine.
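
To make the failure path concrete, here is a minimal repro sketch under the assumption that the job is launched with the default nccl backend (names are illustrative, not the exact source):

```python
import torch.distributed as dist

# Assumes the usual MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE env setup.
dist.init_process_group(backend='nccl')  # default backend is nccl
assert dist.get_backend() != 'gloo'      # so the check above is always True

# Creating this extra gloo group is where the connection issue shows up,
# reportedly inside gpc.init_global_dist / gpc.init_parallel_groups().
ranks = list(range(dist.get_world_size()))
group_cpu = dist.new_group(ranks, backend='gloo')
```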

@haofanwang requested a review from @feifeibear on January 10, 2023 at 11:00
@FrankLeeeee
Contributor

@feifeibear do we still need this CPU process group? Many users encounter environment issues when initializing this group.

@github-actions
Contributor

Your pre-commit check failed. Follow the steps below to run pre-commit on your files for code style consistency.

  1. Install pre-commit via `pip install pre-commit`.
  2. Install the pre-commit hooks via `pre-commit install`.
  3. Run pre-commit on each file with a format error via `pre-commit run --files path`, replacing `path` with the actual file path.
  4. Commit and push to your branch.

View your job at https://github.com/hpcaitech/ColossalAI/actions/runs/3890406692.
Read our CONTRIBUTING.md for more details on code style.

@kurisusnowdeng
Contributor

Hi @haofanwang. In our design, get_group() returns the default group (whose backend can be nccl, gloo, etc.). Meanwhile, since the default group is usually used for GPU communication, we need get_cpu_group() to return a gloo group for CPU communication. In your case, just specify the default backend as gloo, and get_cpu_group() will redirect you to the default group. With your change, however, you would be duplicating the gloo group, so unfortunately this is not how it is supposed to work.
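
A hedged sketch of the suggested CPU-training setup (the launch and gpc entry points are assumed from ColossalAI releases around this time; check your version's docs):

```python
import os
import colossalai
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

# Launch with gloo as the default backend for pure CPU training.
colossalai.launch(config={},
                  rank=int(os.environ['RANK']),
                  world_size=int(os.environ['WORLD_SIZE']),
                  host=os.environ['MASTER_ADDR'],
                  port=int(os.environ['MASTER_PORT']),
                  backend='gloo')

# With a gloo default backend, get_cpu_group() redirects to the default
# group rather than creating a duplicate gloo group.
group = gpc.get_group(ParallelMode.GLOBAL)
cpu_group = gpc.get_cpu_group(ParallelMode.GLOBAL)
```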

@feifeibear
Contributor

@kurisusnowdeng @haofanwang Can we close this PR? It looks like the problem has been solved.

@haofanwang closed this on Feb 3, 2023