
[hotfix] Fix typos for supporting distributed training #2415

Closed
haofanwang wants to merge 10 commits into hpcaitech:main from haofanwang:main

Conversation

@haofanwang
Contributor

This PR fixes the distributed training problem mentioned in #2407.

Contributor

@feifeibear left a comment


@haofanwang thanks for your warm-hearted contribution. However, I believe the original version should not be modified. The original intention is that if the backend is gloo, we don't need a separate group_cpu but can just use group directly.
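
For context, the intended logic reads roughly like the sketch below (illustrative variable names, not the exact ColossalAI source):

```python
import torch.distributed as dist

# Minimal sketch of the intent described above.
# Assumes dist.init_process_group() has already been called.
ranks = list(range(dist.get_world_size()))
group = dist.new_group(ranks)

if dist.get_backend() != 'gloo':
    # Default backend is e.g. nccl: create a separate gloo group for CPU comms.
    group_cpu = dist.new_group(ranks, backend='gloo')
else:
    # Default backend is already gloo: reuse the default group directly.
    group_cpu = group
```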

@haofanwang
Contributor Author

I see, but is that normal? The gloo backend is usually used for distributed CPU training. @feifeibear

The default backend is nccl, which makes the check `dist.get_backend() != 'gloo'` evaluate to True, so group_cpu is always created in that case. To be more specific, gpc.init_global_dist() and gpc.init_parallel_groups() then lead to a connection issue. With the changes above, however, distributed GPU training works fine.
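
To make the failure path concrete, here is a minimal repro sketch under the assumption that the job is launched with the default nccl backend (names are illustrative, not the exact source):

```python
import torch.distributed as dist

# Assumes the usual MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE env setup.
dist.init_process_group(backend='nccl')  # default backend is nccl
assert dist.get_backend() != 'gloo'      # so the check above is always True

# Creating this extra gloo group is where the connection issue shows up,
# reportedly inside gpc.init_global_dist / gpc.init_parallel_groups().
ranks = list(range(dist.get_world_size()))
group_cpu = dist.new_group(ranks, backend='gloo')
```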

@haofanwang requested a review from @feifeibear on January 10, 2023 at 11:00
@FrankLeeeee
Contributor

@feifeibear do we still need this CPU process group? Many users encounter environment issues when initializing this group.

@github-actions
Contributor

Your pre-commit check failed. Follow the steps below to run pre-commit on your files for code style consistency.

  1. Install pre-commit via `pip install pre-commit`.
  2. Install the pre-commit hooks via `pre-commit install`.
  3. Run pre-commit on each file with a format error via `pre-commit run --files path`, replacing `path` with the actual file path.
  4. Commit and push to your branch.

View your job at https://github.com/hpcaitech/ColossalAI/actions/runs/3890406692.
Read our CONTRIBUTING.md for more details on code style.

@kurisusnowdeng
Contributor

Hi @haofanwang. In our design, get_group() returns the default group (whose backend can be nccl, gloo, etc.). Meanwhile, since the default group is usually used for GPU communication, we need get_cpu_group() to return a gloo group for CPU communication. In your case, just specify the default backend as gloo, and get_cpu_group() will redirect you to the default group. With your change, however, you would be duplicating the gloo group, so unfortunately this is not how it is supposed to work.
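
A hedged sketch of the suggested CPU-training setup (the launch and gpc entry points are assumed from ColossalAI releases around this time; check your version's docs):

```python
import os
import colossalai
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

# Launch with gloo as the default backend for pure CPU training.
colossalai.launch(config={},
                  rank=int(os.environ['RANK']),
                  world_size=int(os.environ['WORLD_SIZE']),
                  host=os.environ['MASTER_ADDR'],
                  port=int(os.environ['MASTER_PORT']),
                  backend='gloo')

# With a gloo default backend, get_cpu_group() redirects to the default
# group rather than creating a duplicate gloo group.
group = gpc.get_group(ParallelMode.GLOBAL)
cpu_group = gpc.get_cpu_group(ParallelMode.GLOBAL)
```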

@feifeibear
Contributor

@kurisusnowdeng @haofanwang Can we close this PR? It looks like the problem has been solved.

@haofanwang closed this on Feb 3, 2023