[BUG]: colossalai run is stuck during multi-nodes training #4164
Description

@ver217

🐛 Describe the bug

When using `colossalai run` for multi-node training, the launch hangs before the distributed process group is initialized.

This is most likely caused by an incorrect launch command.
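A hang at this stage usually means the rendezvous never completes, often because the launch command left the environment variables that `torch.distributed` relies on unset or inconsistent across nodes. The snippet below is a minimal diagnostic sketch (the helper name `check_rendezvous_env` is hypothetical, not part of ColossalAI) that can be run on each node to rule this out before digging into network issues:

```python
import os

# Rendezvous variables that torch.distributed's env:// init method reads.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def check_rendezvous_env(env=None):
    """Return the required rendezvous variables that are missing or empty.

    Hypothetical helper for debugging: a correct launcher should have
    exported all of these on every node before init_process_group is called.
    """
    if env is None:
        env = os.environ
    return [var for var in REQUIRED_VARS if not env.get(var)]

if __name__ == "__main__":
    missing = check_rendezvous_env()
    if missing:
        print(f"Possible launch-command problem; missing env vars: {missing}")
    else:
        print("Rendezvous env looks complete; check node connectivity next.")
```

If any variable is reported missing on any node, the launch command (hostfile, master address, or per-node rank assignment) is the first thing to fix.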

Environment

Python 3.8.0
torch 1.12.1+cu113
CUDA 11.4

Labels

bug: Something isn't working