
Description
Communication initialization fails when the number of nodes is set to 3.
The failure occurs in get_group.
The relevant part of the traceback is as follows (run on 3 nodes, each node with 2 GPUs):
Traceback (most recent call last):
File "pretrain_gpt.py", line 149, in <module>
pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
finish_mpu_init()
File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
_initialize_distributed()
File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
mpu.initialize_model_parallel_flexpipe()
File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
get_group(ranks)
File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
group_bits = bitmap(ranks)
File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
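
For reference, below is a minimal sketch of the kind of bounds check that produces the ValueError above. This is illustrative only, not the repository's actual code; the bitmap function and its world_size argument here are assumptions. With 3 nodes x 2 GPUs the world has 6 ranks (0-5), so any group list that includes rank 6 trips the check.

# Minimal sketch (illustrative, not the Aceso implementation): a bitmap sized to
# the world size rejects any rank index equal to or beyond its length.

def bitmap(ranks, world_size):
    """Hypothetical stand-in for the bitmap helper: one bit per rank."""
    bits = [0] * world_size
    for rank in ranks:
        if rank >= len(bits):
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits

# 3 nodes x 2 GPUs -> world_size 6, valid ranks 0-5; a group listing rank 6 fails.
try:
    bitmap([4, 5, 6], world_size=6)
except ValueError as e:
    print(e)  # prints: rank 6 out of range (6)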