This repository was archived by the owner on Feb 6, 2025. It is now read-only.

Failure to initiate communication #22

@YuMJie

Description

Communication initialization fails when the number of nodes is set to 3.

The failure occurs in `get_group`.

The relevant part of the failure is as follows (run on 3 nodes, each with 2 GPUs):

```
Traceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
  File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
    finish_mpu_init()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
    mpu.initialize_model_parallel_flexpipe()
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
    get_group(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
    group_bits = bitmap(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
    raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
```
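For context, here is a minimal, self-contained sketch of the kind of bounds check that raises this error. It is not the actual Aceso code; the `world_size` value and the test ranks below are assumptions chosen to mirror the reported setup (3 nodes x 2 GPUs = 6 ranks, numbered 0-5):

```python
def bitmap(ranks, world_size=6):
    # Illustrative re-creation, not the code in megatron/mpu/initialize.py:
    # one bit per global rank; world_size=6 mirrors 3 nodes x 2 GPUs.
    bits = [0] * world_size
    for rank in ranks:
        if rank >= len(bits):
            # The check that fires in the traceback: the requested group
            # contains rank 6, but only ranks 0-5 exist in the bitmap.
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits

try:
    bitmap([0, 6])  # hypothetical group containing an out-of-range rank
except ValueError as e:
    print(e)  # rank 6 out of range (6)
```

From the message alone, the group passed to `get_group` appears to contain rank 6, while the bitmap only has slots for ranks 0-5, so the rank list built by `initialize_model_parallel_flexpipe` and the bitmap size disagree when the node count is 3.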
