
Description
Communication initialization fails when the number of nodes is set to 3.
The failure occurs in get_group.
The relevant part of the traceback is as follows (run on 3 nodes, each node with 2 GPUs):
Traceback (most recent call last):
File "pretrain_gpt.py", line 149, in <module>
pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
finish_mpu_init()
File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
_initialize_distributed()
File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
mpu.initialize_model_parallel_flexpipe()
File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
get_group(ranks)
File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
group_bits = bitmap(ranks)
File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
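
For reference, below is a minimal sketch of the kind of bounds check that produces the ValueError above. This is illustrative only, not the repository's actual code; the bitmap function and its world_size argument here are assumptions. With 3 nodes x 2 GPUs the world has 6 ranks (0-5), so any group list that includes rank 6 trips the check.

# Minimal sketch (illustrative, not the Aceso implementation): a bitmap sized to
# the world size rejects any rank index equal to or beyond its length.

def bitmap(ranks, world_size):
    """Hypothetical stand-in for the bitmap helper: one bit per rank."""
    bits = [0] * world_size
    for rank in ranks:
        if rank >= len(bits):
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits

# 3 nodes x 2 GPUs -> world_size 6, valid ranks 0-5; a group listing rank 6 fails.
try:
    bitmap([4, 5, 6], world_size=6)
except ValueError as e:
    print(e)  # prints: rank 6 out of range (6)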