test_distributed_sampler not working (and not exiting) #1650

@rijobro

Describe the bug
./runtests.sh --quick (and more specifically python tests/test_distributed_sampler.py) displays an error, but it doesn't exit and hangs there permanently.

Error is:

(base) rbrown@rb-monai-0:~/MONAI$ python tests/test_distributed_sampler.py 
Process Process-2:
Traceback (most recent call last):
  File "/home/rbrown/MONAI/tests/utils.py", line 285, in run_process
    raise e
  File "/home/rbrown/MONAI/tests/utils.py", line 272, in run_process
    torch.cuda.set_device(int(local_rank))
  File "/home/rbrown/.conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 263, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rbrown/.conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/rbrown/.conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rbrown/MONAI/tests/utils.py", line 289, in run_process
    dist.destroy_process_group()
  File "/home/rbrown/.conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 573, in destroy_process_group
    raise RuntimeError("Invalid process group specified")
RuntimeError: Invalid process group specified

My guess is that it's because this machine has only 1 GPU, yet the test is trying to use devices 0 and 1, but that's just a guess. The test works fine on my local machine, which doesn't have CUDA and so never reaches the problematic statement:

if torch.cuda.is_available():
    torch.cuda.set_device(int(local_rank))
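
If the guess above is right, the guard would need to check the number of visible devices rather than just CUDA availability. Here is a minimal sketch of that check, written as a pure function so it can be tried without a GPU. The name `resolve_device_index` and its parameters are hypothetical and not part of MONAI or PyTorch.

```python
from typing import Optional


def resolve_device_index(local_rank: int, device_count: int) -> Optional[int]:
    """Return the device index to use for this rank, or None if it has no device.

    Hypothetical helper: `device_count` is assumed to be the number of
    visible CUDA devices (0 on a machine without CUDA).
    """
    if local_rank < 0:
        raise ValueError(f"local_rank must be non-negative, got {local_rank}")
    # A rank can only be pinned to a device whose ordinal actually exists.
    if local_rank < device_count:
        return local_rank
    return None


# One visible GPU, two ranks: rank 0 gets device 0, rank 1 gets nothing.
print(resolve_device_index(0, 1))
print(resolve_device_index(1, 1))
```

In the test helper, `device_count` could presumably come from `torch.cuda.device_count()`, and a `None` result could be used to skip the test or run that rank on CPU instead of calling `torch.cuda.set_device`.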

It seems to me that there are two separate problems:

  1. the test fails on a machine with a single GPU
  2. the test doesn't exit after the error is printed.

Labels: CI/CD, bug
