Describe the bug
./runtests.sh --quick (and more specifically python tests/test_distributed_sampler.py) displays an error, but doesn't exit and hangs there permanently.
The error is:
(base) rbrown@rb-monai-0:~/MONAI$ python tests/test_distributed_sampler.py
Process Process-2:
Traceback (most recent call last):
File "/home/rbrown/MONAI/tests/utils.py", line 285, in run_process
raise e
File "/home/rbrown/MONAI/tests/utils.py", line 272, in run_process
torch.cuda.set_device(int(local_rank))
File "/home/rbrown/.conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 263, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/rbrown/.conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/rbrown/.conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/rbrown/MONAI/tests/utils.py", line 289, in run_process
dist.destroy_process_group()
File "/home/rbrown/.conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 573, in destroy_process_group
raise RuntimeError("Invalid process group specified")
RuntimeError: Invalid process group specified

My guess is that it's because I only have 1 GPU, yet it's trying to use devices 0 and 1, but that's just a guess. The test works fine on my local machine -- it doesn't have CUDA, so it never reaches the problematic statement:
if torch.cuda.is_available():
    torch.cuda.set_device(int(local_rank))

It seems to me that there are two problems:
- that the test doesn't work on a machine with fewer GPUs than the ranks it spawns
- that it doesn't exit when the error is printed, and hangs instead.
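For the first point, a minimal sketch of a possible guard is below. The wrapper name run_worker and its arguments are hypothetical, not MONAI's actual test helper; the idea is simply to only bind a CUDA device when the requested rank actually exists on the machine, and to only tear down the process group if it was initialized.

import torch
import torch.distributed as dist

def run_worker(local_rank, fn, *args):
    # Hypothetical wrapper around a per-process test body.
    # Only bind to a CUDA device if this rank exists on the machine,
    # so a single-GPU box doesn't hit "CUDA error: invalid device ordinal".
    if torch.cuda.is_available() and int(local_rank) < torch.cuda.device_count():
        torch.cuda.set_device(int(local_rank))
    try:
        fn(*args)
    finally:
        # Only tear down the process group if it was actually initialized,
        # which avoids the secondary "Invalid process group specified" error
        # and lets the process exit instead of hanging.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()

An alternative would be to skip the test entirely when torch.cuda.device_count() is smaller than the number of ranks the test requests.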