Error executing job with overrides: []
[rank0]:[E1113 13:57:32.273333750 ProcessGroupNCCL.cpp:1896] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7189217775e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x71892170c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7189248a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7185754e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7185754f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7185754f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7185754f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x718c40cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x718c41e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x1268c0 (0x718c41f268c0 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7189217775e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x71892170c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7189248a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7185754e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7185754f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7185754f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7185754f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x718c40cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x718c41e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x1268c0 (0x718c41f268c0 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7189217775e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7a4e (0x7185754c7a4e in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7185751165ed in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x718c40cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x718c41e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x1268c0 (0x718c41f268c0 in /lib/x86_64-linux-gnu/libc.so.6)
Error executing job with overrides: []
[rank1]:[E1113 13:57:32.346752417 ProcessGroupNCCL.cpp:1896] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e3be53c75e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7e3be535c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7e3be54a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e38364e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7e38364f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7e38364f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e38364f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7e3f022dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7e3f03494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x1268c0 (0x7e3f035268c0 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e3be53c75e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7e3be535c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7e3be54a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e38364e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7e38364f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7e38364f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e38364f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7e3f022dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7e3f03494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x1268c0 (0x7e3f035268c0 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e3be53c75e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7a4e (0x7e38364c7a4e in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7e38361165ed in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7e3f022dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7e3f03494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x1268c0 (0x7e3f035268c0 in /lib/x86_64-linux-gnu/libc.so.6)
W1113 13:57:33.581000 2055138 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2055222 closing signal SIGTERM
E1113 13:57:33.846000 2055138 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 0 (pid: 2055221) of binary: /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/run.py", line 896, in <module>
main()
File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
scripts/reinforcement_learning/rsl_rl/train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-11-13_13:57:33
host : szdx-G7666-X6
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 2055221)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2055221
======================================================
[bug]
Describe the bug
When running multi-GPU distributed training (nproc_per_node=2), the script fails for the Isaac-Velocity-Rough-G1-v0 task but works correctly for the Isaac-Reach-Franka-v0 task.
The error appears on both ranks (rank 0 and rank 1) and reads "CUDA error: an illegal memory access was encountered". It surfaces in the PyTorch distributed backend (ProcessGroupNCCL) and causes the process group watchdog to terminate, ultimately leading to a ChildFailedError (Signal 6, SIGABRT).
Steps to reproduce
Run the following command (which works correctly) to verify the base distributed setup:
Run the failing command:
The following error occurs:
System Info
Describe the characteristics of your environment:
Ubuntu 22.04.5 LTS (Jammy Jellyfish) (Kernel: 6.8.0-86-generic)
RTX 5090
580.82.09
Additional context
The problem seems specific to the Isaac-Velocity-Rough-G1-v0 task environment when used in distributed mode. The fact that Isaac-Reach-Franka-v0 works suggests that the base environment, PyTorch installation, and NCCL setup are likely correct, and that some tensor operation or memory access specific to the G1 task is causing an illegal memory access on the GPU.
Checklist
Acceptance Criteria
The following command runs successfully without the error "CUDA error: an illegal memory access was encountered":
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Velocity-Rough-G1-v0 --headless --distributed
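Because the error is raised by the NCCL watchdog thread, the trace above does not show which operation in the G1 task performed the bad access. One possible way to narrow it down is to rerun the failing command with synchronous CUDA launches, so that the error may be reported closer to the offending kernel. The sketch below only rebuilds the command from this report and adds two environment variables (CUDA_LAUNCH_BLOCKING=1 and NCCL_DEBUG=INFO). The helper name build_launch is hypothetical and not part of Isaac Lab or PyTorch, and whether the extra output pinpoints the cause here is an assumption, not something verified in this report.

```python
import os
import sys


def build_launch(task, nproc_per_node=2, python=sys.executable):
    """Return (argv, env) for the single-node distributed run from this report.

    Hypothetical helper for illustration; the command-line flags are copied
    from the failing command, the environment variables are additions.
    """
    argv = [
        python, "-m", "torch.distributed.run",
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/reinforcement_learning/rsl_rl/train.py",
        f"--task={task}",
        "--headless",
        "--distributed",
    ]
    env = dict(os.environ)
    # Make CUDA kernel launches synchronous so errors surface at the call site.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask NCCL for verbose logging about communicator setup and collectives.
    env["NCCL_DEBUG"] = "INFO"
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch("Isaac-Velocity-Rough-G1-v0")
    # Print the command; pass argv and env to subprocess.run to execute it.
    print(" ".join(argv))
```

Running with synchronous launches is slower, so this is meant for a short diagnostic run rather than for training. If the resulting Python traceback points at a specific tensor operation, that would be useful to attach to this issue.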