[Bug Report] NCCL "illegal memory access" error during distributed training for Isaac-Velocity-Rough-G1-v0 #4011

@LZJ910

[bug]

Describe the bug

When running multi-GPU distributed training (nproc_per_node=2), the script fails for the Isaac-Velocity-Rough-G1-v0 task, but works correctly for the Isaac-Reach-Franka-v0 task.

Both ranks (rank 0 and rank 1) report the same failure: CUDA error: an illegal memory access was encountered. The error is raised by the watchdog thread of the PyTorch distributed backend (ProcessGroupNCCL), which then terminates. The process aborts with Signal 6 (SIGABRT), and the launcher raises a ChildFailedError.

Steps to reproduce

  1. Run the following command (which works correctly) to verify the base distributed setup:

    python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Reach-Franka-v0 --headless --distributed
  2. Run the failing command:

    python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Velocity-Rough-G1-v0 --headless --distributed
  3. The following error occurs:

    Error executing job with overrides: []
    [rank0]:[E1113 13:57:32.273333750 ProcessGroupNCCL.cpp:1896] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    
    Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7189217775e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x71892170c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7189248a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7185754e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7185754f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7185754f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7185754f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: <unknown function> + 0xdc253 (0x718c40cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #8: <unknown function> + 0x94ac3 (0x718c41e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #9: <unknown function> + 0x1268c0 (0x718c41f268c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    terminate called after throwing an instance of 'c10::DistBackendError'
      what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    
    Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7189217775e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x71892170c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7189248a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7185754e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7185754f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7185754f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7185754f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: <unknown function> + 0xdc253 (0x718c40cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #8: <unknown function> + 0x94ac3 (0x718c41e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #9: <unknown function> + 0x1268c0 (0x718c41f268c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7189217775e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: <unknown function> + 0xcc7a4e (0x7185754c7a4e in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: <unknown function> + 0x9165ed (0x7185751165ed in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: <unknown function> + 0xdc253 (0x718c40cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #4: <unknown function> + 0x94ac3 (0x718c41e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #5: <unknown function> + 0x1268c0 (0x718c41f268c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Error executing job with overrides: []
    [rank1]:[E1113 13:57:32.346752417 ProcessGroupNCCL.cpp:1896] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    
    Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e3be53c75e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7e3be535c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7e3be54a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e38364e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7e38364f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7e38364f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e38364f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: <unknown function> + 0xdc253 (0x7e3f022dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #8: <unknown function> + 0x94ac3 (0x7e3f03494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #9: <unknown function> + 0x1268c0 (0x7e3f035268c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    terminate called after throwing an instance of 'c10::DistBackendError'
      what():  [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    
    Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e3be53c75e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7e3be535c4a2 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7e3be54a5422 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e38364e5456 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7e38364f56f0 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7e38364f7282 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e38364f8e8d in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: <unknown function> + 0xdc253 (0x7e3f022dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #8: <unknown function> + 0x94ac3 (0x7e3f03494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #9: <unknown function> + 0x1268c0 (0x7e3f035268c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e3be53c75e8 in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: <unknown function> + 0xcc7a4e (0x7e38364c7a4e in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: <unknown function> + 0x9165ed (0x7e38361165ed in /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: <unknown function> + 0xdc253 (0x7e3f022dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #4: <unknown function> + 0x94ac3 (0x7e3f03494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #5: <unknown function> + 0x1268c0 (0x7e3f035268c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    W1113 13:57:33.581000 2055138 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2055222 closing signal SIGTERM
    E1113 13:57:33.846000 2055138 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 0 (pid: 2055221) of binary: /mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/bin/python
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/run.py", line 896, in <module>
        main()
      File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
        return f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
      File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
        run(args)
      File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
        elastic_launch(
      File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/mnt/ssd2/homes/lzj/miniconda3/envs/env_isaaclab/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ========================================================
    scripts/reinforcement_learning/rsl_rl/train.py FAILED
    --------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    --------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2025-11-13_13:57:33
      host      : szdx-G7666-X6
      rank      : 0 (local_rank: 0)
      exitcode : -6 (pid: 2055221)
      error_file: <N/A>
      traceback : Signal 6 (SIGABRT) received by PID 2055221
    ======================================================
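
To compare what each rank reported without reading the full trace, the watchdog lines can be pulled out of the log. The following is a minimal sketch, assuming the log keeps the `[rankN]:[E<date> <time> ProcessGroupNCCL.cpp:<line>]` prefix seen above; the names `WATCHDOG_LINE` and `summarize_watchdog_errors` are hypothetical helpers, not part of Isaac Lab or PyTorch.

```python
import re

# Matches the first line each rank's NCCL watchdog prints, for example:
# [rank0]:[E1113 13:57:32.273333750 ProcessGroupNCCL.cpp:1896] ... exception: CUDA error: ...
WATCHDOG_LINE = re.compile(
    r"\[rank(?P<rank>\d+)\]:\[E\d+ (?P<time>[\d:.]+) ProcessGroupNCCL\.cpp:\d+\]"
    r".*?terminated with exception: (?P<message>.+)$"
)


def summarize_watchdog_errors(log_text):
    """Return {rank: (timestamp, message)} for the first watchdog error of each rank."""
    errors = {}
    for line in log_text.splitlines():
        match = WATCHDOG_LINE.search(line)
        if match:
            # setdefault keeps only the first report seen for a given rank.
            errors.setdefault(
                int(match["rank"]), (match["time"], match["message"].strip())
            )
    return errors
```

Applied to the log above, this should yield one entry per rank with the same message, timestamped 13:57:32.273333750 for rank 0 and 13:57:32.346752417 for rank 1.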

System Info

Describe the characteristics of your environment:

  • Commit: c32db68
  • Isaac Sim Version: 5.1.0
  • OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish) (Kernel: 6.8.0-86-generic)
  • GPU: RTX 5090
  • CUDA: 12.8.93
  • GPU Driver: 580.82.09

Additional context

The problem seems specific to the Isaac-Velocity-Rough-G1-v0 task environment when used in distributed mode. The fact that Isaac-Reach-Franka-v0 works suggests the base environment, PyTorch installation, and NCCL setup are likely correct, but some tensor operation or memory access specific to the G1 task is causing an illegal memory access on the GPU.
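
One possible way to narrow this down: CUDA reports kernel faults asynchronously, so the NCCL watchdog may only be the first place the error is noticed, not where it occurs. Rerunning the failing command with `CUDA_LAUNCH_BLOCKING=1` (a standard CUDA environment variable) and `NCCL_DEBUG=INFO` (a standard NCCL one) should bring the reported failure closer to the faulting kernel. The sketch below only builds the command and environment; `build_debug_launch` is a hypothetical helper name, and the command itself is the failing one from the steps above.

```python
import os
import sys


def build_debug_launch(task, nproc=2, base_env=None):
    """Return (cmd, env) for rerunning the job with synchronous CUDA launches."""
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous
    env["NCCL_DEBUG"] = "INFO"         # have NCCL log its setup and collectives
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        "--nnodes=1", f"--nproc_per_node={nproc}",
        "scripts/reinforcement_learning/rsl_rl/train.py",
        f"--task={task}", "--headless", "--distributed",
    ]
    return cmd, env
```

Passing the result to `subprocess.run(cmd, env=env)` from the repository root would launch the run. Synchronous launches slow training considerably, so this is meant for a short diagnostic run only.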

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have checked that the issue is not in running Isaac Sim itself and is related to the repo

Acceptance Criteria

  • The command python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Velocity-Rough-G1-v0 --headless --distributed runs successfully without the CUDA error: an illegal memory access was encountered.
