diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 2537e5eff25b..03277d26e6e1 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -124,6 +124,36 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
 
             python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
+.. _multi-gpu-nccl-troubleshooting:
+
+Troubleshooting NCCL Errors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+On some Linux multi-GPU systems, distributed training may fail with
+``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``
+during or shortly after communicator initialization.
+
+If this occurs, try disabling the NCCL shared-memory transport before launching training:
+
+.. code-block:: shell
+
+    export NCCL_SHM_DISABLE=1
+
+If the issue persists, additional NCCL fallbacks that may help are:
+
+.. code-block:: shell
+
+    export NCCL_IB_DISABLE=1
+    export NCCL_ALGO=Ring
+
+Then relaunch the distributed training command as usual.
+
+.. note::
+
+    These variables are NCCL-level workarounds intended for affected systems. They are not
+    required on all machines, and may change communication behavior or performance depending
+    on the hardware topology.
+
 Multi-Node Training
 -------------------
 
diff --git a/docs/source/refs/troubleshooting.rst b/docs/source/refs/troubleshooting.rst
index 8f3a82f3f150..d14e75f1fe3c 100644
--- a/docs/source/refs/troubleshooting.rst
+++ b/docs/source/refs/troubleshooting.rst
@@ -9,6 +9,14 @@ Tricks and Troubleshooting
     assistance.
 
 
+Troubleshooting distributed training NCCL errors
+------------------------------------------------
+
+On some Linux multi-GPU systems, distributed training may fail with
+``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``.
+For documented NCCL workarounds, see :ref:`multi-gpu-nccl-troubleshooting`.
+
+
 Debugging physics simulation stability issues
 ---------------------------------------------
 