From 151f2445610ae90377c6744314210a65336415a3 Mon Sep 17 00:00:00 2001
From: bxwang
Date: Tue, 7 Apr 2026 15:56:55 +0800
Subject: [PATCH 1/5] docs: add NCCL troubleshooting notes for multi-GPU training

Signed-off-by: bxwang
---
 docs/source/features/multi_gpu.rst | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 2537e5eff25b..157fb26ead8d 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -124,6 +124,35 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
 
             python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
+Troubleshooting NCCL Errors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+On some Linux multi-GPU systems, distributed training may fail with
+``CUDA error: an illegal memory access was encountered`` reported by
+``ProcessGroupNCCL`` during or shortly after communicator initialization.
+
+If this occurs, try disabling the NCCL shared-memory transport before
+launching training:
+
+.. code-block:: bash
+
+    export NCCL_SHM_DISABLE=1
+
+If the issue persists, additional NCCL fallbacks that may help are:
+
+.. code-block:: bash
+
+    export NCCL_IB_DISABLE=1
+    export NCCL_ALGO=Ring
+
+Then relaunch the distributed training command as usual.
+
+.. note::
+
+    These variables are NCCL-level workarounds intended for affected systems.
+    They are not required on all machines, and may change communication
+    behavior or performance depending on the hardware topology.
+
 Multi-Node Training
 -------------------
 

From fce1e51ee1c63b66c3f959c878fa56ceb1b82cab Mon Sep 17 00:00:00 2001
From: bxwang
Date: Tue, 7 Apr 2026 16:05:17 +0800
Subject: [PATCH 2/5] docs: reflow NCCL troubleshooting notes

Signed-off-by: bxwang
---
 docs/source/features/multi_gpu.rst | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 157fb26ead8d..59e543a60449 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -128,11 +128,10 @@ Troubleshooting NCCL Errors
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 On some Linux multi-GPU systems, distributed training may fail with
-``CUDA error: an illegal memory access was encountered`` reported by
-``ProcessGroupNCCL`` during or shortly after communicator initialization.
+``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``
+during or shortly after communicator initialization.
 
-If this occurs, try disabling the NCCL shared-memory transport before
-launching training:
+If this occurs, try disabling the NCCL shared-memory transport before launching training:
 
 .. code-block:: bash
 
@@ -149,9 +148,9 @@ Then relaunch the distributed training command as usual.
 
 .. note::
 
-    These variables are NCCL-level workarounds intended for affected systems.
-    They are not required on all machines, and may change communication
-    behavior or performance depending on the hardware topology.
+    These variables are NCCL-level workarounds intended for affected systems. They are not
+    required on all machines, and may change communication behavior or performance depending
+    on the hardware topology.
 
 Multi-Node Training
 -------------------

From d67f63df2e448ef4163a81be0df3a8af866c3787 Mon Sep 17 00:00:00 2001
From: bixiong wang
Date: Tue, 7 Apr 2026 17:20:14 +0800
Subject: [PATCH 3/5] Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: bixiong wang
---
 docs/source/features/multi_gpu.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 59e543a60449..3449646925f4 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -125,7 +125,7 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
             python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
 Troubleshooting NCCL Errors
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 On some Linux multi-GPU systems, distributed training may fail with
 ``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``

From b299919945ded116b51fbf0eef661b6e0727997f Mon Sep 17 00:00:00 2001
From: bixiong wang
Date: Tue, 7 Apr 2026 17:20:23 +0800
Subject: [PATCH 4/5] Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: bixiong wang
---
 docs/source/features/multi_gpu.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 3449646925f4..2425901346b9 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -133,13 +133,13 @@ during or shortly after communicator initialization.
 
 If this occurs, try disabling the NCCL shared-memory transport before launching training:
 
-.. code-block:: bash
+.. code-block:: shell
 
     export NCCL_SHM_DISABLE=1
 
 If the issue persists, additional NCCL fallbacks that may help are:
 
-.. code-block:: bash
+.. code-block:: shell
 
     export NCCL_IB_DISABLE=1
     export NCCL_ALGO=Ring

From 5f4654eb759917900978a6ead147256fa453b068 Mon Sep 17 00:00:00 2001
From: bxwang
Date: Tue, 7 Apr 2026 19:42:18 +0800
Subject: [PATCH 5/5] docs: link NCCL troubleshooting from general FAQ

Signed-off-by: bxwang
---
 docs/source/features/multi_gpu.rst   | 2 ++
 docs/source/refs/troubleshooting.rst | 8 ++++++++
 2 files changed, 10 insertions(+)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 2425901346b9..03277d26e6e1 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -124,6 +124,8 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
 
             python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
+.. _multi-gpu-nccl-troubleshooting:
+
 Troubleshooting NCCL Errors
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/docs/source/refs/troubleshooting.rst b/docs/source/refs/troubleshooting.rst
index 8f3a82f3f150..d14e75f1fe3c 100644
--- a/docs/source/refs/troubleshooting.rst
+++ b/docs/source/refs/troubleshooting.rst
@@ -9,6 +9,14 @@ Tricks and Troubleshooting
     assistance.
 
 
+Troubleshooting distributed training NCCL errors
+------------------------------------------------
+
+On some Linux multi-GPU systems, distributed training may fail with
+``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``.
+For documented NCCL workarounds, see :ref:`multi-gpu-nccl-troubleshooting`.
+
+
 Debugging physics simulation stability issues
 ---------------------------------------------
 