From 151f2445610ae90377c6744314210a65336415a3 Mon Sep 17 00:00:00 2001
From: bxwang <bixiong.wang@x-humanoid.com>
Date: Tue, 7 Apr 2026 15:56:55 +0800
Subject: [PATCH 1/5] docs: add NCCL troubleshooting notes for multi-GPU
 training

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
---
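Note on applying the workaround added in this patch: one possible way to limit
it to a single launch is to pass the variables as a per-command prefix instead
of exporting them. The sketch below assumes a POSIX shell and uses `sh -c` as a
hypothetical stand-in for the real training command. `NCCL_DEBUG=INFO` is
NCCL's own logging switch, included here only so the NCCL log can show which
transports are selected; it is not part of this patch.

```shell
# Per-command prefix: the variables exist only in the launched process.
# `sh -c ...` is a hypothetical stand-in for the real training launch command.
child_env=$(NCCL_DEBUG=INFO NCCL_SHM_DISABLE=1 sh -c 'echo "$NCCL_DEBUG $NCCL_SHM_DISABLE"')
echo "launched process saw: $child_env"

# The calling shell is not modified by the prefix form, so later launches
# fall back to NCCL defaults unless the variable was exported earlier.
echo "calling shell sees NCCL_SHM_DISABLE=${NCCL_SHM_DISABLE:-unset}"
```

With this form the workaround does not persist into later runs in the same
terminal, which may make it easier to compare behavior with and without it.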
 docs/source/features/multi_gpu.rst | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 2537e5eff25b..157fb26ead8d 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -124,6 +124,35 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
 
                     python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
+Troubleshooting NCCL Errors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+On some Linux multi-GPU systems, distributed training may fail with
+``CUDA error: an illegal memory access was encountered`` reported by
+``ProcessGroupNCCL`` during or shortly after communicator initialization.
+
+If this occurs, try disabling the NCCL shared-memory transport before
+launching training:
+
+.. code-block:: bash
+
+    export NCCL_SHM_DISABLE=1
+
+If the issue persists, additional NCCL fallbacks that may help are:
+
+.. code-block:: bash
+
+    export NCCL_IB_DISABLE=1
+    export NCCL_ALGO=Ring
+
+Then relaunch the distributed training command as usual.
+
+.. note::
+
+    These variables are NCCL-level workarounds intended for affected systems.
+    They are not required on all machines, and may change communication
+    behavior or performance depending on the hardware topology.
+
 Multi-Node Training
 -------------------
 

From fce1e51ee1c63b66c3f959c878fa56ceb1b82cab Mon Sep 17 00:00:00 2001
From: bxwang <bixiong.wang@x-humanoid.com>
Date: Tue, 7 Apr 2026 16:05:17 +0800
Subject: [PATCH 2/5] docs: reflow NCCL troubleshooting notes

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
---
 docs/source/features/multi_gpu.rst | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 157fb26ead8d..59e543a60449 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -128,11 +128,10 @@ Troubleshooting NCCL Errors
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 On some Linux multi-GPU systems, distributed training may fail with
-``CUDA error: an illegal memory access was encountered`` reported by
-``ProcessGroupNCCL`` during or shortly after communicator initialization.
+``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``
+during or shortly after communicator initialization.
 
-If this occurs, try disabling the NCCL shared-memory transport before
-launching training:
+If this occurs, try disabling the NCCL shared-memory transport before launching training:
 
 .. code-block:: bash
 
@@ -149,9 +148,9 @@ Then relaunch the distributed training command as usual.
 
 .. note::
 
-    These variables are NCCL-level workarounds intended for affected systems.
-    They are not required on all machines, and may change communication
-    behavior or performance depending on the hardware topology.
+    These variables are NCCL-level workarounds intended for affected systems. They are not
+    required on all machines, and may change communication behavior or performance depending
+    on the hardware topology.
 
 Multi-Node Training
 -------------------

From d67f63df2e448ef4163a81be0df3a8af866c3787 Mon Sep 17 00:00:00 2001
From: bixiong wang <wangbx02@126.com>
Date: Tue, 7 Apr 2026 17:20:14 +0800
Subject: [PATCH 3/5] Apply suggestion from @Copilot

Lengthen the RST underline of the "Troubleshooting NCCL Errors" heading
by one character.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
---
 docs/source/features/multi_gpu.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 59e543a60449..3449646925f4 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -125,7 +125,7 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
                     python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
 Troubleshooting NCCL Errors
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 On some Linux multi-GPU systems, distributed training may fail with
 ``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``

From b299919945ded116b51fbf0eef661b6e0727997f Mon Sep 17 00:00:00 2001
From: bixiong wang <wangbx02@126.com>
Date: Tue, 7 Apr 2026 17:20:23 +0800
Subject: [PATCH 4/5] Apply suggestion from @Copilot

Change the code-block language of the two NCCL snippets from bash to
shell.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
---
 docs/source/features/multi_gpu.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 3449646925f4..2425901346b9 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -133,13 +133,13 @@ during or shortly after communicator initialization.
 
 If this occurs, try disabling the NCCL shared-memory transport before launching training:
 
-.. code-block:: bash
+.. code-block:: shell
 
     export NCCL_SHM_DISABLE=1
 
 If the issue persists, additional NCCL fallbacks that may help are:
 
-.. code-block:: bash
+.. code-block:: shell
 
     export NCCL_IB_DISABLE=1
     export NCCL_ALGO=Ring

From 5f4654eb759917900978a6ead147256fa453b068 Mon Sep 17 00:00:00 2001
From: bxwang <bixiong.wang@x-humanoid.com>
Date: Tue, 7 Apr 2026 19:42:18 +0800
Subject: [PATCH 5/5] docs: link NCCL troubleshooting from general FAQ

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
---
 docs/source/features/multi_gpu.rst   | 2 ++
 docs/source/refs/troubleshooting.rst | 8 ++++++++
 2 files changed, 10 insertions(+)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 2425901346b9..03277d26e6e1 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -124,6 +124,8 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node
 
                     python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
 
+.. _multi-gpu-nccl-troubleshooting:
+
 Troubleshooting NCCL Errors
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/docs/source/refs/troubleshooting.rst b/docs/source/refs/troubleshooting.rst
index 8f3a82f3f150..d14e75f1fe3c 100644
--- a/docs/source/refs/troubleshooting.rst
+++ b/docs/source/refs/troubleshooting.rst
@@ -9,6 +9,14 @@ Tricks and Troubleshooting
     assistance.
 
 
+Troubleshooting NCCL errors in distributed training
+---------------------------------------------------
+
+On some Linux multi-GPU systems, distributed training may fail with
+``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``.
+For documented NCCL workarounds, see :ref:`multi-gpu-nccl-troubleshooting`.
+
+
 Debugging physics simulation stability issues
 ---------------------------------------------