Description
I have observed that some of the benchmarks, e.g. alltoall, fail when launched on more than one node. The error message comes from NCCL: "internal error - please report this issue to the NCCL developers".
The issue does not occur when I set NCCL_PXN_DISABLE=1. I was able to track this down quickly thanks to the NCCL debug message "transport/net.cc:514 NCCL WARN PXN should not use host buffers for data".
My question is whether nccl-tests behaves as intended in these cases. I also found remarks on the compatibility of NCCL buffers and PXN here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html?utm_source=chatgpt.com#buffer-registration-and-pxn.
If it is in fact incompatible, would it be possible to detect this incompatibility when the program is launched and deliver a useful error message?
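To illustrate what I have in mind, here is a rough sketch of such a launch-time guard. It is only an assumption about how this could look: the use_host_buffers flag and the check_pxn_host_buffer_conflict helper are hypothetical names, and nccl-tests would have to derive the host-buffer condition from however it actually allocates its data buffers. NCCL_PXN_DISABLE is the documented NCCL environment variable I used as a workaround.

```c
/* Hedged sketch of a pre-flight check that could run before communicator
 * setup. The use_host_buffers flag is hypothetical; the harness would have
 * to set it based on its own buffer allocation mode. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the user has explicitly disabled PXN via the environment. */
static int pxn_disabled(void) {
  const char *v = getenv("NCCL_PXN_DISABLE");
  return v != NULL && strcmp(v, "1") == 0;
}

/* Returns 0 if the configuration looks usable, non-zero otherwise. */
static int check_pxn_host_buffer_conflict(int use_host_buffers) {
  if (use_host_buffers && !pxn_disabled()) {
    fprintf(stderr,
            "Error: host data buffers are not compatible with NCCL PXN.\n"
            "Either use device buffers or relaunch with NCCL_PXN_DISABLE=1.\n");
    return 1;
  }
  return 0;
}

int main(void) {
  /* Example: pretend the benchmark was configured to use host buffers. */
  if (check_pxn_host_buffer_conflict(/*use_host_buffers=*/1) != 0)
    return 1;
  printf("Configuration OK, proceeding with benchmark setup.\n");
  return 0;
}
```

A check along these lines would turn the opaque "internal error" into an actionable message before any NCCL calls are made.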