Description
I have observed that some of the benchmarks, e.g. alltoall, fail when launched on more than one node. The error message comes from NCCL: "internal error - please report this issue to the NCCL developers".
The issue does not occur when I set NCCL_PXN_DISABLE=1. I was able to track this down quickly thanks to the NCCL debug message "transport/net.cc:514 NCCL WARN PXN should not use host buffers for data".
My question is whether nccl-tests behaves as intended in these cases. I also found remarks on the compatibility of NCCL buffers and PXN here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html?utm_source=chatgpt.com#buffer-registration-and-pxn.
If it is in fact incompatible, would it be possible to detect this incompatibility when the program is launched and deliver a useful error message?
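To illustrate what I have in mind, here is a rough sketch of such a launch-time guard. It is only an assumption about how this could look: the use_host_buffers flag and the check_pxn_host_buffer_conflict helper are hypothetical names, and nccl-tests would have to derive the host-buffer condition from however it actually allocates its data buffers. NCCL_PXN_DISABLE is the documented NCCL environment variable I used as a workaround.

```c
/* Hedged sketch of a pre-flight check that could run before communicator
 * setup. The use_host_buffers flag is hypothetical; the harness would have
 * to set it based on its own buffer allocation mode. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the user has explicitly disabled PXN via the environment. */
static int pxn_disabled(void) {
  const char *v = getenv("NCCL_PXN_DISABLE");
  return v != NULL && strcmp(v, "1") == 0;
}

/* Returns 0 if the configuration looks usable, non-zero otherwise. */
static int check_pxn_host_buffer_conflict(int use_host_buffers) {
  if (use_host_buffers && !pxn_disabled()) {
    fprintf(stderr,
            "Error: host data buffers are not compatible with NCCL PXN.\n"
            "Either use device buffers or relaunch with NCCL_PXN_DISABLE=1.\n");
    return 1;
  }
  return 0;
}

int main(void) {
  /* Example: pretend the benchmark was configured to use host buffers. */
  if (check_pxn_host_buffer_conflict(/*use_host_buffers=*/1) != 0)
    return 1;
  printf("Configuration OK, proceeding with benchmark setup.\n");
  return 0;
}
```

A check along these lines would turn the opaque "internal error" into an actionable message before any NCCL calls are made.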