Some benchmarks fail with PXN enabled #340

@MeisterEule

Description

I have observed that some benchmarks, e.g. alltoall, fail when launched on more than one node. NCCL reports the error: internal error - please report this issue to the NCCL developers.
The failure does not occur when I set NCCL_PXN_DISABLE=1. I found this quickly thanks to the NCCL debug message: transport/net.cc:514 NCCL WARN PXN should not use host buffers for data.

My question is whether nccl-tests behaves as intended in these cases. I also found remarks on the compatibility of NCCL buffers and PXN here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html?utm_source=chatgpt.com#buffer-registration-and-pxn.

If the two are in fact incompatible, would it be possible to detect this when the program is launched and print a useful error message?
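
As a rough illustration of the requested check, here is a minimal C++ sketch. It is not taken from nccl-tests or NCCL. The function names, the `nNodes` and `usesHostBuffers` parameters, and the message text are all hypothetical, and the triggering condition (more than one node, host buffers in use, PXN not disabled) is only an assumption based on the warning text quoted above. The only real identifier used is the NCCL_PXN_DISABLE environment variable.

```cpp
#include <cstdlib>
#include <string>

// Returns true if NCCL_PXN_DISABLE is set to a non-zero integer.
// An unset or empty variable is treated as "PXN not disabled".
bool pxnDisabledByEnv() {
  const char* value = std::getenv("NCCL_PXN_DISABLE");
  if (value == nullptr || *value == '\0') return false;
  return std::strtol(value, nullptr, 10) != 0;
}

// Hypothetical launch-time check (an assumption, not existing nccl-tests code).
// nNodes:          number of nodes in the job, assumed known to the caller.
// usesHostBuffers: whether the benchmark would hand host buffers to NCCL,
//                  assumed known to the caller.
// pxnDisabled:     typically the result of pxnDisabledByEnv().
// Returns an empty string when no problem is expected, otherwise a hint.
std::string pxnCompatWarning(int nNodes, bool usesHostBuffers, bool pxnDisabled) {
  if (nNodes <= 1 || !usesHostBuffers || pxnDisabled) return "";
  return "Multi-node run with host buffers and PXN enabled: "
         "this combination may fail inside NCCL. "
         "Consider setting NCCL_PXN_DISABLE=1.";
}
```

A caller could evaluate `pxnCompatWarning(nNodes, usesHostBuffers, pxnDisabledByEnv())` once before creating the communicator and print the returned string if it is non-empty. Whether the real incompatibility matches this condition would need confirmation from the NCCL developers.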
