UnknownError when trying to profile NCCL all reduce perf test with Nsight Compute #339

@amoghdadhich

Hi, I'm trying to profile the all-reduce kernel with Nsight Compute to collect statistics on occupancy, warp states, and stalls. The larger goal is to compare these statistics when the communication is co-located with other workloads. However, Nsight Compute reports an UnknownError as soon as it starts profiling the all-reduce kernel. The output from running NCCL_DEBUG=INFO ncu -c 1 -- ./build/all_reduce_perf -b 16M -e 16M -g 8 is below:

Output debug info
==PROF== Connected to process 254671 (/software/nccl-tests/tinkerbuild/all_reduce_perf)
#  Rank  0 Group  0 Pid 254671 on       user device  0 [0000:2d:00] NVIDIA L40S
#  Rank  1 Group  0 Pid 254671 on       user device  1 [0000:3a:00] NVIDIA L40S
#  Rank  2 Group  0 Pid 254671 on       user device  2 [0000:3b:00] NVIDIA L40S
#  Rank  3 Group  0 Pid 254671 on       user device  3 [0000:3c:00] NVIDIA L40S
#  Rank  4 Group  0 Pid 254671 on       user device  4 [0000:ad:00] NVIDIA L40S
#  Rank  5 Group  0 Pid 254671 on       user device  5 [0000:ae:00] NVIDIA L40S
#  Rank  6 Group  0 Pid 254671 on       user device  6 [0000:bd:00] NVIDIA L40S
#  Rank  7 Group  0 Pid 254671 on       user device  7 [0000:be:00] NVIDIA L40S
user:254671:254671 [0] NCCL INFO Bootstrap: Using enp3s0f0:130.245.160.52<0>
user:254671:254671 [0] NCCL INFO cudaDriverVersion 12090
user:254671:254671 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
user:254671:254716 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
user:254671:254716 [7] NCCL INFO Failed to open libibverbs.so[.1]
user:254671:254716 [7] NCCL INFO NET/Socket : Using [0]enp3s0f0:130.245.160.52<0> [1]usb0:169.254.3.1<0>
user:254671:254716 [7] NCCL INFO Initialized NET plugin Socket
user:254671:254716 [7] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254716 [7] NCCL INFO Using network Socket
user:254671:254716 [7] NCCL INFO ncclCommInitAll comm 0x56007e9fc3e0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId be000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254713 [4] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254713 [4] NCCL INFO Using network Socket
user:254671:254713 [4] NCCL INFO ncclCommInitAll comm 0x56007e6bd3c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId ad000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254709 [0] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254709 [0] NCCL INFO Using network Socket
user:254671:254709 [0] NCCL INFO ncclCommInitAll comm 0x56007e267860 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 2d000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254710 [1] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254710 [1] NCCL INFO Using network Socket
user:254671:254710 [1] NCCL INFO ncclCommInitAll comm 0x56007e37dff0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3a000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254709 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
user:254671:254714 [5] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254714 [5] NCCL INFO Using network Socket
user:254671:254714 [5] NCCL INFO ncclCommInitAll comm 0x56007e7d2140 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254711 [2] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254711 [2] NCCL INFO Using network Socket
user:254671:254711 [2] NCCL INFO ncclCommInitAll comm 0x56007e492fb0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254712 [3] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254712 [3] NCCL INFO Using network Socket
user:254671:254715 [6] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254715 [6] NCCL INFO Using network Socket
user:254671:254712 [3] NCCL INFO ncclCommInitAll comm 0x56007e5a8110 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 3c000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254715 [6] NCCL INFO ncclCommInitAll comm 0x56007e8e72a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bd000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254710 [1] NCCL INFO Bootstrap timings total 0.128089 (create 0.000044, send 0.000132, recv 0.064933, ring 0.062713, delay 0.000000)
user:254671:254716 [7] NCCL INFO Bootstrap timings total 0.216511 (create 0.000071, send 0.000235, recv 0.061126, ring 0.006626, delay 0.000001)
user:254671:254712 [3] NCCL INFO Bootstrap timings total 0.061337 (create 0.000070, send 0.000207, recv 0.000481, ring 0.006602, delay 0.000000)
user:254671:254714 [5] NCCL INFO Bootstrap timings total 0.068409 (create 0.000073, send 0.000222, recv 0.061500, ring 0.006246, delay 0.000000)
user:254671:254711 [2] NCCL INFO Bootstrap timings total 0.063709 (create 0.000070, send 0.000234, recv 0.002716, ring 0.060366, delay 0.000000)
user:254671:254709 [0] NCCL INFO Bootstrap timings total 0.155804 (create 0.000070, send 0.000217, recv 0.027850, ring 0.126760, delay 0.000000)
user:254671:254715 [6] NCCL INFO Bootstrap timings total 0.060663 (create 0.000066, send 0.000209, recv 0.000468, ring 0.000265, delay 0.000000)
user:254671:254713 [4] NCCL INFO Bootstrap timings total 0.182884 (create 0.000069, send 0.000220, recv 0.114746, ring 0.060337, delay 0.000000)
user:254671:254710 [1] NCCL INFO Setting affinity for GPU 1 to 0-15,64-79
user:254671:254710 [1] NCCL INFO NVLS multicast support is not available on dev 1 (NVLS_NCHANNELS 0)
user:254671:254709 [0] NCCL INFO Setting affinity for GPU 0 to 0-15,64-79
user:254671:254709 [0] NCCL INFO NVLS multicast support is not available on dev 0 (NVLS_NCHANNELS 0)
user:254671:254711 [2] NCCL INFO Setting affinity for GPU 2 to 0-15,64-79
user:254671:254711 [2] NCCL INFO NVLS multicast support is not available on dev 2 (NVLS_NCHANNELS 0)
user:254671:254716 [7] NCCL INFO Setting affinity for GPU 7 to 32-47,96-111
user:254671:254716 [7] NCCL INFO NVLS multicast support is not available on dev 7 (NVLS_NCHANNELS 0)
user:254671:254713 [4] NCCL INFO Setting affinity for GPU 4 to 32-47,96-111
user:254671:254713 [4] NCCL INFO NVLS multicast support is not available on dev 4 (NVLS_NCHANNELS 0)
user:254671:254715 [6] NCCL INFO Setting affinity for GPU 6 to 32-47,96-111
user:254671:254715 [6] NCCL INFO NVLS multicast support is not available on dev 6 (NVLS_NCHANNELS 0)
user:254671:254714 [5] NCCL INFO Setting affinity for GPU 5 to 32-47,96-111
user:254671:254712 [3] NCCL INFO Setting affinity for GPU 3 to 0-15,64-79
user:254671:254712 [3] NCCL INFO NVLS multicast support is not available on dev 3 (NVLS_NCHANNELS 0)
user:254671:254714 [5] NCCL INFO NVLS multicast support is not available on dev 5 (NVLS_NCHANNELS 0)
user:254671:254711 [2] NCCL INFO comm 0x56007e492fb0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
user:254671:254713 [4] NCCL INFO comm 0x56007e6bd3c0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
user:254671:254711 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
user:254671:254711 [2] NCCL INFO P2P Chunksize set to 131072
user:254671:254713 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:254671:254713 [4] NCCL INFO P2P Chunksize set to 131072
user:254671:254716 [7] NCCL INFO comm 0x56007e9fc3e0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
user:254671:254710 [1] NCCL INFO comm 0x56007e37dff0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
user:254671:254710 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
user:254671:254710 [1] NCCL INFO P2P Chunksize set to 131072
user:254671:254711 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
user:254671:254715 [6] NCCL INFO comm 0x56007e8e72a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
user:254671:254714 [5] NCCL INFO comm 0x56007e7d2140 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
user:254671:254714 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:254671:254714 [5] NCCL INFO P2P Chunksize set to 131072
user:254671:254709 [0] NCCL INFO comm 0x56007e267860 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
user:254671:254709 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
user:254671:254709 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
user:254671:254709 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
user:254671:254709 [0] NCCL INFO P2P Chunksize set to 131072
user:254671:254709 [0] NCCL INFO Check P2P Type isAllDirectP2p 0 directMode 1
user:254671:254716 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
user:254671:254716 [7] NCCL INFO P2P Chunksize set to 131072
user:254671:254715 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:254671:254715 [6] NCCL INFO P2P Chunksize set to 131072
user:254671:254713 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
user:254671:254712 [3] NCCL INFO comm 0x56007e5a8110 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
user:254671:254718 [2] NCCL INFO [Proxy Service] Device 2 CPU core 1
user:254671:254712 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:254671:254712 [3] NCCL INFO P2P Chunksize set to 131072
user:254671:254719 [5] NCCL INFO [Proxy Service] Device 5 CPU core 97
user:254671:254720 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 68
user:254671:254724 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 43
user:254671:254723 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 70
user:254671:254710 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
user:254671:254722 [0] NCCL INFO [Proxy Service] Device 0 CPU core 69
user:254671:254721 [6] NCCL INFO [Proxy Service] Device 6 CPU core 106
user:254671:254732 [1] NCCL INFO [Proxy Service] Device 1 CPU core 73
user:254671:254725 [4] NCCL INFO [Proxy Service] Device 4 CPU core 35
user:254671:254728 [7] NCCL INFO [Proxy Service] Device 7 CPU core 102
user:254671:254730 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 8
user:254671:254731 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 103
user:254671:254726 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 36
user:254671:254729 [3] NCCL INFO [Proxy Service] Device 3 CPU core 71
user:254671:254727 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 101
user:254671:254733 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 10
user:254671:254713 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254713 [4] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254712 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254712 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254714 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254714 [5] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254711 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254711 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254710 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254710 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254715 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254715 [6] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254709 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254709 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254716 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254716 [7] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer  
user:254671:254709 [0] NCCL INFO CC Off, workFifoBytes 1048576
user:254671:254714 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
user:254671:254714 [5] NCCL INFO ncclCommInitAll comm 0x56007e7d2140 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254714 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 8 total 1.58 (kernels 1.31, alloc 0.00, bootstrap 0.07, allgathers 0.01, topo 0.09, graphs 0.05, connections 0.02, rest 0.02)
user:254671:254710 [1] NCCL INFO ncclCommInitAll comm 0x56007e37dff0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3a000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254710 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 8 total 1.59 (kernels 1.25, alloc 0.00, bootstrap 0.13, allgathers 0.05, topo 0.07, graphs 0.04, connections 0.02, rest 0.02)
user:254671:254716 [7] NCCL INFO ncclCommInitAll comm 0x56007e9fc3e0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId be000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254711 [2] NCCL INFO ncclCommInitAll comm 0x56007e492fb0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254711 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 8 total 1.59 (kernels 1.32, alloc 0.00, bootstrap 0.06, allgathers 0.01, topo 0.08, graphs 0.07, connections 0.03, rest 0.01)
user:254671:254715 [6] NCCL INFO ncclCommInitAll comm 0x56007e8e72a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bd000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254715 [6] NCCL INFO Init timings - ncclCommInitAll: rank 6 nranks 8 total 1.58 (kernels 1.32, alloc 0.00, bootstrap 0.06, allgathers 0.02, topo 0.09, graphs 0.05, connections 0.03, rest 0.00)
user:254671:254716 [7] NCCL INFO Init timings - ncclCommInitAll: rank 7 nranks 8 total 1.58 (kernels 1.15, alloc 0.01, bootstrap 0.22, allgathers 0.04, topo 0.09, graphs 0.04, connections 0.03, rest 0.01)
user:254671:254709 [0] NCCL INFO ncclCommInitAll comm 0x56007e267860 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 2d000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254713 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 8 total 1.59 (kernels 1.20, alloc 0.00, bootstrap 0.18, allgathers 0.02, topo 0.09, graphs 0.06, connections 0.02, rest 0.02)
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
user:254671:254739 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
user:254671:254739 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
user:254671:254738 [3] NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct
user:254671:254734 [7] NCCL INFO Channel 00 : 7[7] -> 0[0] via SHM/direct/direct
user:254671:254736 [5] NCCL INFO Channel 00 : 5[5] -> 6[6] via SHM/direct/direct
user:254671:254738 [3] NCCL INFO Channel 01 : 3[3] -> 4[4] via SHM/direct/direct
user:254671:254734 [7] NCCL INFO Channel 01 : 7[7] -> 0[0] via SHM/direct/direct
user:254671:254736 [5] NCCL INFO Channel 01 : 5[5] -> 6[6] via SHM/direct/direct
user:254671:254740 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
user:254671:254740 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
user:254671:254737 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
user:254671:254737 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
user:254671:254735 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/direct pointer
user:254671:254735 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/direct pointer
user:254671:254741 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
user:254671:254741 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
user:254671:254740 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254735 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254734 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254741 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254738 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254736 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254737 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254739 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
==PROF== Profiling "ncclDevKernel_AllReduce_Sum_f..." - 0 (1/1): 0%
==ERROR== UnknownError
==ERROR== Failed to profile "ncclDevKernel_AllReduce_Sum_f..." in process 254671
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
Nothing particularly stands out to me as problematic in the logs until the error is thrown. The all_reduce_perf test runs as expected when not being profiled with Nsight Compute.
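
As a reading aid for the log above, here is a minimal sketch that tallies which transport each ring hop uses. It assumes the connection lines keep the format shown in this log ("Channel NN[/M] : src[dev] -> dst[dev] via TRANSPORT/..."); the function name `summarize_hops` and the regular expression are illustrative and not part of NCCL.

```python
import re
from collections import Counter

# Matches NCCL INFO connection lines such as:
#   "... NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer"
#   "... NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct"
# The ring-order lines ("Channel 00/02 : 0 1 2 ...") do not match because
# they have no "rank[dev]" pair after the colon.
HOP_RE = re.compile(
    r"NCCL INFO Channel (\d+)(?:/\d+)? : (\d+)\[\d+\] -> (\d+)\[\d+\] via (\S+)"
)


def summarize_hops(log_text):
    """Return {(src_rank, dst_rank): transport} for each connection line."""
    hops = {}
    for line in log_text.splitlines():
        match = HOP_RE.search(line)
        if match:
            _channel, src, dst, transport = match.groups()
            # Keep only the leading transport name, e.g. "P2P" or "SHM".
            hops[(int(src), int(dst))] = transport.split("/")[0]
    return hops


# A few lines copied from the log in this issue.
sample = """\
user:254671:254709 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
user:254671:254739 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
user:254671:254738 [3] NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct
user:254671:254734 [7] NCCL INFO Channel 00 : 7[7] -> 0[0] via SHM/direct/direct
user:254671:254740 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
"""

hops = summarize_hops(sample)
counts = Counter(hops.values())
for (src, dst), transport in sorted(hops.items()):
    print(f"{src} -> {dst}: {transport}")
print(dict(counts))
```

Applied to the full log, this shows the hops 1->2, 2->3, 4->5, and 6->7 using P2P and the hops 0->1, 3->4, 5->6, and 7->0 using SHM. Whether that split is related to the profiling failure is not established here.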

Setup details:

  • CUDA version = 12.9
  • NCCL version 2.27.5+cuda12.9
  • CUDA driver version = 575.57.08
  • GPUs = 8 NVIDIA L40S GPUs on a single node (GPUs are in default mode, not running MPS)
  • OS Ubuntu 22.04
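
To narrow down where the failure starts, one option is to rerun the same command while varying the GPU count and the Nsight Compute replay mode. The sketch below only generates the command lines; it does not run them. The GPU counts and the choice of `--replay-mode` values (`kernel`, which is the ncu default, and `application`) are assumptions about what might be worth trying, not a confirmed fix, and the helper name `ncu_variants` is hypothetical.

```python
import shlex


def ncu_variants(binary="./build/all_reduce_perf", size="16M",
                 gpu_counts=(1, 2, 8),
                 replay_modes=("kernel", "application")):
    """Build one command line per (GPU count, replay mode) combination.

    The base command mirrors the one in this issue; only -g and
    --replay-mode vary. The combinations are illustrative assumptions.
    """
    commands = []
    for gpus in gpu_counts:
        for mode in replay_modes:
            argv = [
                "ncu", "--replay-mode", mode, "-c", "1", "--",
                binary, "-b", size, "-e", size, "-g", str(gpus),
            ]
            # NCCL_DEBUG=INFO is set as an environment prefix, as in the issue.
            commands.append("NCCL_DEBUG=INFO " + shlex.join(argv))
    return commands


for command in ncu_variants():
    print(command)
```

If the single-GPU run profiles cleanly and the multi-GPU runs do not, that would point toward the interaction between kernel replay and kernels that wait on peer GPUs, but that is a hypothesis to test rather than a conclusion.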
