Description
Hi, I'm trying to profile the allreduce kernel with Nsight Compute to collect statistics on occupancy, warp states, and stalls. The larger goal is to compare these statistics when the communication is co-located with other workloads. However, Nsight Compute fails with an unhandled error (UnknownError) when profiling the allreduce test. Running NCCL_DEBUG=INFO ncu -c 1 -- ./build/all_reduce_perf -b 16M -e 16M -g 8 produces the output under "Output debug info" below.
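As a side note on metric selection: the three metric groups named above correspond, as far as I can tell, to the Nsight Compute sections Occupancy, WarpStateStats (per-warp stall reasons), and SchedulerStats. Below is a minimal sketch of a helper that builds an ncu command line restricted to those sections and to kernels matching the allreduce name. The helper name `build_ncu_command` and the default regex are hypothetical, and the section identifiers are assumed to match what `ncu --list-sections` reports on the installed version. It only assembles the command and is not a confirmed fix for the error reported here.

```python
import shlex

# Section identifiers as listed by `ncu --list-sections`.
# Assumption: these names are available in the installed Nsight Compute version.
SECTIONS = ["Occupancy", "WarpStateStats", "SchedulerStats"]


def build_ncu_command(app_args, kernel_regex="ncclDevKernel_AllReduce", launch_count=1):
    """Return an argv list that profiles only the chosen sections of matching kernels.

    `kernel_regex` and this helper are illustrative; adjust them to the kernel
    name that ncu actually reports for your NCCL build.
    """
    cmd = ["ncu", "-c", str(launch_count), "-k", f"regex:{kernel_regex}"]
    for section in SECTIONS:
        cmd += ["--section", section]
    # Everything after "--" is the target application and its arguments.
    return cmd + ["--"] + list(app_args)


if __name__ == "__main__":
    argv = build_ncu_command(
        ["./build/all_reduce_perf", "-b", "16M", "-e", "16M", "-g", "8"]
    )
    # Print a copy-pasteable shell command with the same environment variable as above.
    print("NCCL_DEBUG=INFO " + shlex.join(argv))
```

Collecting fewer sections reduces the number of metrics ncu gathers per kernel launch, which makes it easier to tell whether a failure depends on the metric set.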
Output debug info
==PROF== Connected to process 254671 (/software/nccl-tests/tinkerbuild/all_reduce_perf)
# Rank 0 Group 0 Pid 254671 on user device 0 [0000:2d:00] NVIDIA L40S
# Rank 1 Group 0 Pid 254671 on user device 1 [0000:3a:00] NVIDIA L40S
# Rank 2 Group 0 Pid 254671 on user device 2 [0000:3b:00] NVIDIA L40S
# Rank 3 Group 0 Pid 254671 on user device 3 [0000:3c:00] NVIDIA L40S
# Rank 4 Group 0 Pid 254671 on user device 4 [0000:ad:00] NVIDIA L40S
# Rank 5 Group 0 Pid 254671 on user device 5 [0000:ae:00] NVIDIA L40S
# Rank 6 Group 0 Pid 254671 on user device 6 [0000:bd:00] NVIDIA L40S
# Rank 7 Group 0 Pid 254671 on user device 7 [0000:be:00] NVIDIA L40S
user:254671:254671 [0] NCCL INFO Bootstrap: Using enp3s0f0:130.245.160.52<0>
user:254671:254671 [0] NCCL INFO cudaDriverVersion 12090
user:254671:254671 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
user:254671:254716 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
user:254671:254716 [7] NCCL INFO Failed to open libibverbs.so[.1]
user:254671:254716 [7] NCCL INFO NET/Socket : Using [0]enp3s0f0:130.245.160.52<0> [1]usb0:169.254.3.1<0>
user:254671:254716 [7] NCCL INFO Initialized NET plugin Socket
user:254671:254716 [7] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254716 [7] NCCL INFO Using network Socket
user:254671:254716 [7] NCCL INFO ncclCommInitAll comm 0x56007e9fc3e0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId be000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254713 [4] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254713 [4] NCCL INFO Using network Socket
user:254671:254713 [4] NCCL INFO ncclCommInitAll comm 0x56007e6bd3c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId ad000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254709 [0] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254709 [0] NCCL INFO Using network Socket
user:254671:254709 [0] NCCL INFO ncclCommInitAll comm 0x56007e267860 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 2d000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254710 [1] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254710 [1] NCCL INFO Using network Socket
user:254671:254710 [1] NCCL INFO ncclCommInitAll comm 0x56007e37dff0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3a000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254709 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
user:254671:254714 [5] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254714 [5] NCCL INFO Using network Socket
user:254671:254714 [5] NCCL INFO ncclCommInitAll comm 0x56007e7d2140 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254711 [2] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254711 [2] NCCL INFO Using network Socket
user:254671:254711 [2] NCCL INFO ncclCommInitAll comm 0x56007e492fb0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254712 [3] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254712 [3] NCCL INFO Using network Socket
user:254671:254715 [6] NCCL INFO Assigned NET plugin Socket to comm
user:254671:254715 [6] NCCL INFO Using network Socket
user:254671:254712 [3] NCCL INFO ncclCommInitAll comm 0x56007e5a8110 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 3c000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254715 [6] NCCL INFO ncclCommInitAll comm 0x56007e8e72a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bd000 commId 0x2fa057d66cc99ae5 - Init START
user:254671:254710 [1] NCCL INFO Bootstrap timings total 0.128089 (create 0.000044, send 0.000132, recv 0.064933, ring 0.062713, delay 0.000000)
user:254671:254716 [7] NCCL INFO Bootstrap timings total 0.216511 (create 0.000071, send 0.000235, recv 0.061126, ring 0.006626, delay 0.000001)
user:254671:254712 [3] NCCL INFO Bootstrap timings total 0.061337 (create 0.000070, send 0.000207, recv 0.000481, ring 0.006602, delay 0.000000)
user:254671:254714 [5] NCCL INFO Bootstrap timings total 0.068409 (create 0.000073, send 0.000222, recv 0.061500, ring 0.006246, delay 0.000000)
user:254671:254711 [2] NCCL INFO Bootstrap timings total 0.063709 (create 0.000070, send 0.000234, recv 0.002716, ring 0.060366, delay 0.000000)
user:254671:254709 [0] NCCL INFO Bootstrap timings total 0.155804 (create 0.000070, send 0.000217, recv 0.027850, ring 0.126760, delay 0.000000)
user:254671:254715 [6] NCCL INFO Bootstrap timings total 0.060663 (create 0.000066, send 0.000209, recv 0.000468, ring 0.000265, delay 0.000000)
user:254671:254713 [4] NCCL INFO Bootstrap timings total 0.182884 (create 0.000069, send 0.000220, recv 0.114746, ring 0.060337, delay 0.000000)
user:254671:254710 [1] NCCL INFO Setting affinity for GPU 1 to 0-15,64-79
user:254671:254710 [1] NCCL INFO NVLS multicast support is not available on dev 1 (NVLS_NCHANNELS 0)
user:254671:254709 [0] NCCL INFO Setting affinity for GPU 0 to 0-15,64-79
user:254671:254709 [0] NCCL INFO NVLS multicast support is not available on dev 0 (NVLS_NCHANNELS 0)
user:254671:254711 [2] NCCL INFO Setting affinity for GPU 2 to 0-15,64-79
user:254671:254711 [2] NCCL INFO NVLS multicast support is not available on dev 2 (NVLS_NCHANNELS 0)
user:254671:254716 [7] NCCL INFO Setting affinity for GPU 7 to 32-47,96-111
user:254671:254716 [7] NCCL INFO NVLS multicast support is not available on dev 7 (NVLS_NCHANNELS 0)
user:254671:254713 [4] NCCL INFO Setting affinity for GPU 4 to 32-47,96-111
user:254671:254713 [4] NCCL INFO NVLS multicast support is not available on dev 4 (NVLS_NCHANNELS 0)
user:254671:254715 [6] NCCL INFO Setting affinity for GPU 6 to 32-47,96-111
user:254671:254715 [6] NCCL INFO NVLS multicast support is not available on dev 6 (NVLS_NCHANNELS 0)
user:254671:254714 [5] NCCL INFO Setting affinity for GPU 5 to 32-47,96-111
user:254671:254712 [3] NCCL INFO Setting affinity for GPU 3 to 0-15,64-79
user:254671:254712 [3] NCCL INFO NVLS multicast support is not available on dev 3 (NVLS_NCHANNELS 0)
user:254671:254714 [5] NCCL INFO NVLS multicast support is not available on dev 5 (NVLS_NCHANNELS 0)
user:254671:254711 [2] NCCL INFO comm 0x56007e492fb0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
user:254671:254713 [4] NCCL INFO comm 0x56007e6bd3c0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
user:254671:254711 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
user:254671:254711 [2] NCCL INFO P2P Chunksize set to 131072
user:254671:254713 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:254671:254713 [4] NCCL INFO P2P Chunksize set to 131072
user:254671:254716 [7] NCCL INFO comm 0x56007e9fc3e0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
user:254671:254710 [1] NCCL INFO comm 0x56007e37dff0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
user:254671:254710 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
user:254671:254710 [1] NCCL INFO P2P Chunksize set to 131072
user:254671:254711 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
user:254671:254715 [6] NCCL INFO comm 0x56007e8e72a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
user:254671:254714 [5] NCCL INFO comm 0x56007e7d2140 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
user:254671:254714 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:254671:254714 [5] NCCL INFO P2P Chunksize set to 131072
user:254671:254709 [0] NCCL INFO comm 0x56007e267860 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
user:254671:254709 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
user:254671:254709 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
user:254671:254709 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
user:254671:254709 [0] NCCL INFO P2P Chunksize set to 131072
user:254671:254709 [0] NCCL INFO Check P2P Type isAllDirectP2p 0 directMode 1
user:254671:254716 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
user:254671:254716 [7] NCCL INFO P2P Chunksize set to 131072
user:254671:254715 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:254671:254715 [6] NCCL INFO P2P Chunksize set to 131072
user:254671:254713 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
user:254671:254712 [3] NCCL INFO comm 0x56007e5a8110 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
user:254671:254718 [2] NCCL INFO [Proxy Service] Device 2 CPU core 1
user:254671:254712 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:254671:254712 [3] NCCL INFO P2P Chunksize set to 131072
user:254671:254719 [5] NCCL INFO [Proxy Service] Device 5 CPU core 97
user:254671:254720 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 68
user:254671:254724 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 43
user:254671:254723 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 70
user:254671:254710 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
user:254671:254722 [0] NCCL INFO [Proxy Service] Device 0 CPU core 69
user:254671:254721 [6] NCCL INFO [Proxy Service] Device 6 CPU core 106
user:254671:254732 [1] NCCL INFO [Proxy Service] Device 1 CPU core 73
user:254671:254725 [4] NCCL INFO [Proxy Service] Device 4 CPU core 35
user:254671:254728 [7] NCCL INFO [Proxy Service] Device 7 CPU core 102
user:254671:254730 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 8
user:254671:254731 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 103
user:254671:254726 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 36
user:254671:254729 [3] NCCL INFO [Proxy Service] Device 3 CPU core 71
user:254671:254727 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 101
user:254671:254733 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 10
user:254671:254713 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254713 [4] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254712 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254712 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254714 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254714 [5] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254711 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254711 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254710 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254710 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254715 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254715 [6] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254709 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254709 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254716 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:254671:254716 [7] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:254671:254709 [0] NCCL INFO CC Off, workFifoBytes 1048576
user:254671:254714 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
user:254671:254714 [5] NCCL INFO ncclCommInitAll comm 0x56007e7d2140 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254714 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 8 total 1.58 (kernels 1.31, alloc 0.00, bootstrap 0.07, allgathers 0.01, topo 0.09, graphs 0.05, connections 0.02, rest 0.02)
user:254671:254710 [1] NCCL INFO ncclCommInitAll comm 0x56007e37dff0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3a000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254710 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 8 total 1.59 (kernels 1.25, alloc 0.00, bootstrap 0.13, allgathers 0.05, topo 0.07, graphs 0.04, connections 0.02, rest 0.02)
user:254671:254716 [7] NCCL INFO ncclCommInitAll comm 0x56007e9fc3e0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId be000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254711 [2] NCCL INFO ncclCommInitAll comm 0x56007e492fb0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254711 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 8 total 1.59 (kernels 1.32, alloc 0.00, bootstrap 0.06, allgathers 0.01, topo 0.08, graphs 0.07, connections 0.03, rest 0.01)
user:254671:254715 [6] NCCL INFO ncclCommInitAll comm 0x56007e8e72a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bd000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254715 [6] NCCL INFO Init timings - ncclCommInitAll: rank 6 nranks 8 total 1.58 (kernels 1.32, alloc 0.00, bootstrap 0.06, allgathers 0.02, topo 0.09, graphs 0.05, connections 0.03, rest 0.00)
user:254671:254716 [7] NCCL INFO Init timings - ncclCommInitAll: rank 7 nranks 8 total 1.58 (kernels 1.15, alloc 0.01, bootstrap 0.22, allgathers 0.04, topo 0.09, graphs 0.04, connections 0.03, rest 0.01)
user:254671:254709 [0] NCCL INFO ncclCommInitAll comm 0x56007e267860 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 2d000 commId 0x2fa057d66cc99ae5 - Init COMPLETE
user:254671:254713 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 8 total 1.59 (kernels 1.20, alloc 0.00, bootstrap 0.18, allgathers 0.02, topo 0.09, graphs 0.06, connections 0.02, rest 0.02)
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
user:254671:254739 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
user:254671:254739 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
user:254671:254738 [3] NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct
user:254671:254734 [7] NCCL INFO Channel 00 : 7[7] -> 0[0] via SHM/direct/direct
user:254671:254736 [5] NCCL INFO Channel 00 : 5[5] -> 6[6] via SHM/direct/direct
user:254671:254738 [3] NCCL INFO Channel 01 : 3[3] -> 4[4] via SHM/direct/direct
user:254671:254734 [7] NCCL INFO Channel 01 : 7[7] -> 0[0] via SHM/direct/direct
user:254671:254736 [5] NCCL INFO Channel 01 : 5[5] -> 6[6] via SHM/direct/direct
user:254671:254740 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
user:254671:254740 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
user:254671:254737 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
user:254671:254737 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
user:254671:254735 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/direct pointer
user:254671:254735 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/direct pointer
user:254671:254741 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
user:254671:254741 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
user:254671:254740 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254735 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254734 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254741 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254738 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254736 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254737 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
user:254671:254739 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
==PROF== Profiling "ncclDevKernel_AllReduce_Sum_f..." - 0 (1/1): 0%
==ERROR== UnknownError
==ERROR== Failed to profile "ncclDevKernel_AllReduce_Sum_f..." in process 254671
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

Setup details:
- CUDA version = 12.9
- NCCL version = 2.27.5+cuda12.9
- CUDA driver version = 575.57.08
- GPUs = 8 NVIDIA L40S GPUs on a single node (GPUs are in default mode, not running MPS)
- OS = Ubuntu 22.04
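
To make the topology in the log easier to read, here is a minimal sketch that extracts the transport used for each ring hop from the "Channel ... via ..." lines. It assumes the NCCL INFO line format shown in the output above, which is not a stable interface and may differ across NCCL versions. The function name `ring_transports` is hypothetical.

```python
import re
from collections import defaultdict

# Matches per-hop connection lines such as:
#   "... NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer"
#   "... NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct"
# Ring-order lines like "Channel 00/02 : 0 1 2 3 ..." do not match.
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel (\d+)(?:/\d+)? : (\d+)\[\d+\] -> (\d+)\[\d+\] via (\w+)"
)


def ring_transports(log_text):
    """Map channel id -> {(src_rank, dst_rank): transport name}."""
    transports = defaultdict(dict)
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            channel, src, dst, transport = match.groups()
            transports[int(channel)][(int(src), int(dst))] = transport
    return transports


if __name__ == "__main__":
    sample = (
        "user:254671:254739 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer\n"
        "user:254671:254738 [3] NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct\n"
        "user:254671:254709 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7\n"
    )
    for channel, hops in sorted(ring_transports(sample).items()):
        for (src, dst), transport in sorted(hops.items()):
            print(f"channel {channel}: {src} -> {dst} via {transport}")
```

In the log above, both channels use P2P for the hops 1 -> 2, 2 -> 3, 4 -> 5, and 6 -> 7, and SHM for 0 -> 1, 3 -> 4, 5 -> 6, and 7 -> 0.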