Description
I ran multi-node (2-node) NCCL tests on Ubuntu with the 5.15.x kernel and with the 6.8.x kernel. NCCL performance is much lower on the 6.8.x kernel; the results are below. Is there a specific reason for this? Both runs used nvidia-driver-570 and CUDA 12.8.
5.15.X Kernel
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_GID_INDEX=3
export NCCL_MIN_NCHANNELS=32
export NCCL_MAX_NCHANNELS=32
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1, mlx5_11:1"
# Launch: 8 ranks per node (16 total). Make sure each host has 8 slots.
mpirun --map-by ppr:8:node --bind-to none -np 16 \
  --host que-srv-hpc-49p:8,que-srv-hpc-30p:8 \
  -x NCCL_DEBUG -x NCCL_IB_DISABLE -x NCCL_SOCKET_IFNAME \
  -x NCCL_IB_HCA -x NCCL_IB_GID_INDEX \
  -x NCCL_MIN_NCHANNELS -x NCCL_MAX_NCHANNELS \
  -x LD_LIBRARY_PATH -x PATH \
  ~/nccl-tests/build/all_reduce_perf -b 16G -e 16G -f 2 -g 1
Out of bounds values : 0 OK
Avg bus bandwidth : 481
6.8.X Kernel
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_GID_INDEX=3
export NCCL_MIN_NCHANNELS=32
export NCCL_MAX_NCHANNELS=32
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1, mlx5_11:1"
mpirun --map-by ppr:8:node --bind-to none -np 16 \
  --host que-srv-hpc-49p:8,que-srv-hpc-30p:8 \
  -x NCCL_DEBUG -x NCCL_IB_DISABLE -x NCCL_SOCKET_IFNAME \
  -x NCCL_IB_HCA -x NCCL_IB_GID_INDEX \
  -x NCCL_MIN_NCHANNELS -x NCCL_MAX_NCHANNELS \
  -x LD_LIBRARY_PATH -x PATH \
  ~/nccl-tests/build/all_reduce_perf -b 16G -e 16G -f 2 -g 1
Out of bounds values : 0 OK
Avg bus bandwidth : 104
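
To put a number on the gap between the two runs, here is a rough sketch that compares the two "Avg bus bandwidth" values reported above. The `bw_compare` helper is hypothetical (it is not part of nccl-tests), and the values are assumed to be in GB/s, which is the unit nccl-tests uses for bus bandwidth.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of nccl-tests: compares a baseline
# "Avg bus bandwidth" value with a current one.
#   $1 = baseline (5.15.x run), $2 = current (6.8.x run)
bw_compare() {
  awk -v base="$1" -v cur="$2" 'BEGIN {
    # Slowdown factor and percentage drop relative to the baseline.
    printf "slowdown=%.1fx drop=%.1f%%\n", base / cur, (base - cur) / base * 100
  }'
}

# Values copied from the two runs above.
bw_compare 481 104
```

With the numbers from this report, the 6.8.x run comes out roughly 4.6x slower, a drop of about 78% from the 5.15.x result.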