Multi-node all_reduce_perf / alltoall_perf failures: socketFinalizeAccept wrong type 4 != 3 over TCP and IB QP invalid request errors #360

@AlexanderNekrasov

I'm trying to run the all_reduce_perf benchmark across multiple nodes, and it fails in every configuration I have tried. For example, the following script:

#!/usr/bin/env bash
set -euo pipefail

IFACE="ens1"
BIN="./build/all_reduce_perf_mpi"

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_ASYNC_ERROR_HANDLING=1

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME="${IFACE}"

mpirun --allow-run-as-root --mca orte_base_help_aggregate 0 --mca oob_tcp_if_include "${IFACE}" --mca btl_tcp_if_include "${IFACE}" \
  -H "${HEAD_HOST}:8,${WORKER_HOST}:8" \
  -np 16 \
  "${BIN}" -b 8M -e 128M -f 2 -g 1

gives this output:

node-0:5124:5182 [7] NCCL INFO Connected to proxy localRank 7 -> connection 0x7c82e0002238
node-0:5124:5182 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM
node-0:5124:5166 [7] NCCL INFO New proxy send connection 46 from local rank 7, transport 0
node-0:5124:5182 [7] NCCL INFO Connected to proxy localRank 7 -> connection 0x7c82e00022b0
node-0:5124:5182 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM
node-0:5124:5166 [7] NCCL INFO New proxy send connection 47 from local rank 7, transport 0
node-0:5124:5182 [7] NCCL INFO Connected to proxy localRank 7 -> connection 0x7c82e0002328
node-0:5113:5183 [3] NCCL INFO Connected to proxy localRank 2 -> connection 0x7b6fc0002b98
node-0:5111:5176 [2] NCCL INFO New proxy send connection 65 from local rank 3, transport 0
node-0:5121:5169 [6] NCCL INFO New proxy send connection 65 from local rank 7, transport 0
node-0:5124:5166 [7] NCCL INFO New proxy send connection 48 from local rank 6, transport 0
node-0:5124:5182 [7] NCCL INFO Connected to proxy localRank 6 -> connection 0x76d3d0002b98
node-0:5121:5186 [6] NCCL INFO Connected to proxy localRank 7 -> connection 0x7c82e00023a0
node-0:5118:5168 [5] NCCL INFO New proxy send connection 65 from local rank 6, transport 0
node-0:5121:5186 [6] NCCL INFO Connected to proxy localRank 5 -> connection 0x7dc2cc002b98
node-0:5124:5166 [7] NCCL INFO New proxy send connection 49 from local rank 0, transport 0
node-0:5109:5188 [0] NCCL INFO Connected to proxy localRank 7 -> connection 0x7c82e0002418

node-0:5109:5177 [0] misc/socket.cc:502 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
node-0:5109:5177 [0] NCCL INFO misc/socket.cc:628 -> 3
node-0:5109:5177 [0] NCCL INFO misc/socket.cc:653 -> 3
node-0:5109:5177 [0] NCCL INFO transport/net_socket.cc:403 -> 3
node-0:5109:5177 [0] NCCL INFO transport/net.cc:883 -> 3

node-0:5109:5177 [0] misc/socket.cc:502 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
node-0:5109:5177 [0] NCCL INFO misc/socket.cc:628 -> 3
node-0:5109:5177 [0] NCCL INFO misc/socket.cc:653 -> 3
node-0:5109:5177 [0] NCCL INFO transport/net_socket.cc:403 -> 3
node-0:5109:5177 [0] NCCL INFO transport/net.cc:883 -> 3

node-0:5111:5176 [2] misc/socket.cc:502 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
node-0:5111:5176 [2] NCCL INFO misc/socket.cc:628 -> 3
node-0:5111:5176 [2] NCCL INFO misc/socket.cc:653 -> 3
node-0:5111:5176 [2] NCCL INFO transport/net_socket.cc:403 -> 3
node-0:5111:5176 [2] NCCL INFO transport/net.cc:883 -> 3

node-0:5111:5176 [2] misc/socket.cc:502 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
node-0:5111:5176 [2] NCCL INFO misc/socket.cc:628 -> 3
node-0:5111:5176 [2] NCCL INFO misc/socket.cc:653 -> 3
node-0:5111:5176 [2] NCCL INFO transport/net_socket.cc:403 -> 3
node-0:5111:5176 [2] NCCL INFO transport/net.cc:883 -> 3

The complete log is attached:

log_no_ib.txt
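
One thing that may be worth ruling out for the TCP case is how the environment reaches the remote ranks. Open MPI's `mpirun` generally forwards exported variables to remote hosts only when they are passed with `-x`, so `NCCL_IB_DISABLE=1` might take effect on the launching node only. In the NCCL sources, socket types 3 and 4 appear to correspond to the socket and IB transports, which would be consistent with such a mismatch. Both points are assumptions to verify against the installed Open MPI and NCCL versions. A minimal sketch of the same launch with explicit forwarding follows; the `node-0`/`node-1` defaults and the `RUN` switch are hypothetical and exist only for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical defaults; replace with the real hostnames, interface and binary.
HEAD_HOST="${HEAD_HOST:-node-0}"
WORKER_HOST="${WORKER_HOST:-node-1}"
IFACE="${IFACE:-ens1}"
BIN="${BIN:-./build/all_reduce_perf_mpi}"

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME="${IFACE}"

# Build one "-x NAME" pair per exported NCCL_* variable so that mpirun
# passes it to the ranks on the remote host as well.
forward_args=()
while IFS= read -r name; do
  forward_args+=(-x "${name}")
done < <(compgen -e | grep '^NCCL_' | sort)

cmd=(mpirun --allow-run-as-root
  --mca oob_tcp_if_include "${IFACE}" --mca btl_tcp_if_include "${IFACE}"
  "${forward_args[@]}"
  -H "${HEAD_HOST}:8,${WORKER_HOST}:8" -np 16
  "${BIN}" -b 8M -e 128M -f 2 -g 1)

# Dry run by default: print the command. Set RUN=1 to actually launch it.
if [[ "${RUN:-0}" == "1" ]]; then
  "${cmd[@]}"
else
  printf '%s ' "${cmd[@]}"
  printf '\n'
fi
```

With `NCCL_DEBUG=INFO`, the init lines from ranks on both hosts should then show the same network transport being selected; if they still differ, the forwarding is probably not the cause.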

Enabling IB makes all_reduce_perf pass, but when I switch to alltoall_perf I hit a different issue:

node-0:5898:6033 [0] NCCL INFO NET/IB: IbDev 0 Port 1 qpn 2740 set_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0}

node-0:5911:5971 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_3:1 async fatal event on QP (0x732d78661238): invalid request local work queue error

node-0:5911:5971 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_3:1 async fatal event on QP (0x732d78f4d338): invalid request local work queue error

node-0:5911:5971 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_3:1 async fatal event on QP (0x732d78f9bc38): invalid request local work queue error

node-0:5914:6002 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_7:1 async fatal event on QP (0x777e20ed1a48): invalid request local work queue error

node-0:5914:6002 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_7:1 async fatal event on QP (0x777e2033fa48): invalid request local work queue error

node-0:5900:5962 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_2:1 async fatal event on QP (0x71b3442663d8): invalid request local work queue error

node-0:5902:5992 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_6:1 async fatal event on QP (0x7bf530835c08): invalid request local work queue error

node-0:5907:6008 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_5:1 async fatal event on QP (0x741c98c56c08): invalid request local work queue error

node-0:5907:6008 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_5:1 async fatal event on QP (0x741c985a0be8): invalid request local work queue error

node-0:5899:5969 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_4:1 async fatal event on QP (0x762cfc06f8c8): invalid request local work queue error

node-0:5899:5969 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_4:1 async fatal event on QP (0x762cfc0e23c8): invalid request local work queue error

node-0:5899:6045 [1] transport/net_ib.cc:113 NCCL WARN communicator encountered a fatal error (detected in ncclIbIrecv)

node-0:5899:6045 [1] NCCL INFO transport/net_ib.cc:2117 -> 2
node-0:5899:6045 [1] NCCL INFO transport/net.cc:1335 -> 2
node-0:5899:6045 [1] NCCL INFO proxy.cc:728 -> 2
node-0:5899:6045 [1] NCCL INFO proxy.cc:912 -> 2 [Progress Thread]

node-0:5899:6045 [1] transport/net_ib.cc:113 NCCL WARN communicator encountered a fatal error (detected in ncclIbIrecv)

Full log:

log_ib.txt
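
To see whether the QP errors are spread across all HCAs or concentrated on a few, the warnings in a log such as the one above can be tallied per device. This is only a sketch: the helper name `count_qp_errors` is made up for illustration, and the sample input reuses three lines from the excerpt above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count "async fatal event on QP" warnings per IB device:port.
# Reads the files given as arguments, or stdin when none are passed.
count_qp_errors() {
  { grep -hF 'async fatal event on QP' -- "$@" || true; } \
    | sed -E 's#.*NET/IB : ([^ ]+) async fatal event.*#\1#' \
    | sort | uniq -c | sort -k1,1nr -k2,2
}

# Sample input copied from the log excerpt; use the real log file in practice.
sample=$(cat <<'EOF'
node-0:5911:5971 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_3:1 async fatal event on QP (0x732d78661238): invalid request local work queue error
node-0:5914:6002 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_7:1 async fatal event on QP (0x777e20ed1a48): invalid request local work queue error
node-0:5911:5971 [0] transport/net_ib.cc:154 NCCL WARN NET/IB : mlx5_3:1 async fatal event on QP (0x732d78f4d338): invalid request local work queue error
EOF
)

summary=$(printf '%s\n' "${sample}" | count_qp_errors)
printf '%s\n' "${summary}"
```

Against the attached file, the call would be `count_qp_errors log_ib.txt`.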

All single-node tests pass without any issues.

Is this a known symptom of a misconfiguration, or does it point to something else?
