
Conversation

@saifhhasan

Background

Bootstrap operations allow the Communicator to initialize. They run as part of job startup, when the process groups / communicators are created.

Bootstrap offers APIs to perform I/O over sockets (e.g. AllGather, Barrier, Send/Recv). For large-scale jobs, the ordering of bootstrap operations has higher skew (e.g. late joiners), and the overall I/O (number of messages and data size) is large.

In our large-scale training, we observed that bootstrap pins the CPU at 100% (one core per process) and takes much longer as the job scales. This competes with other CPU-intensive work, especially during the startup phase of the job.

We found this by tracing CPU usage for the duration of the job's initialization, with socket.cc being the highest contributor.

[Screenshot: CPU profile during job initialization, showing socket.cc as the top contributor]

Fix

For non-blocking socket reads/writes, socketProgress (which in turn calls socketProgressOpt) returns immediately if there are no bytes to read or write, and socketWait then calls socketProgress again, causing a busy read/write loop.
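The pre-fix wait loop looks roughly like this (a simplified sketch for illustration, not the exact socket.cc code):

// Simplified sketch of the current wait loop: socketProgress returns
// immediately when the fd is not ready, so the loop spins on a full core.
static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int size, int* offset) {
  while (*offset < size) {
    NCCLCHECK(socketProgress(op, sock, ptr, size, offset)); // non-blocking; may make no progress
  }
  return ncclSuccess;
}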

The fix is to use poll in socketWait, which waits until the socket becomes readable/writable to avoid the busy loop, while still permitting us to abort the operation within a 1 ms timeout. This is similar to what the existing code does for socketPollConnect.

@sjeaugey
Member

sjeaugey commented Jul 2, 2025

Thanks Saif. I think I'd simplify / rework this a little bit to make it less intrusive in the main code:

  • Extract the poll call to a separate function.
  • Remove the op == NCCL_SOCKET_SEND || op == NCCL_SOCKET_RECV given those are the only two options.
  • Also add a NCCL_PARAM to control whether we want socket wait operations to be blocking or actively polling.

So something like:

NCCL_PARAM(SockWaitSleep, "SOCK_WAIT_SLEEP", 1);

// Block on the fd (with a short timeout) until it becomes ready for the
// requested operation, instead of spinning in socketProgress.
static void sleepOnFd(int fd, int op) {
  struct pollfd pfd;
  int timeout = 1; // milliseconds
  pfd.fd = fd;
  pfd.events = op == NCCL_SOCKET_SEND ? POLLOUT : POLLIN;
  poll(&pfd, 1, timeout);
}

static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int size, int* offset) {
  while (*offset < size) {
    NCCLCHECK(socketProgress(op, sock, ptr, size, offset));
    if (*offset < size && ncclParamSockWaitSleep()) sleepOnFd(sock->fd, op);
  }
  return ncclSuccess;
}

Let me know what you think. I didn't test the code (it may be broken), I only wrote it to illustrate the idea.

@saifhhasan
Author

Hi @sjeaugey, thank you for the feedback. Let me iterate on the pull request to address your comments. I'll name the param SOCKET_POLL_TIMEOUT_MS to be semantically correct, since it won't really sleep but rather wait until that timeout before returning.

P.S. Pardon the delayed response; I've been on leave for a while.

@saifhhasan force-pushed the meta-saif-socket-poll branch from 218f618 to 7b90715 on July 14, 2025, 18:46
@saifhhasan
Author

Hi @sjeaugey, I have updated the PR as per your suggestion.

  • Introduced a new param
  • Refactored the socket poll into a helper function (roughly as sketched below)
  • Retested by running the AllReduce perf test, which exercises this code path
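For illustration, the reworked wait path might look roughly like this (a sketch only, not the exact PR code; the helper name is hypothetical, the parameter follows the SOCKET_POLL_TIMEOUT_MS naming above, the merged version is exposed as NCCL_SOCKET_POLL_TIMEOUT_MSEC per the release notes further down, and the default of 0 reflects the later comments below):

NCCL_PARAM(SocketPollTimeoutMs, "SOCKET_POLL_TIMEOUT_MS", 0);

// Hypothetical helper: wait (up to the configured timeout) for the fd to
// become readable/writable instead of spinning in socketProgress.
static void ncclSocketPollFd(int fd, int op, int timeoutMs) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = op == NCCL_SOCKET_SEND ? POLLOUT : POLLIN;
  poll(&pfd, 1, timeoutMs);
}

static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int size, int* offset) {
  int timeoutMs = (int)ncclParamSocketPollTimeoutMs();
  while (*offset < size) {
    NCCLCHECK(socketProgress(op, sock, ptr, size, offset));
    // Only wait on the fd when polling is enabled (timeout > 0); otherwise keep the original busy loop.
    if (*offset < size && timeoutMs > 0) ncclSocketPollFd(sock->fd, op, timeoutMs);
  }
  return ncclSuccess;
}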

@sjeaugey
Member

That patch looks pretty clean, plus the parameter to set the sleep time is nice. I don't think we'd enable it by default right away, but other than that, it seems like a good idea.

Build and run nccl_allreduce_perf ensuring the initialization path is exercised for testing.
@saifhhasan force-pushed the meta-saif-socket-poll branch from 7b90715 to 3f5073b on July 15, 2025, 17:52
@saifhhasan
Author

Sounds good @sjeaugey. I've set the param to 0 to disable it by default. Thank you for helping review the diffs.

@saifhhasan
Author

Hi @sjeaugey, wondering if you could take a look at this patch and help get it going. Happy to iterate to ensure it adheres to the NCCL team's practices and expectations.

@marksantesson
Collaborator

marksantesson commented Jul 21, 2025

Hi @saifhhasan, I'm on the NCCL team and this is on my list of tasks to take up soon.

@marksantesson
Collaborator

@saifhhasan, this has been merged into the v2.29 branch. Unfortunately, I did not get it approved in time for it to go through the v2.28 testing.

marksantesson added a commit that referenced this pull request Dec 24, 2025
Device API Improvements:
- Supports Device API struct versioning for backwards compatibility with future versions.
- Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
- Adds host-accessible device pointer functions from symmetric registered ncclWindows.
- Adds improved GIN documentation to clarify the support matrix.

New One-Sided Host APIs:
- Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc) for both network and NVL using zero-SM.
- One-sided communication operations write data from a local buffer to a remote peer’s registered memory window without explicit participation from the target process.
- Utilizes CopyEngine for NVL transfer and CPU proxy for network.
- Requires CUDA 12.5 or greater.

New Experimental Python language binding (NCCL4Py):
- Pythonic NCCL API for Python applications - native collectives, P2P and other NCCL operations.
- Interoperable with CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
- Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).

New LLVM intermediate representation (IR) support:
- Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
- Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSL).
- Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
- Requires CUDA 12 and Clang 21.

Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:
- Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
- Requires symmetric memory registration and GIN.

New ncclCommGrow API:
- Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
- Use ncclCommGrow with ncclCommShrink to adjust membership of communicators in response to failing and recovering nodes.
- Also addresses the need for elastic applications to expand a running job by integrating new ranks.

Multi-segment registration:
- Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib and nvls transports.
- Enables support for expandable segments in PyTorch.

Improves scalability of AllGatherV pattern:
- Adds support for a scalable allgatherv pattern (group of broadcasts).
- Adds new scheduler path and new kernels to improve performance at large scale.

Debuggability & Observability Improvements:
- RAS supports realtime monitoring to continuously track peer status changes.
- Inspector adds support for Prometheus format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
- Adds profiler support for CopyEngine(CE) based collectives.

Community Engagement:
- Adds contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
- Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (Github PR #1759; a usage sketch follows this list.)
- Fixes segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (Github PR #1881)
- Fixes crash that can happen when calling p2p and then collectives while using the same user buffer. (Github Issue #1859)
- Fixes bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (Github Issue #1876)
- Clears non-fatal CUDA errors so they do not propagate. (Pytorch Issue #164402)
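As an example, here is a hypothetical usage sketch (not part of the release notes) for opting in to the poll-based bootstrap wait, assuming a positive value is the poll timeout in milliseconds and 0, the default, keeps busy polling:

/* Hypothetical usage sketch: enable the poll-based bootstrap wait before NCCL
 * initialization. The variable can equally be exported from the job launcher. */
#include <stdlib.h>
#include "nccl.h"

ncclResult_t initWithPolledBootstrap(ncclComm_t* comm, ncclUniqueId id, int nranks, int rank) {
  setenv("NCCL_SOCKET_POLL_TIMEOUT_MSEC", "1", 1); // 1 ms poll timeout; 0 keeps the default busy polling
  return ncclCommInitRank(comm, nranks, id, rank);
}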

Other Improvements:
- Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
- Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
- Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
- Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
- Prints git branch and commit checksum at the INFO level during NCCL initialization.
- Improves support for symmetric window registrations on CUDA versions prior to 12.1.
- Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
- All2all, send, recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of available network devices as dictated by the Network Device Policy.
- Fixes a hang on GB200/300 + CX8 when the user disables GDR.
- Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield “no algorithm/protocol available”.
- ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
- More prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
- Upgrades to doca gpunetio v1.1.

Known Limitations:
- Since Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
- One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add cuda graph support.
- The improved AllGatherV support breaks the NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until it is fixed in a future release.
- NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.