Use poll to avoid busy loops for bootstrap socket-io #1759
|
Thanks Saif. I think I'd simplify / rework this a little bit to make it less intrusive in the main code:
So something like: Let me know what you think. I didn't test the code (it may be broken), I only wrote it to illustrate the idea. |
|
Hi @sjeaugey, thank you for the feedback. Let me iterate the pull request to address your comments. I'll name param as p.s. Pardon me for delayed response as I've been on a leave for a while. |
Force-pushed from 218f618 to 7b90715.
|
Hi @sjeaugey, I have updated the PR as per your suggestion.
|
|
That patch looks pretty clean, plus the parameter to set the sleep time is nice. I don't think we'd enable it by default right away, but other than that, it seems like a good idea. |
Build and run nccl_allreduce_perf ensuring Initialization path is exercised for testing.
Force-pushed from 7b90715 to 3f5073b.
|
Sounds good @sjeaugey .. Set the param |
|
Hi @sjeaugey - Wondering if you can help take a look at this patch and get it going. Happy to iterate to ensure it adheres to NCCL team's practices and expectations. |
|
Hi @saifhhasan , I'm on the NCCL team and this is on my list of tasks to take up soon. |
|
@saifhhasan, this has been merged into the v2.29 branch. Unfortunately, I did not get it approved in time for it to go through the v2.28 testing. |
Device API Improvements:
- Supports Device API struct versioning for backwards compatibility with future versions.
- Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
- Adds host-accessible device pointer functions from symmetric registered ncclWindows.
- Adds improved GIN documentation to clarify the support matrix.

New One-Sided Host APIs:
- Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc) for both network and NVL using zero-SM.
- One-sided communication operation writes data from the local buffer to a remote peer’s registered memory window without explicit participation from the target process.
- Utilizes CopyEngine for NVL transfer and CPU proxy for network.
- Requires CUDA 12.5 or greater.

New Experimental Python language binding (NCCL4Py):
- Pythonic NCCL API for Python applications - native collectives, P2P and other NCCL operations.
- Interoperable with CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
- Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).

New LLVM intermediate representation (IR) support:
- Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
- Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSL).
- Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
- Requires CUDA 12 and Clang 21.

Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:
- Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
- Requires symmetric memory registration and GIN.

New ncclCommGrow API:
- Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
- Use ncclCommGrow with ncclCommShrink to adjust membership of communicators in response to failing and recovering nodes.
- Also addresses the need for elastic applications to expand a running job by integrating new ranks.

Multi-segment registration:
- Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib and nvls transports.
- Enables support for expandable segments in PyTorch.

Improves scalability of AllGatherV pattern:
- Adds support for a scalable allgatherv pattern (group of broadcasts).
- Adds new scheduler path and new kernels to improve performance at large scale.

Debuggability & Observability Improvements:
- RAS supports realtime monitoring to continuously track peer status changes.
- Inspector adds support for Prometheus format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
- Adds profiler support for CopyEngine(CE) based collectives.

Community Engagement:
- Adds contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
- Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (Github PR #1759)
- Fixes segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (Github PR #1881)
- Fixes crash that can happen when calling p2p and then collectives while using the same user buffer. (Github Issue #1859)
- Fixes bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (Github Issue #1876)
- Clears non-fatal CUDA errors so they do not propagate. (Pytorch Issue #164402)

Other Improvements:
- Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
- Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
- Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
- Enables built-in symmetric kernels only on fully connected nvlink systems, as PCIE systems do not perform as well.
- Prints git branch and commit checksum at the INFO level during NCCL initialization.
- Improves support for symmetric window registrations on CUDA versions prior to 12.1.
- Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
- All2all, send, recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of available network devices as dictated by the Network Device Policy.
- Fixes a hang on GB200/300 + CX8 when the user disables GDR.
- Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield “no algorithm/protocol available”.
- ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
- More prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
- Upgrades to doca gpunetio v1.1.

Known Limitations:
- Since Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
- One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add cuda graph support.
- The improved AllGatherV support breaks the NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until it is fixed in a future release.
- NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
Background
Bootstrap operations allow a Communicator to initialize. They run during job startup, when PGs/Communicators are created.
Bootstrap offers APIs to perform I/O over sockets (e.g. AllGather, Barrier, Send/Recv). In large-scale jobs, bootstrap operations see higher skew (e.g. late joiners), and the overall I/O volume (number of messages and data size) is large.
In our large-scale training, we observed that Bootstrap pins the CPU at 100% (one core per process) and lasts much longer as the job scales. This competes with other CPU-intensive work, especially during the startup phase of the job.
We found this by tracing CPU usage for the duration of the job's initialization; socket.cc was the highest contributor.
Fix
For non-blocking socket reads and writes, the function socketProgress (which in turn calls socketProgressOpt) returns immediately if there are no bytes to read or write. socketWait then calls socketProgress again in a loop, which causes the busy read/write loop.
The fix is to use poll in socketWait. poll waits until the socket becomes readable or writable, which avoids the busy loop, and its timeout still lets us abort the operation within 1 ms. This is similar to what the existing code does for socketPollConnect.
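The idea can be illustrated with a minimal, standalone sketch. This is not the code from the patch or from NCCL's socket.cc: the helper name waitRecvAll, its return codes, and the abortFlag parameter are hypothetical and chosen only for illustration. It covers the read direction only and assumes a POSIX system where recv accepts MSG_DONTWAIT (e.g. Linux).

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper (not NCCL's socketWait): receive exactly `size` bytes
// without spinning. When no data is available, the thread sleeps in poll()
// for at most `pollTimeoutMs` milliseconds, then re-checks the abort flag.
// Returns 0 on success, 1 if aborted, -1 on error or if the peer closed.
int waitRecvAll(int fd, void* buf, size_t size,
                const std::atomic<bool>& abortFlag, int pollTimeoutMs = 1) {
  assert(buf != nullptr || size == 0);
  char* p = static_cast<char*>(buf);
  size_t offset = 0;
  while (offset < size) {
    if (abortFlag.load(std::memory_order_acquire)) return 1;

    // Non-blocking read attempt: returns right away if nothing is pending.
    ssize_t n = recv(fd, p + offset, size - offset, MSG_DONTWAIT);
    if (n > 0) { offset += static_cast<size_t>(n); continue; }
    if (n == 0) return -1;                      // peer closed the connection
    if (errno == EINTR) continue;               // interrupted, retry
    if (errno != EAGAIN && errno != EWOULDBLOCK) return -1;

    // Nothing to read yet: block in poll() instead of retrying immediately.
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    int rc = poll(&pfd, 1, pollTimeoutMs);
    if (rc < 0 && errno != EINTR) return -1;
    // rc == 0 is a timeout: loop back so the abort flag is checked again.
    // Socket errors reported in revents surface on the next recv() call.
  }
  return 0;
}
```

With a 1 ms timeout, an idle waiter wakes up at most about a thousand times per second instead of spinning continuously, and an abort request is still noticed within roughly one timeout period. The write direction would follow the same pattern with send and POLLOUT.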