@romerojosh (Collaborator) commented Apr 11, 2025

For large-scale cases with many tasks, the pipelined backends must launch many (usually small) individual pack/unpack kernels to overlap with the communication operations. In some cases, the time spent launching the full set of packing kernels delays the launch of the first communication operation in the pipeline, which reduces overlap efficiency.

This PR adds the ability to use the CUDA Graphs APIs to capture and replay the sequence of packing kernel launches for the pipelined backends. This reduces the time needed to launch all the packing kernels and, as a result, improves overlap efficiency.
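For readers unfamiliar with the capture/replay pattern, here is a minimal, self-contained sketch of the general technique. It is not the code from this PR: the `pack_chunk` kernel, the buffer layout, and the chunk counts are hypothetical stand-ins for the real packing kernels. It assumes CUDA 11.4 or newer, where `cudaGraphInstantiateWithFlags` is available.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                   cudaGetErrorString(err_), __FILE__, __LINE__);     \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Hypothetical stand-in for one small packing kernel:
// copies one chunk of the input into the send buffer.
__global__ void pack_chunk(const float* src, float* dst, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];
}

int main() {
  const int num_chunks = 64;     // illustrative: one small kernel per chunk
  const int chunk_elems = 1024;  // illustrative chunk size
  const size_t bytes = size_t(num_chunks) * chunk_elems * sizeof(float);

  float *src = nullptr, *dst = nullptr;
  CHECK_CUDA(cudaMalloc(&src, bytes));
  CHECK_CUDA(cudaMalloc(&dst, bytes));
  CHECK_CUDA(cudaMemset(src, 0, bytes));

  // Stream capture requires a non-default stream.
  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreate(&stream));

  // Capture: the launches below are recorded into a graph, not executed.
  cudaGraph_t graph;
  CHECK_CUDA(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
  const int threads = 256;
  const int blocks = (chunk_elems + threads - 1) / threads;
  for (int c = 0; c < num_chunks; ++c) {
    const size_t offset = size_t(c) * chunk_elems;
    pack_chunk<<<blocks, threads, 0, stream>>>(src + offset, dst + offset,
                                               chunk_elems);
  }
  CHECK_CUDA(cudaStreamEndCapture(stream, &graph));

  // Instantiate once; the executable graph can then be replayed many times.
  cudaGraphExec_t graph_exec;
  CHECK_CUDA(cudaGraphInstantiateWithFlags(&graph_exec, graph, 0));

  // Replay: each iteration submits all recorded kernels with a single call.
  for (int iter = 0; iter < 10; ++iter) {
    CHECK_CUDA(cudaGraphLaunch(graph_exec, stream));
  }
  CHECK_CUDA(cudaStreamSynchronize(stream));
  std::printf("replayed %d packing kernels per graph launch\n", num_chunks);

  CHECK_CUDA(cudaGraphExecDestroy(graph_exec));
  CHECK_CUDA(cudaGraphDestroy(graph));
  CHECK_CUDA(cudaStreamDestroy(stream));
  CHECK_CUDA(cudaFree(src));
  CHECK_CUDA(cudaFree(dst));
  return 0;
}
```

The one-time capture and instantiation cost is amortized over repeated replays, so the pattern pays off when the same launch sequence (same kernels, arguments, and launch configuration) is issued many times.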

For now, this feature is opt-in via a new environment variable, CUDECOMP_ENABLE_CUDA_GRAPHS. It may be enabled by default in the future.

@romerojosh merged commit 1c0edde into main on Apr 16, 2025
@romerojosh deleted the graphs branch on July 8, 2025 22:26