Improve packing kernel launch efficiency for pipelined backends using CUDA graphs. #68
For large-scale cases with many tasks, the pipelined backends require launching many (usually small) individual pack/unpack kernels to overlap with the communication operations. In some cases, the time it takes to launch the full set of packing kernels can delay the launch of the first communication operation in the pipeline, reducing overlap efficiency.
This PR adds the ability to use CUDA Graphs APIs to capture/replay the sequence of packing kernel launches for the pipelined backends. This reduces the time it takes to launch all the packing kernels and as a result, improves overlap efficiency.
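For reference, the general pattern is CUDA stream capture: record the sequence of packing-kernel launches into a graph once, then replay the whole sequence with a single `cudaGraphLaunch` call. The sketch below is a generic illustration of that API pattern, not cuDecomp's actual implementation; the `pack_kernel`, chunk sizes, and launch configuration are all hypothetical placeholders.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the many small packing kernels.
__global__ void pack_kernel(float* dst, const float* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];
}

int main() {
  const int num_chunks = 64, n = 256;
  float *src, *dst;
  cudaMalloc(&src, num_chunks * n * sizeof(float));
  cudaMalloc(&dst, num_chunks * n * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Capture the full sequence of small kernel launches into a graph once.
  cudaGraph_t graph;
  cudaGraphExec_t graph_exec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int c = 0; c < num_chunks; ++c)
    pack_kernel<<<(n + 127) / 128, 128, 0, stream>>>(dst + c * n,
                                                     src + c * n, n);
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

  // On subsequent iterations, replay all launches with one API call,
  // paying the per-kernel launch overhead only during capture.
  cudaGraphLaunch(graph_exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(src);
  cudaFree(dst);
  return 0;
}
```

The savings come from replacing N host-side kernel launches per iteration with a single graph launch, which frees the host to enqueue the first communication operation sooner.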
For now, this feature is opt-in via a new environment variable,
CUDECOMP_ENABLE_CUDA_GRAPHS. It may be enabled by default in the future.
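To try the feature, the variable can be set in the environment before running; the value `1` shown here is an assumed convention for enabling it (check the cuDecomp documentation for the accepted values):

```shell
# Opt in to CUDA graph capture/replay of packing kernel launches.
export CUDECOMP_ENABLE_CUDA_GRAPHS=1
```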