Improve packing kernel launch efficiency for pipelined backends using CUDA graphs. #68
For large-scale cases with many tasks, the pipelined backends require launching many (usually small) individual pack/unpack kernels to overlap with the communication operations. In some cases, the time it takes to launch the full set of packing kernels can delay the launch of the first communication operation in the pipeline, reducing overlap efficiency.
This PR adds the ability to use CUDA Graphs APIs to capture/replay the sequence of packing kernel launches for the pipelined backends. This reduces the time it takes to launch all the packing kernels and as a result, improves overlap efficiency.
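For reference, the general pattern is CUDA stream capture: record the sequence of packing-kernel launches into a graph once, then replay the whole sequence with a single `cudaGraphLaunch` call. The sketch below is a generic illustration of that API pattern, not cuDecomp's actual implementation; the `pack_kernel`, chunk sizes, and launch configuration are all hypothetical placeholders.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the many small packing kernels.
__global__ void pack_kernel(float* dst, const float* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];
}

int main() {
  const int num_chunks = 64, n = 256;
  float *src, *dst;
  cudaMalloc(&src, num_chunks * n * sizeof(float));
  cudaMalloc(&dst, num_chunks * n * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Capture the full sequence of small kernel launches into a graph once.
  cudaGraph_t graph;
  cudaGraphExec_t graph_exec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int c = 0; c < num_chunks; ++c)
    pack_kernel<<<(n + 127) / 128, 128, 0, stream>>>(dst + c * n,
                                                     src + c * n, n);
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

  // On subsequent iterations, replay all launches with one API call,
  // paying the per-kernel launch overhead only during capture.
  cudaGraphLaunch(graph_exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(src);
  cudaFree(dst);
  return 0;
}
```

The savings come from replacing N host-side kernel launches per iteration with a single graph launch, which frees the host to enqueue the first communication operation sooner.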
For now, this feature is opt-in via a new environment variable,
CUDECOMP_ENABLE_CUDA_GRAPHS. It may be enabled by default in the future.
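To try the feature, the variable can be set in the environment before running; the value `1` shown here is an assumed convention for enabling it (check the cuDecomp documentation for the accepted values):

```shell
# Opt in to CUDA graph capture/replay of packing kernel launches.
export CUDECOMP_ENABLE_CUDA_GRAPHS=1
```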