
Add GQA fusion for CUDA EP #24335

Merged
nenad1002 merged 42 commits into main from nebanfic/gqa-fusion-matmul
Apr 16, 2025

Conversation


@nenad1002 nenad1002 commented Apr 7, 2025

Description

Most models can benefit from fusing the nodes that feed GroupQueryAttention (GQA) into a single MatMul or MatMulNBits. This change detects the fusible patterns and performs the fusion on the CUDA EP.
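To illustrate why this fusion is valid, here is a minimal numpy sketch (not the actual ONNX Runtime implementation; shapes and names are hypothetical). Before fusion, the Q/K/V projections feeding GQA are three separate MatMuls on the same input; after fusion, their weights are concatenated along the output dimension so a single MatMul produces all three results, which are then split apart:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64
q_dim, kv_dim = 64, 16  # GQA typically has fewer K/V heads than Q heads

x = rng.standard_normal((2, hidden)).astype(np.float32)
w_q = rng.standard_normal((hidden, q_dim)).astype(np.float32)
w_k = rng.standard_normal((hidden, kv_dim)).astype(np.float32)
w_v = rng.standard_normal((hidden, kv_dim)).astype(np.float32)

# Unfused graph: three MatMuls feeding the attention node.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused graph: one MatMul over the concatenated weight, then a split.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)
qkv = x @ w_qkv
q2, k2, v2 = np.split(qkv, [q_dim, q_dim + kv_dim], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The single packed MatMul reads the input activation once and launches one kernel instead of three, which is where the speedup comes from; the same idea extends to MatMulNBits by packing the quantized weight blocks.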

Motivation and Context

This will enable publishing a single GPU model going forward.

@nenad1002 nenad1002 requested a review from tianleiwu April 16, 2025 16:29

@tianleiwu tianleiwu left a comment


It would be nice to support 8-bit quantization later.

@nenad1002 nenad1002 merged commit 9ab6b87 into main Apr 16, 2025
85 of 90 checks passed
@nenad1002 nenad1002 deleted the nebanfic/gqa-fusion-matmul branch April 16, 2025 17:42
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
### Description
Most models can benefit from fusing the nodes that feed GroupQueryAttention
(GQA) into a single MatMul or MatMulNBits. This change detects the fusible
patterns and performs the fusion on the CUDA EP.


### Motivation and Context
This will enable publishing a single GPU model going forward.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
