
Add GQA fusion for CUDA EP #24335

Merged
nenad1002 merged 42 commits into main from nebanfic/gqa-fusion-matmul
Apr 16, 2025

Conversation


@nenad1002 nenad1002 commented Apr 7, 2025

Description

Most models can benefit from fusing the nodes that feed GroupQueryAttention (GQA) into a single MatMul or MatMulNBits. This change detects the fusible patterns and performs the fusion on the CUDA EP.
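To illustrate why this fusion is valid, here is a minimal numpy sketch (not the actual ONNX Runtime implementation; shapes and names are hypothetical). Before fusion, the Q/K/V projections feeding GQA are three separate MatMuls on the same input; after fusion, their weights are concatenated along the output dimension so a single MatMul produces all three results, which are then split apart:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64
q_dim, kv_dim = 64, 16  # GQA typically has fewer K/V heads than Q heads

x = rng.standard_normal((2, hidden)).astype(np.float32)
w_q = rng.standard_normal((hidden, q_dim)).astype(np.float32)
w_k = rng.standard_normal((hidden, kv_dim)).astype(np.float32)
w_v = rng.standard_normal((hidden, kv_dim)).astype(np.float32)

# Unfused graph: three MatMuls feeding the attention node.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused graph: one MatMul over the concatenated weight, then a split.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)
qkv = x @ w_qkv
q2, k2, v2 = np.split(qkv, [q_dim, q_dim + kv_dim], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The single packed MatMul reads the input activation once and launches one kernel instead of three, which is where the speedup comes from; the same idea extends to MatMulNBits by packing the quantized weight blocks.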

Motivation and Context

This will enable publishing a single GPU model going forward.

@nenad1002 nenad1002 requested a review from tianleiwu April 16, 2025 16:29

@tianleiwu tianleiwu left a comment


It would be nice to support 8-bit quantization later.

@nenad1002 nenad1002 merged commit 9ab6b87 into main Apr 16, 2025
85 of 90 checks passed
@nenad1002 nenad1002 deleted the nebanfic/gqa-fusion-matmul branch April 16, 2025 17:42
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
### Description
Most models can benefit from fusing the nodes that feed GroupQueryAttention
(GQA) into a single MatMul or MatMulNBits. This change detects the fusible
patterns and performs the fusion on the CUDA EP.


### Motivation and Context
This will enable publishing a single GPU model going forward.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
