GQA Rotary and Packed QKV with Flash#18906

Merged
aciddelgado merged 58 commits into main from aciddelgado/gqa_rotary_packed on Jan 24, 2024

@aciddelgado
Contributor

Description

These changes add rotary embedding and packed QKV input to the GroupQueryAttention (GQA) op. For now, both features are supported only with Flash-Attention (SM >= 80); support with Memory Efficient Attention should follow soon.
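
To make the packed QKV input concrete, here is a minimal sketch of how one token's packed vector could be split back into Q, K, and V. It assumes the packed layout is `[Q | K | V]` along the last dimension, with `num_heads * head_size` values for Q and `kv_num_heads * head_size` values each for K and V. The function name `split_packed_qkv` and the layout are illustrative assumptions, not taken from the PR's code.

```python
def split_packed_qkv(packed_row, num_heads, kv_num_heads, head_size):
    """Split one token's packed QKV vector into Q, K, and V slices.

    Assumed layout (hypothetical, for illustration): [Q | K | V] along
    the last dimension.
    """
    q_size = num_heads * head_size
    kv_size = kv_num_heads * head_size
    expected = q_size + 2 * kv_size
    if len(packed_row) != expected:
        raise ValueError(f"expected {expected} values, got {len(packed_row)}")
    q = packed_row[:q_size]
    k = packed_row[q_size:q_size + kv_size]
    v = packed_row[q_size + kv_size:]
    return q, k, v


# 4 query heads share 2 KV heads, head_size 2: 8 + 4 + 4 = 16 values.
packed = list(range(16))
q, k, v = split_packed_qkv(packed, num_heads=4, kv_num_heads=2, head_size=2)
```

In grouped-query attention the K and V slices are smaller than the Q slice because several query heads share each KV head, which is why the packed width is not simply three times the Q width.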

Motivation and Context

By fusing rotary embedding into this attention op, we hope to see some performance gain. Packed QKV input should also help certain models, such as Llama2, that benefit from running ops on the fused QKV matrix rather than on separate Q, K, and V.
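
For readers unfamiliar with the operation being fused, the following is a rough sketch of rotary position embedding applied to a single head vector. It assumes the non-interleaved ("rotate half") pairing, where element `i` is paired with element `i + half`, and a frequency base of 10000. Both are common conventions but are assumptions here; the PR text does not say which variant the kernel uses, and `rotary_embed` is a hypothetical name.

```python
import math


def rotary_embed(x, position, base=10000.0):
    """Rotate each (x[i], x[i + half]) pair by an angle that grows with position."""
    if len(x) % 2 != 0:
        raise ValueError("head size must be even")
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        # Lower-index pairs rotate faster; higher-index pairs rotate slower.
        theta = position * base ** (-2.0 * i / len(x))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q_vec = [1.0, 2.0, 3.0, 4.0]
k_vec = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset between positions.
score_a = dot(rotary_embed(q_vec, 5), rotary_embed(k_vec, 3))
score_b = dot(rotary_embed(q_vec, 2), rotary_embed(k_vec, 0))
```

Because the rotation is applied to Q and K right before the attention scores are computed, doing it inside the attention kernel avoids a separate pass over those tensors, which is the kind of saving the fusion is aiming for.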

tianleiwu
tianleiwu previously approved these changes Jan 22, 2024
@aciddelgado aciddelgado merged commit cbb29d8 into main Jan 24, 2024
@aciddelgado aciddelgado deleted the aciddelgado/gqa_rotary_packed branch January 24, 2024 00:34
YUNQIUGUO pushed a commit that referenced this pull request Jan 30, 2024
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
kunal-vaishnavi added a commit that referenced this pull request Mar 13, 2024
### Description
This PR updates the replacement of MultiHeadAttention (MHA) with
GroupQueryAttention (GQA). It is related to the changes in [this
PR](#18906).

### Motivation and Context
The updated replacement of MHA with GQA includes the following fusion
changes.
- Apply sliding window within GQA
- Fuse the rotary embeddings within GQA
- Fuse the 3 MatMuls into 1 packed MatMul if possible
- Fuse the 3 Adds into 1 packed Add if possible
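
The last two fusions rest on a simple identity: concatenating the Q, K, and V weight matrices column-wise and the three biases end to end gives one MatMul and one Add whose output is the three separate projections laid side by side. The sketch below checks that on tiny integer matrices. The helper names (`matmul`, `add_bias`, `hconcat`) and the data are made up for illustration and are not the fusion code from either PR.

```python
def matmul(x, w):
    """Plain matrix product of x (rows x inner) and w (inner x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def add_bias(y, bias):
    """Add a bias vector to every row of y."""
    return [[v + b for v, b in zip(row, bias)] for row in y]


def hconcat(*mats):
    """Concatenate matrices with equal row counts along the column axis."""
    return [[v for part in rows for v in part] for rows in zip(*mats)]


x = [[1, 2], [3, 4]]                      # 2 tokens, hidden size 2
w_q, b_q = [[1, 0], [0, 1]], [1, 1]       # Q projection: 2 outputs
w_k, b_k = [[2], [1]], [0]                # K projection: 1 output
w_v, b_v = [[0], [3]], [5]                # V projection: 1 output

# Unfused: three MatMuls followed by three Adds.
q = add_bias(matmul(x, w_q), b_q)
k = add_bias(matmul(x, w_k), b_k)
v = add_bias(matmul(x, w_v), b_v)

# Fused: one packed MatMul followed by one packed Add.
w_packed = hconcat(w_q, w_k, w_v)
b_packed = b_q + b_k + b_v
qkv = add_bias(matmul(x, w_packed), b_packed)
```

The "if possible" qualifier in the list matters: a rewrite like this only applies when all three projections read the same input, so a graph where Q, K, and V come from different tensors would keep its separate MatMuls.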
YUNQIUGUO pushed a commit that referenced this pull request Mar 21, 2024
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Aug 19, 2025
@snnn
Contributor

snnn commented Sep 5, 2025

This PR has been cherry-picked into the rel-1.17.0 branch in PR #19327. Removing the release:1.17.0 label.

rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Sep 15, 2025

4 participants