[Unity][BYOC] Add fused patterns for stacked attention #14608
In some models, the Q, K, and V inputs to attention ops initially come from a single stacked tensor, which is then split and reshaped before the attention op is called: stacked_qkv -> split -> reshape -> attention. We can skip the split and reshape ops by manipulating the layout parameters in codegen. This PR adds such fused patterns for stacked attention in BYOC, so that we can codegen directly from stacked_qkv.
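To illustrate why the split and reshape can be skipped, here is a minimal NumPy sketch (not the CUTLASS codegen from this PR). It assumes a contiguous `stacked_qkv` of shape `(batch, seq_len, 3 * num_heads * head_dim)` with Q, K, and V laid out back to back on the last axis; the sizes and the helper `head_view` are hypothetical. Each of Q, K, and V is then fully described by an offset plus strides into the stacked buffer, with no data movement.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hypothetical sizes, chosen only for illustration.
batch, seq_len, num_heads, head_dim = 2, 4, 3, 8
hidden = num_heads * head_dim

rng = np.random.default_rng(0)
stacked_qkv = rng.standard_normal((batch, seq_len, 3 * hidden)).astype(np.float32)

# Reference path: stacked_qkv -> split -> reshape.
q_ref, k_ref, v_ref = (
    part.reshape(batch, seq_len, num_heads, head_dim)
    for part in np.split(stacked_qkv, 3, axis=2)
)


def head_view(buf, index):
    """View of Q (index 0), K (1), or V (2) described only by offset and strides."""
    item = buf.itemsize
    row = 3 * hidden  # elements per (batch, seq) row of the stacked tensor
    flat = buf.reshape(-1)  # a view, because buf is contiguous
    return as_strided(
        flat[index * hidden:],  # starting offset of Q, K, or V
        shape=(batch, seq_len, num_heads, head_dim),
        strides=(seq_len * row * item, row * item, head_dim * item, item),
        writeable=False,
    )


# "Fused" path: same values, read directly from the stacked buffer.
q, k, v = (head_view(stacked_qkv, i) for i in range(3))
```

A kernel that accepts a base pointer and strides per operand can consume these views directly, which is the idea behind adjusting the layout parameters instead of materializing the split.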
    *make_attention_pattern(with_bias=True),
),
(
    "cutlass.stacked_attention",
Does the order of patterns here matter? If a subgraph contains both reshape and attention, will `cutlass.attention`, which matches only a single attention operation, be selected first?
The order matters here. I tried putting stacked attention first, but then the original attention pattern matched first.
* [Unity][BYOC] Add fused patterns for stacked attention

  In some models, the Q, K, and V inputs to attention ops initially come from a single stacked tensor, which is then split and reshaped before the attention op is called: stacked_qkv -> split -> reshape -> attention. We can skip the split and reshape ops by manipulating the layout parameters in codegen. This PR adds such fused patterns for stacked attention in BYOC, so that we can codegen directly from stacked_qkv.

* fix lint
* fix lint
This PR expands support for fused stacked attention patterns to those starting with `strided_slice`. Initially, apache#14608 only supported fused stacked attention patterns starting with `split`. With the help of apache#14583, similar patterns starting with `strided_slice` may now appear as well.
* [Unity][BYOC] Fuse attention pattern with `strided_slice`

  This PR expands support for fused stacked attention patterns to those starting with `strided_slice`. Initially, #14608 only supported fused stacked attention patterns starting with `split`. With the help of #14583, similar patterns starting with `strided_slice` may now appear as well.

* remove useless code
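As a rough illustration of why the `split` and `strided_slice` forms can be treated alike, the NumPy sketch below shows that three equal slices along the last axis select the same data as a three-way split. It uses NumPy slicing as a stand-in for the Relax ops; the shapes and variable names are hypothetical.

```python
import numpy as np

# Hypothetical stacked tensor: (batch, seq_len, 3 * hidden).
hidden = 4
stacked_qkv = np.arange(2 * 5 * 3 * hidden).reshape(2, 5, 3 * hidden)

# Form 1: one three-way split on axis 2.
via_split = np.split(stacked_qkv, 3, axis=2)

# Form 2: three slices on axis 2, in Q, K, V order.
via_slices = [
    stacked_qkv[:, :, i * hidden:(i + 1) * hidden] for i in range(3)
]

same = all(np.array_equal(a, b) for a, b in zip(via_split, via_slices))
```

Both forms address the same regions of the stacked buffer, so a pattern rooted at either one can lead to the same fused attention call.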
This PR is a follow-up to apache#14608 and apache#14649. It adds checks for the fused stacked attention patterns, so that fusion is only enabled when `stacked_qkv` has `ndim=3` and the `split`/`strided_slice` uses `axis=2`.
* [Unity][BYOC] Add check for stacked attention patterns

  This PR is a follow-up to #14608 and #14649. It adds checks for the fused stacked attention patterns, so that fusion is only enabled when `stacked_qkv` has `ndim=3` and the `split`/`strided_slice` uses `axis=2`.

* check the order of strided_slice
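The conditions above could be sketched as a small predicate. This is not the check function from the PR: the name `can_fuse_stacked_qkv`, its signature, and the exact meaning of "order of strided_slice" (here taken to mean three back-to-back ranges in Q, K, V order that cover the whole axis) are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def can_fuse_stacked_qkv(
    stacked_shape: Sequence[int],
    axis: int,
    slices: Optional[Sequence[Tuple[int, int]]] = None,
) -> bool:
    """Hypothetical precondition check for fusing stacked attention."""
    # Stated in the PR description: rank-3 input, split/slice on axis 2.
    if len(stacked_shape) != 3 or axis != 2:
        return False
    if slices is None:
        # `split` case: there are no slice ranges to order-check.
        return True
    # Assumption: exactly three (begin, end) ranges in Q, K, V order,
    # each starting where the previous one ended.
    if len(slices) != 3:
        return False
    expected_begin = 0
    for begin, end in slices:
        if begin != expected_begin or end <= begin:
            return False
        expected_begin = end
    # Assumption: the three ranges together cover the whole axis.
    return expected_begin == stacked_shape[2]
```

For example, `can_fuse_stacked_qkv((2, 4, 72), 2, [(0, 24), (24, 48), (48, 72)])` passes, while swapping the first two ranges or using a rank-4 input is rejected and the subgraph would fall back to the unfused path.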