Conversation
Contributor
Pull request overview
This pull request optimizes the AttentionPrepare operation in the WebGPU BERT Attention operator by replacing a custom QKV preparation kernel with a more modular approach: a MatMul followed by a dedicated SplitPackedQKV kernel. This refactoring improves performance (from 751.67 ms to 128.88 ms in the phi4-vision model) by leveraging the already-optimized MatMul operation, and enhances maintainability through better separation of concerns and reusability.
Key changes:
- Replaced custom AttentionPrepare kernel with MatMul + SplitPackedQKV approach
- Moved SplitPackedQKV implementation from group_query_attention.cc to attention.cc for broader reuse
- Enhanced SplitPackedQKV with vectorization support and an additional kv_hidden_size parameter (see the sketch below)
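To make the MatMul + SplitPackedQKV structure concrete, here is a minimal host-side sketch, assuming a row-major [rows, input_size] input, a weight matrix formed by concatenating the Q, K, and V projection weights along the output dimension, and float data. The function names and shapes are illustrative only; they are not the actual WebGPU kernels in attention.cc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one GEMM against concatenated weights produces a packed
// QKV buffer of shape [rows, q_hidden + 2 * kv_hidden]; a split then slices the
// last axis into Q, K, and V. Mirrors the MatMul + SplitPackedQKV structure.
void MatMulPacked(const std::vector<float>& input,   // [rows, input_size]
                  const std::vector<float>& w_qkv,   // [input_size, packed]
                  std::vector<float>& packed_qkv,    // [rows, packed]
                  size_t rows, size_t input_size, size_t packed) {
  packed_qkv.assign(rows * packed, 0.0f);
  for (size_t r = 0; r < rows; ++r)
    for (size_t k = 0; k < input_size; ++k)
      for (size_t c = 0; c < packed; ++c)
        packed_qkv[r * packed + c] += input[r * input_size + k] * w_qkv[k * packed + c];
}

void SplitPackedQKV(const std::vector<float>& packed_qkv,  // [rows, q_hidden + 2 * kv_hidden]
                    std::vector<float>& q, std::vector<float>& k, std::vector<float>& v,
                    size_t rows, size_t q_hidden, size_t kv_hidden) {
  const size_t packed = q_hidden + 2 * kv_hidden;
  q.resize(rows * q_hidden);
  k.resize(rows * kv_hidden);
  v.resize(rows * kv_hidden);
  for (size_t r = 0; r < rows; ++r) {
    const float* row = &packed_qkv[r * packed];
    // Q occupies the first q_hidden columns, K the next kv_hidden, V the last kv_hidden.
    for (size_t i = 0; i < q_hidden; ++i) q[r * q_hidden + i] = row[i];
    for (size_t i = 0; i < kv_hidden; ++i) k[r * kv_hidden + i] = row[q_hidden + i];
    for (size_t i = 0; i < kv_hidden; ++i) v[r * kv_hidden + i] = row[q_hidden + kv_hidden + i];
  }
}
```

Because the projection becomes a single GEMM against the concatenated weights, the expensive part reuses the optimized MatMul path and the split is reduced to a cheap copy, which matches the reported timing breakdown below.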
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/contrib_ops/webgpu/bert/group_query_attention.h | Removed SplitPackedQKVProgram class declaration (moved to attention.h) |
| onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc | Removed SplitPackedQKV function implementation and updated call site to include new kv_hidden_size parameter |
| onnxruntime/contrib_ops/webgpu/bert/attention_common.h | Added SplitPackedQKV function declaration for shared use across attention operators |
| onnxruntime/contrib_ops/webgpu/bert/attention.h | Added SplitPackedQKVProgram class declaration with updated uniform variables including input_size |
| onnxruntime/contrib_ops/webgpu/bert/attention.cc | Implemented new PrepareQKV using MatMul + SplitPackedQKV, added vectorization support, and refactored to create Q/K/V in BSD format first before converting to BNSH for the non-flash attention path (see the layout sketch below) |
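The attention.cc row above mentions building Q/K/V in BSD layout first and converting to BNSH for the non-flash attention path. As a rough sketch of what that layout conversion means (dimension names and indexing here are assumptions for illustration, not the shader code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: convert a [B, S, N*H] (BSD) tensor into [B, N, S, H] (BNSH).
// B = batch, S = sequence length, N = number of heads, H = head size.
void BSDToBNSH(const std::vector<float>& bsd, std::vector<float>& bnsh,
               size_t B, size_t S, size_t N, size_t H) {
  bnsh.resize(B * N * S * H);
  for (size_t b = 0; b < B; ++b)
    for (size_t s = 0; s < S; ++s)
      for (size_t n = 0; n < N; ++n)
        for (size_t h = 0; h < H; ++h)
          bnsh[((b * N + n) * S + s) * H + h] = bsd[(b * S + s) * (N * H) + n * H + h];
}
```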
guschmue approved these changes on Jan 5, 2026.
alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request on Jan 20, 2026:
This pull request refactors and streamlines the computation of the Q, K, V tensors in the WebGPU BERT Attention operator. The main changes include removing a custom QKV preparation kernel in favor of a more modular approach using a MatMul operation followed by a dedicated split kernel, and generalizing the QKV splitting logic for broader reuse. This improves maintainability, code reuse, and performance, since many optimizations have already been done on the MatMul op. With this change, PrepareQKV drops from 751.67 ms to 128.88 ms in the phi4-vision model.

Before

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|AttentionPrepare | 751.67 | 49.91 |

After

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|MatMul | 120.87 | 19.77 |
| Attention\|SplitPackedQKV | 1.94 | 0.32 |
alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request on Jan 27, 2026.
This pull request refactors and streamlines the computation of the Q, K, V tensors in the WebGPU BERT Attention operator. The main changes include removing a custom QKV preparation kernel in favor of a more modular approach using a MatMul operation followed by a dedicated split kernel (its per-element routing is sketched after the tables below), and generalizing the QKV splitting logic for broader reuse. This improves maintainability, code reuse, and performance, since many optimizations have already been done on the MatMul op.
With this change, PrepareQKV drops from 751.67 ms to 128.88 ms in the phi4-vision model.

Before

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|AttentionPrepare | 751.67 | 49.91 |

After

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|MatMul | 120.87 | 19.77 |
| Attention\|SplitPackedQKV | 1.94 | 0.32 |
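To read "a dedicated split kernel" with the generalized kv_hidden_size concretely: each element of a packed output row of width q_hidden + 2 * kv_hidden belongs to exactly one of Q, K, or V, determined by its offset within the row. Below is a hypothetical per-element routing sketch; the actual kernel performs this per shader invocation, optionally over vectorized groups of contiguous components, and the names here are illustrative only.

```cpp
#include <cstddef>

// Hypothetical routing for one offset within a packed row of width
// q_hidden + 2 * kv_hidden: which output it belongs to (0 = Q, 1 = K, 2 = V)
// and the corresponding offset inside that output row.
struct SplitTarget {
  int output;
  size_t offset;
};

SplitTarget RoutePackedOffset(size_t offset, size_t q_hidden, size_t kv_hidden) {
  if (offset < q_hidden) return {0, offset};                                  // Q region
  if (offset < q_hidden + kv_hidden) return {1, offset - q_hidden};           // K region
  return {2, offset - q_hidden - kv_hidden};                                  // V region
}
```

Allowing q_hidden and kv_hidden to differ is what lets the same split logic serve both standard attention and grouped-query attention, which is why the kv_hidden_size parameter was added when the kernel moved out of group_query_attention.cc.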