Conversation
Contributor
Pull request overview
This pull request optimizes the AttentionPrepare operation in the WebGPU BERT Attention operator by replacing a custom QKV preparation kernel with a more modular approach: a MatMul followed by a dedicated SplitPackedQKV kernel. This refactoring improves performance (from 751.67 ms to 128.88 ms in the phi4-vision model) by leveraging the already-optimized MatMul operation, and enhances maintainability through better separation of concerns and reusability.
Key changes:
- Replaced custom AttentionPrepare kernel with MatMul + SplitPackedQKV approach
- Moved SplitPackedQKV implementation from group_query_attention.cc to attention.cc for broader reuse
- Enhanced SplitPackedQKV with vectorization support and an additional kv_hidden_size parameter (see the sketch below)
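To make the MatMul + SplitPackedQKV structure concrete, here is a minimal host-side sketch, assuming a row-major [rows, input_size] input, a weight matrix formed by concatenating the Q, K, and V projection weights along the output dimension, and float data. The function names and shapes are illustrative only; they are not the actual WebGPU kernels in attention.cc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one GEMM against concatenated weights produces a packed
// QKV buffer of shape [rows, q_hidden + 2 * kv_hidden]; a split then slices the
// last axis into Q, K, and V. Mirrors the MatMul + SplitPackedQKV structure.
void MatMulPacked(const std::vector<float>& input,   // [rows, input_size]
                  const std::vector<float>& w_qkv,   // [input_size, packed]
                  std::vector<float>& packed_qkv,    // [rows, packed]
                  size_t rows, size_t input_size, size_t packed) {
  packed_qkv.assign(rows * packed, 0.0f);
  for (size_t r = 0; r < rows; ++r)
    for (size_t k = 0; k < input_size; ++k)
      for (size_t c = 0; c < packed; ++c)
        packed_qkv[r * packed + c] += input[r * input_size + k] * w_qkv[k * packed + c];
}

void SplitPackedQKV(const std::vector<float>& packed_qkv,  // [rows, q_hidden + 2 * kv_hidden]
                    std::vector<float>& q, std::vector<float>& k, std::vector<float>& v,
                    size_t rows, size_t q_hidden, size_t kv_hidden) {
  const size_t packed = q_hidden + 2 * kv_hidden;
  q.resize(rows * q_hidden);
  k.resize(rows * kv_hidden);
  v.resize(rows * kv_hidden);
  for (size_t r = 0; r < rows; ++r) {
    const float* row = &packed_qkv[r * packed];
    // Q occupies the first q_hidden columns, K the next kv_hidden, V the last kv_hidden.
    for (size_t i = 0; i < q_hidden; ++i) q[r * q_hidden + i] = row[i];
    for (size_t i = 0; i < kv_hidden; ++i) k[r * kv_hidden + i] = row[q_hidden + i];
    for (size_t i = 0; i < kv_hidden; ++i) v[r * kv_hidden + i] = row[q_hidden + kv_hidden + i];
  }
}
```

Because the projection becomes a single GEMM against the concatenated weights, the expensive part reuses the optimized MatMul path and the split is reduced to a cheap copy, which matches the reported timing breakdown below.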
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/contrib_ops/webgpu/bert/group_query_attention.h | Removed SplitPackedQKVProgram class declaration (moved to attention.h) |
| onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc | Removed SplitPackedQKV function implementation and updated call site to include new kv_hidden_size parameter |
| onnxruntime/contrib_ops/webgpu/bert/attention_common.h | Added SplitPackedQKV function declaration for shared use across attention operators |
| onnxruntime/contrib_ops/webgpu/bert/attention.h | Added SplitPackedQKVProgram class declaration with updated uniform variables including input_size |
| onnxruntime/contrib_ops/webgpu/bert/attention.cc | Implemented new PrepareQKV using MatMul + SplitPackedQKV, added vectorization support, and refactored to create Q/K/V in BSD format first before converting to BNSH for the non-flash attention path (see the layout sketch below) |
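The attention.cc row above mentions building Q/K/V in BSD layout first and converting to BNSH for the non-flash attention path. As a rough sketch of what that layout conversion means (dimension names and indexing here are assumptions for illustration, not the shader code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: convert a [B, S, N*H] (BSD) tensor into [B, N, S, H] (BNSH).
// B = batch, S = sequence length, N = number of heads, H = head size.
void BSDToBNSH(const std::vector<float>& bsd, std::vector<float>& bnsh,
               size_t B, size_t S, size_t N, size_t H) {
  bnsh.resize(B * N * S * H);
  for (size_t b = 0; b < B; ++b)
    for (size_t s = 0; s < S; ++s)
      for (size_t n = 0; n < N; ++n)
        for (size_t h = 0; h < H; ++h)
          bnsh[((b * N + n) * S + s) * H + h] = bsd[(b * S + s) * (N * H) + n * H + h];
}
```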
guschmue approved these changes on Jan 5, 2026.
alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request on Jan 20, 2026:
This pull request refactors and streamlines the computation of the Q, K, V tensors in the WebGPU BERT Attention operator. The main changes include removing a custom QKV preparation kernel in favor of a more modular approach using a MatMul operation followed by a dedicated split kernel, and generalizing the QKV splitting logic for broader reuse. This improves maintainability, code reuse, and performance, since many optimizations have already been done on the MatMul op. With this change, PrepareQKV drops from 751.67 ms to 128.88 ms in the phi4-vision model.

Before

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|AttentionPrepare | 751.67 | 49.91 |

After

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|MatMul | 120.87 | 19.77 |
| Attention\|SplitPackedQKV | 1.94 | 0.32 |
alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request on Jan 27, 2026.
This pull request refactors and streamlines the computation of the Q, K, V tensors in the WebGPU BERT Attention operator. The main changes include removing a custom QKV preparation kernel in favor of a more modular approach using a MatMul operation followed by a dedicated split kernel (its per-element routing is sketched after the tables below), and generalizing the QKV splitting logic for broader reuse. This improves maintainability, code reuse, and performance, since many optimizations have already been done on the MatMul op.
With this change, PrepareQKV drops from 751.67 ms to 128.88 ms in the phi4-vision model.

Before

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|AttentionPrepare | 751.67 | 49.91 |

After

| Kernel | Time (ms) | Percentage (%) |
|---|---|---|
| Attention\|MatMul | 120.87 | 19.77 |
| Attention\|SplitPackedQKV | 1.94 | 0.32 |
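To read "a dedicated split kernel" with the generalized kv_hidden_size concretely: each element of a packed output row of width q_hidden + 2 * kv_hidden belongs to exactly one of Q, K, or V, determined by its offset within the row. Below is a hypothetical per-element routing sketch; the actual kernel performs this per shader invocation, optionally over vectorized groups of contiguous components, and the names here are illustrative only.

```cpp
#include <cstddef>

// Hypothetical routing for one offset within a packed row of width
// q_hidden + 2 * kv_hidden: which output it belongs to (0 = Q, 1 = K, 2 = V)
// and the corresponding offset inside that output row.
struct SplitTarget {
  int output;
  size_t offset;
};

SplitTarget RoutePackedOffset(size_t offset, size_t q_hidden, size_t kv_hidden) {
  if (offset < q_hidden) return {0, offset};                                  // Q region
  if (offset < q_hidden + kv_hidden) return {1, offset - q_hidden};           // K region
  return {2, offset - q_hidden - kv_hidden};                                  // V region
}
```

Allowing q_hidden and kv_hidden to differ is what lets the same split logic serve both standard attention and grouped-query attention, which is why the kv_hidden_size parameter was added when the kernel moved out of group_query_attention.cc.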