[CUDA] GQA CUDA Kernel Fusion and Performance Optimization#26920
Conversation
Pull request overview
This PR introduces a significant CUDA kernel optimization for GroupQueryAttention (GQA) by implementing a fused kernel that combines QKV unpacking, Rotary Position Embeddings (RoPE), and KV cache append operations into a single kernel launch. This reduces kernel overhead and memory bandwidth requirements for the first prompt phase.
Key Changes:
- Introduces `UnpackQKVWithRoPEAndAppendKV` fused kernel that consolidates 4-5 separate operations
- Adds `FlashAttentionDecoding` fast path for subsequent prompts/token generation with shared KV buffers
- Refactors sequence length handling to use `past_seq_lens`, `total_seq_lens`, and `padded_seq_lens` arrays
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| test_gqa.py | Adds test for padding scenarios, updates tolerances, adds share_buffer and position_ids test coverage |
| group_query_attention_impl.cu | Implements fused UnpackQKV+RoPE+KVAppend kernel, adds FlashAttentionDecoding fast path |
| attention_kv_cache.cu | Adds fused KV concat with RoPE support, refactors to use past_seq_lens/total_seq_lens |
| rotary_embedding_impl.cu | Adds position_ids_format=2 for implicit position computation from past_seq_lens |
| flash_api.cc | Adds packed QKV support in flash attention API |
| group_query_attention.cc | Refactors buffer allocation and sequence length handling |
| attention_data.h | Updates data structure for new sequence length arrays and position_ids |
Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.
Should we consider fusing RMSNorm inside RoPE as well? Some models such as Qwen-3 apply RMSNorm to Q and K after MatMul and before RoPE.
It is feasible. We can support it in the future.
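For context on the QK-RMSNorm pattern discussed above (used by models such as Qwen-3), here is a minimal NumPy sketch of applying RMSNorm per head to Q after the projection MatMul and before RoPE. The shapes, names, and `eps` value are illustrative, not the ONNX Runtime implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize over the last (head_size) dimension, then apply a learned scale.
    variance = np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True)
    return (x / np.sqrt(variance + eps)) * weight

# Toy shapes: (batch, seq, num_heads, head_size)
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 4, 2, 8)).astype(np.float32)
gamma = np.ones(8, dtype=np.float32)  # per-head-dim scale, all ones here

q_normed = rms_norm(q, gamma)
# With gamma = 1, each head vector now has approximately unit RMS.
rms = np.sqrt(np.mean(q_normed ** 2, axis=-1))
```

In a fused kernel this normalization would run in registers immediately before the rotary rotation, so no extra pass over Q/K would be needed.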
## Summary

This PR significantly improves GroupQueryAttention (GQA) performance on CUDA by fusing multiple kernel launches, improving memory access patterns, and cleaning up sequence length semantics.

## Key Changes

### 1. Fused Kernels for Reduced Launch Overhead

| New Kernel | Operations Fused | Kernels Saved |
|------------|------------------|---------------|
| `UnpackQKVWithRoPEAndAppendKV` | Unpack packed QKV + RoPE Q/K + KV cache append | 4-5 |
| `ConcatNewToPastKVFused` | K append + V append (separate buffer mode) | 1 |
| `ConcatKVInPlaceFused` | K append + V append (shared buffer mode) | 1 |

### 2. New `RotaryDispatcher` Template (`rotary_common.cuh`)

Reusable RoPE implementation for fused kernels supporting:
- `float`, `half`, `BFloat16` element types
- `float2`, `float4` vector types
- Interleaved and half-split rotation modes

### 3. Sequence Length Semantics Cleanup

**Before:** Confusing `seqlens_k` / `seqlens_k_buff` with overloaded meanings.

**After:** Clear separation:
- `past_seq_lens` - offset where new tokens are appended
- `total_seq_lens` - total valid tokens after append
- `padded_seq_lens` - padded length for first-prompt masking

### 4. FlashAttention Fast Decode Path

New optimized path for token generation (`sequence_length == 1`, shared buffer):
- Bypasses the `GetSequenceLengths` kernel
- Passes `past_seq_lens` directly to Flash Attention
- Controlled by the `ORT_DISABLE_FLASH_DECODE` env var

### 5. Integer Overflow Prevention

All KV cache index calculations use `int64_t` to handle large `batch * heads * seq * head_size` products.

### 6. BFloat16 Vectorization

Added a `float4` (8-element) vectorized path for BFloat16 in `ConcatTensorToTensor`.
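The interleaved and half-split rotation modes supported by `RotaryDispatcher` can be illustrated with a small NumPy reference. This is a sketch of the math only, not the CUDA template; the function names are made up here:

```python
import numpy as np

def rope_half_split(x, cos, sin):
    # Half-split (NeoX-style): pair element i with element i + head_size/2.
    h = x.shape[-1] // 2
    x1, x2 = x[..., :h], x[..., h:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

def rope_interleaved(x, cos, sin):
    # Interleaved (GPT-J-style): rotate adjacent (even, odd) element pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out

head_size = 8
pos = 3  # absolute position, e.g. past_seq_len + offset within the new chunk
inv_freq = 1.0 / (10000 ** (np.arange(0, head_size, 2) / head_size))
angles = pos * inv_freq
cos, sin = np.cos(angles), np.sin(angles)

rng = np.random.default_rng(0)
x = rng.standard_normal(head_size)
out_half = rope_half_split(x, cos, sin)
out_int = rope_interleaved(x, cos, sin)
# Both modes are pure rotations, so the vector norm is preserved.
```

In the fused kernel the same rotation is applied to Q and K right after unpacking, which is what lets the separate `RotaryEmbedding` launch be elided in the packed path.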
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ORT_DISABLE_FLASH_DECODE` | `false` | Disable the fast decode optimization |
| `ORT_DISABLE_FUSED_KV` | `false` | Use unfused K/V append kernels |

## Test Changes

### Improved Test Coverage Strategy

Restructured `gqa_cuda_prompt_test_cases()` and `gqa_cuda_past_test_cases()` to explicitly iterate over kernel code path parameters:

```python
# NEW: Primary iteration over kernel code paths
for h in h_sizes_to_test:
    for packed in packed_opts:
        for rotary, rotary_interleaved in rotary_opts:
            for share_buffer in share_buffer_opts:
                # Secondary params (batch, seq, heads) rotate via modulo
```

| Mode | Before | After |
|------|--------|-------|
| Pipeline | 16 tests, 4/12 combos | 42 tests, 8/12 combos |
| Comprehensive | 81 tests, 4/12 combos | 178 tests, 12/12 combos |

### New Test Parameters

- Added `seqs = [(1, 1)]` for edge-case testing
- Added `heads = [(3, 1)]` for non-standard GQA ratios
- Added `h_sizes = [40]` for non-power-of-2 head sizes (tests rotary skip logic)

### New Test Configurations

- `share_buffer` config option (tests both buffer modes)
- `has_position_ids` testing on CUDA
- Padding prompt parity test
- Fused vs. unfused kernel parity tests (`TestFusedKernelParity`)
- Decoding-from-empty-cache test case `(1, 1)`

## Files Changed

**Core:**
- `group_query_attention_impl.cu` - Main implementation refactoring
- `attention_kv_cache.cu` - Fused append kernels
- `flash_api.cc` - Packed QKV stride handling

**New:**
- `rotary_common.cuh` - Reusable RoPE dispatcher

**Tests:**
- `test_gqa.py` - Extended test coverage

## Performance

For decoding or a subsequent prompt, we still use the original flash attention kernel, so performance is almost the same as the baseline. Here we only show results for the first prompt.

Below are results of `benchmark_gqa.py` on an H200 GPU. Note that latency is measured from an ONNX model containing a single GQA node, so it includes extra overhead.
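The modulo rotation of secondary parameters in the test-case generation described above can be sketched as follows. The lists and field names here are illustrative placeholders, not the actual `test_gqa.py` code:

```python
import itertools

# Primary parameters select distinct kernel code paths.
h_sizes = [40, 128]
packed_opts = [False, True]
rotary_opts = [(False, False), (True, False), (True, True)]
share_buffer_opts = [False, True]

# Secondary parameters rotate via modulo so every run still sees
# varied shapes without multiplying the total case count.
batches = [1, 3]
seqs = [(1, 1), (128, 128), (2048, 2048)]
heads = [(32, 8), (3, 1)]

cases = []
for i, (h, packed, (rotary, interleaved), share) in enumerate(
    itertools.product(h_sizes, packed_opts, rotary_opts, share_buffer_opts)
):
    cases.append(dict(
        head_size=h, packed=packed, rotary=rotary,
        rotary_interleaved=interleaved, share_buffer=share,
        batch=batches[i % len(batches)],
        seq=seqs[i % len(seqs)],
        heads=heads[i % len(heads)],
    ))

# 2 head sizes * 2 packing modes * 3 rotary modes * 2 buffer modes
# = 24 primary combinations, each paired with one rotating shape.
```

The design trade-off is that every kernel code path is hit exactly once per mode, while shape diversity comes for free from the rotating secondary parameters.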
The kernel speedup can be larger (see profiling results below).

### prompt-sm90-Llama3-8B-b1-h32_8x128-float16

**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32, kv_heads=8, head_size=128, dtype=float16, gpu=H200`

Dense means Q, K and V are separate inputs. Packed means Q, K and V are packed into one input.

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.470 | 0.277 | **1.70x** | 0.468 | 0.320 | **1.46x** |
| 2048 | 1.001 | 0.517 | **1.94x** | 0.990 | 0.590 | **1.68x** |
| 4096 | 2.691 | 1.174 | **2.29x** | 1.504 | 1.242 | **1.21x** |
| 8192 | 7.780 | 2.292 | **3.39x** | 7.933 | 4.004 | **1.98x** |

### prompt-sm90-Llama3-8B-b1-h32_8x128-bfloat16

**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32, kv_heads=8, head_size=128, dtype=bfloat16, gpu=H200`

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.477 | 0.274 | **1.74x** | 0.486 | 0.332 | **1.46x** |
| 2048 | 1.078 | 0.500 | **2.16x** | 1.087 | 0.601 | **1.81x** |
| 4096 | 2.633 | 1.144 | **2.30x** | 3.017 | 1.282 | **2.35x** |
| 8192 | 7.933 | 2.712 | **2.93x** | 7.933 | 4.003 | **1.98x** |

# Profiling Comparison (Prompt Phase)

**Summary**: Switching from `flash_fwd_splitkv_kernel` to the standard `flash_fwd_kernel` for the prompt phase (SeqLen=2048) results in a **~3x reduction in attention kernel latency** and a **~2x improvement in total operator latency**.

## 1. Packed QKV

**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32, kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **639.3 us** | **287.0 us** | **2.23x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.10 us | `flash_fwd_kernel`<br>187.70 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.71 us | `UnpackQKVWithRoPEAndAppendKV`: 32.44 us<br>`GetSequenceLengths`: 1.63 us | *Fused ops added* |

> **Note**: The Treatment introduces a fused `UnpackQKVWithRoPEAndAppendKV` kernel that performs the necessary pre-processing. Despite this added cost (~29 us), the gain from using the efficient `flash_fwd_kernel` instead of `flash_fwd_splitkv_kernel` yields a significant net speedup.

## 2. Dense (Separated QKV)

**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32, kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **0.6468 ms** | **0.3226 ms** | **2.00x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.25 us | `flash_fwd_kernel`<br>184.29 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.68 us | `RotaryEmbeddingBSNH`: 48.94 us<br>`ConcatNewToPastKVFused`: 13.04 us<br>`GetSequenceLengths`: 1.52 us | *See below* |

> **Note**: As in the Packed case, the switch to the standard Flash Attention forward kernel drives the improvement. The pre-processing is handled by `RotaryEmbeddingBSNH` and `ConcatNewToPastKVFused` in the treatment.