
Sync with Microsoft ONNX Runtime - 14/01/2026 #902

Merged
ankitm3k merged 27 commits into ovep-develop from sync_msft_14012026
Jan 14, 2026

Conversation

@Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

nieubank and others added 27 commits January 9, 2026 03:04
### Description
Version update for security fixes.



### Motivation and Context
Version update for security fixes.
…icrosoft#25865)

### Description

- Case-2 LPBQ pattern omits the QuantizeLinear node in the LPBQ packing
pattern.
- Modify the LPBQ fusion logic in the QNN EP, implemented for Gemm and
MatMul nodes, to gracefully handle the optional QuantizeLinear node in the
LPBQ packing pattern.
- Add unit tests to verify Case-2 LPBQ pattern fusion for Gemm and
MatMul nodes.



### Motivation and Context
- The QuantizeLinear node in the LowPowerBlockQuantization (LPBQ) encoding
packing pattern can be optional, as it keeps the weights in an INT datatype
and further reduces the size of the model.
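The optional-node handling described above can be sketched as a pattern match that tolerates a missing trailing node. This is a hypothetical illustration only: the op sequence and function name are made up, not the actual QNN EP fusion code.

```python
# Hypothetical sketch of tolerating the optional QuantizeLinear when
# matching an LPBQ packing pattern (the op sequence and names are
# illustrative only, not the actual QNN EP fusion implementation).
def match_lpbq(ops):
    core = ["DequantizeLinear", "Reshape", "MatMul"]  # made-up core pattern
    if ops[:len(core)] != core:
        return None                      # pattern does not match at all
    matched = list(core)
    # Case-2: the trailing QuantizeLinear may be absent.
    if len(ops) > len(core) and ops[len(core)] == "QuantizeLinear":
        matched.append("QuantizeLinear")
    return matched

case1 = match_lpbq(["DequantizeLinear", "Reshape", "MatMul", "QuantizeLinear"])
case2 = match_lpbq(["DequantizeLinear", "Reshape", "MatMul"])
```

Both cases fuse; the matcher simply records whether the optional node was present.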

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
### Description
The CudaPinned allocator's memory type was hardcoded to DeviceAllocator.
This PR allows choosing the memory type for the CudaPinned allocator
between DeviceAllocator and ArenaAllocator.

### Motivation and Context
Fixed issue microsoft#26887

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#26920)

## Summary

This PR significantly improves GroupQueryAttention (GQA) performance on
CUDA by fusing multiple kernel launches, improving memory access
patterns, and cleaning up sequence length semantics.

## Key Changes

### 1. Fused Kernels for Reduced Launch Overhead

| New Kernel | Operations Fused | Kernels Saved |
|------------|------------------|---------------|
| `UnpackQKVWithRoPEAndAppendKV` | Unpack packed QKV + RoPE Q/K + KV cache append | 4-5 |
| `ConcatNewToPastKVFused` | K append + V append (separate buffer mode) | 1 |
| `ConcatKVInPlaceFused` | K append + V append (shared buffer mode) | 1 |

### 2. New `RotaryDispatcher` Template (`rotary_common.cuh`)

Reusable RoPE implementation for fused kernels supporting:
- `float`, `half`, `BFloat16` element types
- `float2`, `float4` vector types
- Interleaved and half-split rotation modes

### 3. Sequence Length Semantics Cleanup

**Before:** Confusing `seqlens_k` / `seqlens_k_buff` with overloaded
meanings.

**After:** Clear separation:
- `past_seq_lens` - offset where new tokens are appended
- `total_seq_lens` - total valid tokens after append
- `padded_seq_lens` - padded length for first prompt masking
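The relationship between the three buffers can be illustrated with made-up per-batch values (this is a sketch of the semantics described above, not the kernel code):

```python
# Illustrative relationship between the three sequence-length buffers,
# one entry per batch item (all numbers are made up):
past_seq_lens = [0, 512]           # offset where new tokens are appended
new_token_counts = [1024, 1024]    # tokens appended this step
total_seq_lens = [p + n for p, n in zip(past_seq_lens, new_token_counts)]
# During first-prompt masking, padded_seq_lens is >= total_seq_lens:
padded_seq_lens = [max(total_seq_lens)] * len(total_seq_lens)
```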

### 4. FlashAttention Fast Decode Path

New optimized path for token generation (`sequence_length == 1`, shared
buffer):
- Bypasses `GetSequenceLengths` kernel
- Passes `past_seq_lens` directly to Flash Attention
- Controlled by `ORT_DISABLE_FLASH_DECODE` env var

### 5. Integer Overflow Prevention

All KV cache index calculations use `int64_t` to handle large `batch *
heads * seq * head_size` products.
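A quick numerical check shows why 64-bit indexing matters here; the shape values below are illustrative, not taken from a specific model:

```python
import numpy as np

# A realistic KV-cache shape yields more elements than a signed 32-bit
# index can address, so the index math must be done in int64.
batch, heads, seq, head_size = 16, 32, 65536, 128
total = batch * heads * seq * head_size            # 4_294_967_296 elements
exact = np.int64(total)                            # holds the true value
wrapped = np.array([total], dtype=np.int64).astype(np.int32)[0]  # silently wraps
```

Here the int32 cast wraps the count all the way to 0, which is exactly the class of bug the `int64_t` change prevents.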

### 6. BFloat16 Vectorization

Added `float4` (8 elements) vectorized path for BFloat16 in
`ConcatTensorToTensor`.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ORT_DISABLE_FLASH_DECODE` | `false` | Disable fast decode optimization |
| `ORT_DISABLE_FUSED_KV` | `false` | Use unfused K/V append kernels |
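Since these are process environment variables, they would be set before the session is created so the EP sees them at setup time. The snippet below is a sketch; treating `"1"` as the truthy value is an assumption about how the flags are parsed:

```python
import os

# Sketch: toggle the switches from the table above before creating the
# InferenceSession. Treating "1" as enabled is an assumption about the
# flag parsing, not confirmed behavior.
os.environ["ORT_DISABLE_FLASH_DECODE"] = "1"   # opt out of fast decode
os.environ.setdefault("ORT_DISABLE_FUSED_KV", "0")

flash_decode_disabled = os.environ["ORT_DISABLE_FLASH_DECODE"] == "1"
```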

## Test Changes

### Improved Test Coverage Strategy

Restructured `gqa_cuda_prompt_test_cases()` and
`gqa_cuda_past_test_cases()` to explicitly iterate over kernel code path
parameters:

```python
# NEW: Primary iteration over kernel code paths
for h in h_sizes_to_test:
    for packed in packed_opts:
        for rotary, rotary_interleaved in rotary_opts:
            for share_buffer in share_buffer_opts:
                # Secondary params (batch, seq, heads) rotate via modulo
```
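The modulo rotation in the snippet above can be fleshed out as a runnable sketch (parameter values are illustrative, not the actual test matrix):

```python
from itertools import product

# Minimal sketch of the iteration strategy: primary loops enumerate every
# kernel code-path combination, while secondary parameters (here, batch)
# rotate via modulo instead of a full cross-product, keeping the suite small.
h_sizes = [128, 40]
packed_opts = [False, True]
rotary_opts = [(False, False), (True, False), (True, True)]
share_buffer_opts = [False, True]
batches = [1, 3, 8]  # secondary parameter pool (illustrative values)

cases = []
for i, (h, packed, rotary, share) in enumerate(
        product(h_sizes, packed_opts, rotary_opts, share_buffer_opts)):
    batch = batches[i % len(batches)]  # rotate rather than cross-product
    cases.append((h, packed, rotary, share, batch))
```

Every primary combination is covered exactly once (2 x 2 x 3 x 2 = 24 cases) while the batch values cycle through the pool.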

| Mode | Before | After |
|------|--------|-------|
| Pipeline | 16 tests, 4/12 combos | 42 tests, 8/12 combos |
| Comprehensive | 81 tests, 4/12 combos | 178 tests, 12/12 combos |

### New Test Parameters

- Added `seqs = [(1, 1)]` for edge case testing
- Added `heads = [(3, 1)]` for non-standard GQA ratios
- Added `h_sizes = [40]` for non-power-of-2 head sizes (tests rotary
skip logic)

### New Test Configurations

- `share_buffer` config option (tests both buffer modes)
- `has_position_ids` testing on CUDA
- Padding prompt parity test
- Fused vs unfused kernel parity tests (`TestFusedKernelParity`)
- Decoding from empty cache test case `(1, 1)`

## Files Changed

**Core:**
- `group_query_attention_impl.cu` - Main implementation refactoring
- `attention_kv_cache.cu` - Fused append kernels
- `flash_api.cc` - Packed QKV stride handling

**New:**
- `rotary_common.cuh` - Reusable RoPE dispatcher

**Tests:**
- `test_gqa.py` - Extended test coverage

## Performance

For decoding or subsequent prompts, we still use the original flash
attention kernel, so performance is almost the same as the baseline. Here
we only show the results for the first prompt.

Below are results of benchmark_gqa.py on an H200 GPU. Note that the latency
is measured from an ONNX model containing a GQA node, so it includes extra
cost; the kernel speedup can be larger (see the profiling results below).

### prompt-sm90-Llama3-8B-b1-h32_8x128-float16
**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32,
kv_heads=8, head_size=128, dtype=float16, gpu=H200`

Dense means Q, K and V are separate inputs. Packed means Q, K and V are
packed into one input.

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.470 | 0.277 | **1.70x** | 0.468 | 0.320 | **1.46x** |
| 2048 | 1.001 | 0.517 | **1.94x** | 0.990 | 0.590 | **1.68x** |
| 4096 | 2.691 | 1.174 | **2.29x** | 1.504 | 1.242 | **1.21x** |
| 8192 | 7.780 | 2.292 | **3.39x** | 7.933 | 4.004 | **1.98x** |

### prompt-sm90-Llama3-8B-b1-h32_8x128-bfloat16
**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32,
kv_heads=8, head_size=128, dtype=bfloat16, gpu=H200`

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.477 | 0.274 | **1.74x** | 0.486 | 0.332 | **1.46x** |
| 2048 | 1.078 | 0.500 | **2.16x** | 1.087 | 0.601 | **1.81x** |
| 4096 | 2.633 | 1.144 | **2.30x** | 3.017 | 1.282 | **2.35x** |
| 8192 | 7.933 | 2.712 | **2.93x** | 7.933 | 4.003 | **1.98x** |

# Profiling Comparison (Prompt Phase)

**Summary**:
Switching from `flash_fwd_splitkv_kernel` to standard `flash_fwd_kernel`
for the prompt phase (SeqLen=2048) results in a **~3x reduction in
attention kernel latency** and a **~2x improvement in total operator
latency**.

## 1. Packed QKV
**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32,
kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **639.3 us** | **287.0 us** | **2.23x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.10 us | `flash_fwd_kernel`<br>187.70 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.71 us | `UnpackQKVWithRoPEAndAppendKV`: 32.44 us<br>`GetSequenceLengths`: 1.63 us | *Fused ops added* |

> **Note**: The Treatment implementation introduces a fused
`UnpackQKVWithRoPEAndAppendKV` kernel which performs necessary
pre-processing. Despite this added cost (~29 us), the massive gain from
using the efficient `flash_fwd_kernel` instead of
`flash_fwd_splitkv_kernel` yields a significant net speedup.

## 2. Dense (Separated QKV)
**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32,
kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **0.6468 ms** | **0.3226 ms** | **2.00x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.25 us | `flash_fwd_kernel`<br>184.29 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.68 us | `RotaryEmbeddingBSNH`: 48.94 us<br>`ConcatNewToPastKVFused`: 13.04 us<br>`GetSequenceLengths`: 1.52 us | *See below* |

> **Note**: Similar to the Packed case, the switch to the standard Flash
Attention forward kernel drives the performance improvement. The
pre-processing is handled by `RotaryEmbeddingBSNH` and
`ConcatNewToPastKVFused` in the treatment.
…osoft#26927)

### Description
Adds APIs to enable plugin EPs to create and register kernels for
control flow operators (If, Loop, and Scan). The implementation provides
ORT-managed kernel implementations that handle subgraph execution while
allowing EPs to provide device-specific helper functions for operations
like tensor concatenation and transposition.

Key changes:

- Adds four EP API functions: `CreateIfKernel`, `CreateLoopKernel`,
`CreateScanKernel`, and `ReleaseKernelImpl` for creating control flow
kernel implementations
- Introduces public helper structures (`OrtLoopKernelHelper`,
`OrtScanKernelHelper`) that EPs implement to provide device-specific
operations
- Updates the example kernel-based EP with kernel registrations for all
control flow operators and adds corresponding test models.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
## Summary
This PR updates the Flash Attention implementation in ONNX Runtime,
syncing with newer kernel sources in
https://github.com/Dao-AILab/flash-attention, and extending the internal
API to support additional features required for advanced caching
scenarios. It also aligns specific kernels with the official
implementation.

## Changes
- **Flash Attention Kernels**: Updated/Added Flash Attention forward
kernels and headers in
`onnxruntime/contrib_ops/cuda/bert/flash_attention/`.
- **API Extension**: Updated `mha_fwd` and `mha_fwd_kvcache` in
`flash_api.h` and `flash_api.cc` to accept two new optional parameters:
  - `cache_batch_idx`: Indices to index into the KV cache (support for
    non-contiguous batch indices).
  - `leftpad_k`: Support for left-padding in the key sequence.
- **Alignment & Fixes**:
- **Cleanup**: Removed redundant `kInfinity` definition in
`flash_fwd_kernel.h`.
- **Includes**: Added missing
`<core/providers/cuda/shared_inc/cuda_call.h>` in
`flash_fwd_launch_template.h`.
- **Integration**: Updated `group_query_attention_impl.cu` to align with
the new `mha_fwd_kvcache` signature.
- **Build Configuration**: Adjusted `onnxruntime_providers_cpu.cmake` to
update the exclusion list for Flash Attention kernels in quick build
mode.

## Implementation Details
- The `run_mha_fwd` helper now checks if `cache_batch_idx` is provided
alongside `k_new` to determine if the split kernel should be forced.
- New parameters are propagated through the call stack to the underlying
Flash Attention kernels.
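The dispatch rule described above can be expressed as a small predicate. This is a hedged sketch only: the real logic lives in `flash_api.cc`, and only the argument names come from the PR text.

```python
# Hedged sketch of the run_mha_fwd dispatch rule described above: force
# the split kernel when cache_batch_idx accompanies new key data
# (predicate shape is illustrative, not the actual C++ code).
def force_split_kernel(cache_batch_idx, k_new):
    return cache_batch_idx is not None and k_new is not None
```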
### Description
<!-- Describe your changes. -->
1. platform.cpp was missing the inclusion of sys/auxv.h (for elf_aux_info)
and machine/cpu.h (for PPC_FEATURE2_ARCH_3_00). I missed that in my
previous commit.
2. As on AIX, __vector int32_t is not defined and __vector int
needs to be used.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes build on FreeBSD / powerpc64le platform.
…icrosoft#26303)

The original check enforced that both present_key and past_key must
be present. But with IO binding there may be an issue: past_key can
be nullptr even when present_key is allocated. In reality, the kernel
should just do the computation when it has the data, or when the output
is requested.
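The relaxed rule can be illustrated as a simple predicate; the names follow the PR text, but the predicate itself is a simplification, not the actual kernel check:

```python
# Illustrative version of the relaxed validation: compute when input data
# exists or the output is requested, instead of requiring past_key and
# present_key together (a simplification, not the real kernel code).
def should_compute_kv(past_key, present_key):
    has_data = past_key is not None
    output_requested = present_key is not None
    return has_data or output_requested
```

The IO-binding case that previously failed (past_key absent, present_key allocated) now simply computes.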

---------

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
The coreml gather MLProgram operator supports fp16, but the check was
missing in `GatherOpBuilder::HasSupportedInputsImpl()`


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We use `Gather` in LeelaChessZero.

---------

Co-authored-by: borg323 <borg323@users.noreply.github.com>
…oft#26696)

### Description
When multiple devices are provided in `AppendExecutionProviders_V2`,
default to the NPU device, instead of picking the last device in the
list.
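The selection policy can be sketched as follows; the device records are illustrative stand-ins, not the actual `AppendExecutionProviders_V2` device structures:

```python
# Sketch of the default-device policy described above: prefer an NPU
# entry when multiple devices are provided, otherwise fall back to the
# last device in the list (device records are illustrative).
def pick_default_device(devices):
    for d in devices:
        if d["type"] == "NPU":
            return d
    return devices[-1]

mixed = [{"type": "GPU"}, {"type": "NPU"}, {"type": "CPU"}]
no_npu = [{"type": "GPU"}, {"type": "CPU"}]
```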
### Description
<!-- Describe your changes. -->

NOTE: Need microsoft#26597

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This pull-request removes the:

- onnxruntime_providers_cuda.dll
- onnxruntime_providers_qnn.dll
- All the QNN binaries

from the FL nuget package.
### Description
- Support pre-opset11 `Pad` in the QNN op builder.
- Add GPU backend tests for `Pad`.

### Motivation and Context
- Enables `Pad` translation in models using older opsets.
### Description
- Transposes are inserted for Softmax with axis != output_rank-1 for the
HTP backend.
- The GPU backend also has this requirement on the axis param, so this
change enables the layout transformation for the GPU as well.

### Motivation and Context
- Enables more models with GPU backend.
### Description
Enable CUDA Graph support by default.



### Motivation and Context
We wanted CUDA Graph support on by default for the NV TRT-RTX EP.
Added the use of RTX graph capture and removed the external-access
checks for the same.
Made the compute capability kCURRENT by default.

This improves performance and benchmarking; based on the current state,
most use cases today build and run on the same device or same SM.
cases today to be build and run on same device or same SM.
- Updated stale issue policy to include 'no stale' label checks and
modified reply message.
- Do not reopen closed stale issues automatically, as it is just noisy.
Instead, we could encourage issue owners to reopen manually or create
new issues.
### Description
 - Add standalone RMSNorm op translation in QNN EP
 - Add unit tests



### Motivation and Context
- This fixes the CPU fallback of the ONNX RMSNormalization operator when
running inference using QNN EP.
### Description
<!-- Describe your changes. -->

`js/` linting is already done in the Web CI pipeline, so this removes the
redundant check as it is signaling false positives.

Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
### Description

Update Qnn default version to 2.42.0.251225

Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
### Description
- Create a new class to handle HTP power config updates.
- Only update if there are changes to power config settings.
- On dynamic HTP perf mode updates, disable DSPQ polling if the perf mode
is not burst.

### Motivation and Context
Currently, if a session has set the performance mode to burst then
changed the performance mode to anything else, DSPQ polling will be
enabled and never disabled. This change is to allow disabling of DSPQ
polling when the performance mode is not burst, even on updates.
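The two rules, skip redundant updates and tie polling to burst mode, can be modeled with a small class. The class and method names here are made up for the sketch, not the actual QNN EP code:

```python
# Illustrative model of the power-config rules above: apply an update only
# when the setting changed, and keep DSPQ polling enabled only while the
# perf mode is burst (names are made up for this sketch).
class HtpPowerConfig:
    def __init__(self):
        self.perf_mode = None
        self.dspq_polling = False

    def update(self, perf_mode):
        if perf_mode == self.perf_mode:
            return False                       # no change: skip the update
        self.perf_mode = perf_mode
        self.dspq_polling = (perf_mode == "burst")
        return True
```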

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
microsoft#26622)

### Description
- Adds comprehensive logging to the QNN EP that displays detailed
information about all the graph inputs, outputs and initializers.
- Information is dumped into a JSON file during graph composition, only
if we are dumping the JSON QNN graph.


### Motivation and Context
- Useful for debugging and understanding if we are hitting peak memory
during inference
Fix VS2026 build.

### Description
<!-- Describe your changes. -->
Fix some correctness issues.
Fix VS 2026 build.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ft#26995)

### Description
Remove the limitations on using onnxruntime_USE_KLEIDIAI in a Windows on
Arm environment.

### Motivation and Context
Historically the KleidiAI build had difficulties with the Microsoft
compiler (MSVC) for Arm environments. As a result, a hard exclusion of
onnxruntime_USE_KLEIDIAI under MSVC was added and subsequently
consolidated into cmake/CMakeLists.txt by
[this](microsoft@2e8a45a)
commit.

The problems in KleidiAI were resolved in their v1.14.0 release. v1.15.0
was introduced via
[this](microsoft@8fe4804)
commit. This PR removes the limitation, allowing MSVC to be used to
compile with onnxruntime_USE_KLEIDIAI enabled in a Windows on Arm
environment.

In addition, there were legacy restrictions in CMakeLists.txt relating to
the DOTPROD and I8MM CPU features. These are already handled by the
KleidiAI build.
### Verification
Following the Windows build instructions
[here](https://onnxruntime.ai/docs/build/inferencing.html#windows)
KleidiAI and its associated logic in MLAS will be built when ARM64 is
detected.

**Note**: As these build instructions make clear, MSVC must include
support for ARM64, and both Python and CMake must be native ARM64.

Signed-off-by: Colm Donelan <colm.donelan@arm.com>
@Jaswanth51 Jaswanth51 requested a review from ankitm3k January 14, 2026 03:23
@ankitm3k ankitm3k merged commit cd20d5d into ovep-develop Jan 14, 2026
6 of 7 checks passed
@ankitm3k ankitm3k deleted the sync_msft_14012026 branch January 14, 2026 05:07