
Sync with Microsoft ONNX Runtime - 14/01/2026 #902

Merged
ankitm3k merged 27 commits into ovep-develop from sync_msft_14012026
Jan 14, 2026

Conversation

@Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

nieubank and others added 27 commits January 9, 2026 03:04
### Description
Version update for security fixes.



### Motivation and Context
Version update for security fixes.
…icrosoft#25865)

### Description

- Case-2 LPBQ pattern omits the QuantizeLinear node in the LPBQ packing
pattern.
- Modify the LPBQ fusion logic in the QNN EP, implemented for Gemm and
MatMul nodes, to gracefully handle the optional QuantizeLinear node in the
LPBQ packing pattern.
- Add unit tests to verify Case-2 LPBQ pattern fusion for Gemm and
MatMul nodes.



### Motivation and Context
- The QuantizeLinear node in the LowPowerBlockQuantization (LPBQ) encoding
packing pattern can be optional, as it keeps the weights in an INT datatype
and further reduces the size of the model.
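The optional-node handling described above can be sketched as a pattern match that tolerates a missing trailing node. This is a hypothetical illustration only: the op sequence and function name are made up, not the actual QNN EP fusion code.

```python
# Hypothetical sketch of tolerating the optional QuantizeLinear when
# matching an LPBQ packing pattern (the op sequence and names are
# illustrative only, not the actual QNN EP fusion implementation).
def match_lpbq(ops):
    core = ["DequantizeLinear", "Reshape", "MatMul"]  # made-up core pattern
    if ops[:len(core)] != core:
        return None                      # pattern does not match at all
    matched = list(core)
    # Case-2: the trailing QuantizeLinear may be absent.
    if len(ops) > len(core) and ops[len(core)] == "QuantizeLinear":
        matched.append("QuantizeLinear")
    return matched

case1 = match_lpbq(["DequantizeLinear", "Reshape", "MatMul", "QuantizeLinear"])
case2 = match_lpbq(["DequantizeLinear", "Reshape", "MatMul"])
```

Both cases fuse; the matcher simply records whether the optional node was present.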

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
### Description
The CudaPinned allocator's memory type was hardcoded to DeviceAllocator.
This PR allows choosing the memory type for the CudaPinned allocator
between DeviceAllocator and ArenaAllocator.

### Motivation and Context
Fixed issue microsoft#26887

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#26920)

## Summary

This PR significantly improves GroupQueryAttention (GQA) performance on
CUDA by fusing multiple kernel launches, improving memory access
patterns, and cleaning up sequence length semantics.

## Key Changes

### 1. Fused Kernels for Reduced Launch Overhead

| New Kernel | Operations Fused | Kernels Saved |
|------------|------------------|---------------|
| `UnpackQKVWithRoPEAndAppendKV` | Unpack packed QKV + RoPE Q/K + KV cache append | 4-5 |
| `ConcatNewToPastKVFused` | K append + V append (separate buffer mode) | 1 |
| `ConcatKVInPlaceFused` | K append + V append (shared buffer mode) | 1 |

### 2. New `RotaryDispatcher` Template (`rotary_common.cuh`)

Reusable RoPE implementation for fused kernels supporting:
- `float`, `half`, `BFloat16` element types
- `float2`, `float4` vector types
- Interleaved and half-split rotation modes

### 3. Sequence Length Semantics Cleanup

**Before:** Confusing `seqlens_k` / `seqlens_k_buff` with overloaded
meanings.

**After:** Clear separation:
- `past_seq_lens` - offset where new tokens are appended
- `total_seq_lens` - total valid tokens after append
- `padded_seq_lens` - padded length for first prompt masking
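The relationship between the three buffers can be illustrated with made-up per-batch values (this is a sketch of the semantics described above, not the kernel code):

```python
# Illustrative relationship between the three sequence-length buffers,
# one entry per batch item (all numbers are made up):
past_seq_lens = [0, 512]           # offset where new tokens are appended
new_token_counts = [1024, 1024]    # tokens appended this step
total_seq_lens = [p + n for p, n in zip(past_seq_lens, new_token_counts)]
# During first-prompt masking, padded_seq_lens is >= total_seq_lens:
padded_seq_lens = [max(total_seq_lens)] * len(total_seq_lens)
```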

### 4. FlashAttention Fast Decode Path

New optimized path for token generation (`sequence_length == 1`, shared
buffer):
- Bypasses `GetSequenceLengths` kernel
- Passes `past_seq_lens` directly to Flash Attention
- Controlled by `ORT_DISABLE_FLASH_DECODE` env var

### 5. Integer Overflow Prevention

All KV cache index calculations use `int64_t` to handle large `batch *
heads * seq * head_size` products.
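A quick numerical check shows why 64-bit indexing matters here; the shape values below are illustrative, not taken from a specific model:

```python
import numpy as np

# A realistic KV-cache shape yields more elements than a signed 32-bit
# index can address, so the index math must be done in int64.
batch, heads, seq, head_size = 16, 32, 65536, 128
total = batch * heads * seq * head_size            # 4_294_967_296 elements
exact = np.int64(total)                            # holds the true value
wrapped = np.array([total], dtype=np.int64).astype(np.int32)[0]  # silently wraps
```

Here the int32 cast wraps the count all the way to 0, which is exactly the class of bug the `int64_t` change prevents.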

### 6. BFloat16 Vectorization

Added `float4` (8 elements) vectorized path for BFloat16 in
`ConcatTensorToTensor`.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ORT_DISABLE_FLASH_DECODE` | `false` | Disable fast decode optimization |
| `ORT_DISABLE_FUSED_KV` | `false` | Use unfused K/V append kernels |
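Since these are process environment variables, they would be set before the session is created so the EP sees them at setup time. The snippet below is a sketch; treating `"1"` as the truthy value is an assumption about how the flags are parsed:

```python
import os

# Sketch: toggle the switches from the table above before creating the
# InferenceSession. Treating "1" as enabled is an assumption about the
# flag parsing, not confirmed behavior.
os.environ["ORT_DISABLE_FLASH_DECODE"] = "1"   # opt out of fast decode
os.environ.setdefault("ORT_DISABLE_FUSED_KV", "0")

flash_decode_disabled = os.environ["ORT_DISABLE_FLASH_DECODE"] == "1"
```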

## Test Changes

### Improved Test Coverage Strategy

Restructured `gqa_cuda_prompt_test_cases()` and
`gqa_cuda_past_test_cases()` to explicitly iterate over kernel code path
parameters:

```python
# NEW: Primary iteration over kernel code paths
for h in h_sizes_to_test:
    for packed in packed_opts:
        for rotary, rotary_interleaved in rotary_opts:
            for share_buffer in share_buffer_opts:
                # Secondary params (batch, seq, heads) rotate via modulo
```
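The modulo rotation in the snippet above can be fleshed out as a runnable sketch (parameter values are illustrative, not the actual test matrix):

```python
from itertools import product

# Minimal sketch of the iteration strategy: primary loops enumerate every
# kernel code-path combination, while secondary parameters (here, batch)
# rotate via modulo instead of a full cross-product, keeping the suite small.
h_sizes = [128, 40]
packed_opts = [False, True]
rotary_opts = [(False, False), (True, False), (True, True)]
share_buffer_opts = [False, True]
batches = [1, 3, 8]  # secondary parameter pool (illustrative values)

cases = []
for i, (h, packed, rotary, share) in enumerate(
        product(h_sizes, packed_opts, rotary_opts, share_buffer_opts)):
    batch = batches[i % len(batches)]  # rotate rather than cross-product
    cases.append((h, packed, rotary, share, batch))
```

Every primary combination is covered exactly once (2 x 2 x 3 x 2 = 24 cases) while the batch values cycle through the pool.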

| Mode | Before | After |
|------|--------|-------|
| Pipeline | 16 tests, 4/12 combos | 42 tests, 8/12 combos |
| Comprehensive | 81 tests, 4/12 combos | 178 tests, 12/12 combos |

### New Test Parameters

- Added `seqs = [(1, 1)]` for edge case testing
- Added `heads = [(3, 1)]` for non-standard GQA ratios
- Added `h_sizes = [40]` for non-power-of-2 head sizes (tests rotary
skip logic)

### New Test Configurations

- `share_buffer` config option (tests both buffer modes)
- `has_position_ids` testing on CUDA
- Padding prompt parity test
- Fused vs unfused kernel parity tests (`TestFusedKernelParity`)
- Decoding from empty cache test case `(1, 1)`

## Files Changed

**Core:**
- `group_query_attention_impl.cu` - Main implementation refactoring
- `attention_kv_cache.cu` - Fused append kernels
- `flash_api.cc` - Packed QKV stride handling

**New:**
- `rotary_common.cuh` - Reusable RoPE dispatcher

**Tests:**
- `test_gqa.py` - Extended test coverage

## Performance

For decoding or subsequent prompts, we still use the original flash
attention kernel, so performance is almost the same as the baseline. Here
we only show the results for the first prompt.

Below are results of benchmark_gqa.py on an H200 GPU. Note that the latency
is measured from an ONNX model containing a GQA node, so it includes extra
cost; the kernel speedup can be larger (see the profiling results below).

### prompt-sm90-Llama3-8B-b1-h32_8x128-float16
**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32,
kv_heads=8, head_size=128, dtype=float16, gpu=H200`

Dense means Q, K and V are separate inputs. Packed means Q, K and V are
packed into one input.

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.470 | 0.277 | **1.70x** | 0.468 | 0.320 | **1.46x** |
| 2048 | 1.001 | 0.517 | **1.94x** | 0.990 | 0.590 | **1.68x** |
| 4096 | 2.691 | 1.174 | **2.29x** | 1.504 | 1.242 | **1.21x** |
| 8192 | 7.780 | 2.292 | **3.39x** | 7.933 | 4.004 | **1.98x** |

### prompt-sm90-Llama3-8B-b1-h32_8x128-bfloat16
**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32,
kv_heads=8, head_size=128, dtype=bfloat16, gpu=H200`

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.477 | 0.274 | **1.74x** | 0.486 | 0.332 | **1.46x** |
| 2048 | 1.078 | 0.500 | **2.16x** | 1.087 | 0.601 | **1.81x** |
| 4096 | 2.633 | 1.144 | **2.30x** | 3.017 | 1.282 | **2.35x** |
| 8192 | 7.933 | 2.712 | **2.93x** | 7.933 | 4.003 | **1.98x** |

# Profiling Comparison (Prompt Phase)

**Summary**:
Switching from `flash_fwd_splitkv_kernel` to standard `flash_fwd_kernel`
for the prompt phase (SeqLen=2048) results in a **~3x reduction in
attention kernel latency** and a **~2x improvement in total operator
latency**.

## 1. Packed QKV
**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32,
kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **639.3 us** | **287.0 us** | **2.23x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.10 us | `flash_fwd_kernel`<br>187.70 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.71 us | `UnpackQKVWithRoPEAndAppendKV`: 32.44 us<br>`GetSequenceLengths`: 1.63 us | *Fused ops added* |

> **Note**: The Treatment implementation introduces a fused
`UnpackQKVWithRoPEAndAppendKV` kernel which performs necessary
pre-processing. Despite this added cost (~29 us), the massive gain from
using the efficient `flash_fwd_kernel` instead of
`flash_fwd_splitkv_kernel` yields a significant net speedup.

## 2. Dense (Separated QKV)
**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32,
kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **0.6468 ms** | **0.3226 ms** | **2.00x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.25 us | `flash_fwd_kernel`<br>184.29 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.68 us | `RotaryEmbeddingBSNH`: 48.94 us<br>`ConcatNewToPastKVFused`: 13.04 us<br>`GetSequenceLengths`: 1.52 us | *See below* |

> **Note**: Similar to the Packed case, the switch to the standard Flash
Attention forward kernel drives the performance improvement. The
pre-processing is handled by `RotaryEmbeddingBSNH` and
`ConcatNewToPastKVFused` in the treatment.
…osoft#26927)

### Description
Adds APIs to enable plugin EPs to create and register kernels for
control flow operators (If, Loop, and Scan). The implementation provides
ORT-managed kernel implementations that handle subgraph execution while
allowing EPs to provide device-specific helper functions for operations
like tensor concatenation and transposition.

Key changes:

- Adds four EP API functions: `CreateIfKernel`, `CreateLoopKernel`,
`CreateScanKernel`, and `ReleaseKernelImpl` for creating control flow
kernel implementations
- Introduces public helper structures (`OrtLoopKernelHelper`,
`OrtScanKernelHelper`) that EPs implement to provide device-specific
operations
- Updates the example kernel-based EP with kernel registrations for all
control flow operators and adds corresponding test models.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
## Summary
This PR updates the Flash Attention implementation in ONNX Runtime,
syncing with newer kernel sources in
https://github.com/Dao-AILab/flash-attention, and extending the internal
API to support additional features required for advanced caching
scenarios. It also aligns specific kernels with the official
implementation.

## Changes
- **Flash Attention Kernels**: Updated/Added Flash Attention forward
kernels and headers in
`onnxruntime/contrib_ops/cuda/bert/flash_attention/`.
- **API Extension**: Updated `mha_fwd` and `mha_fwd_kvcache` in
`flash_api.h` and `flash_api.cc` to accept two new optional parameters:
  - `cache_batch_idx`: Indices to index into the KV cache (support for
    non-contiguous batch indices).
  - `leftpad_k`: Support for left-padding in the key sequence.
- **Alignment & Fixes**:
- **Cleanup**: Removed redundant `kInfinity` definition in
`flash_fwd_kernel.h`.
- **Includes**: Added missing
`<core/providers/cuda/shared_inc/cuda_call.h>` in
`flash_fwd_launch_template.h`.
- **Integration**: Updated `group_query_attention_impl.cu` to align with
the new `mha_fwd_kvcache` signature.
- **Build Configuration**: Adjusted `onnxruntime_providers_cpu.cmake` to
update the exclusion list for Flash Attention kernels in quick build
mode.

## Implementation Details
- The `run_mha_fwd` helper now checks if `cache_batch_idx` is provided
alongside `k_new` to determine if the split kernel should be forced.
- New parameters are propagated through the call stack to the underlying
Flash Attention kernels.
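The dispatch rule described above can be expressed as a small predicate. This is a hedged sketch only: the real logic lives in `flash_api.cc`, and only the argument names come from the PR text.

```python
# Hedged sketch of the run_mha_fwd dispatch rule described above: force
# the split kernel when cache_batch_idx accompanies new key data
# (predicate shape is illustrative, not the actual C++ code).
def force_split_kernel(cache_batch_idx, k_new):
    return cache_batch_idx is not None and k_new is not None
```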
### Description
<!-- Describe your changes. -->
1. platform.cpp was missing the inclusion of sys/auxv.h (for elf_aux_info)
and machine/cpu.h (for PPC_FEATURE2_ARCH_3_00). I missed that in my
previous commit.
2. As on AIX, __vector int32_t is not defined and __vector int
needs to be used.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes build on FreeBSD / powerpc64le platform.
…icrosoft#26303)

The original check enforced that both present_key and past_key must
be present. But with IO binding there may be an issue: past_key can
be nullptr even when present_key is allocated. In reality, the kernel
should just do the computation when it has the data, or when the output
is requested.
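The relaxed rule can be illustrated as a simple predicate; the names follow the PR text, but the predicate itself is a simplification, not the actual kernel check:

```python
# Illustrative version of the relaxed validation: compute when input data
# exists or the output is requested, instead of requiring past_key and
# present_key together (a simplification, not the real kernel code).
def should_compute_kv(past_key, present_key):
    has_data = past_key is not None
    output_requested = present_key is not None
    return has_data or output_requested
```

The IO-binding case that previously failed (past_key absent, present_key allocated) now simply computes.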

---------

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
The coreml gather MLProgram operator supports fp16, but the check was
missing in `GatherOpBuilder::HasSupportedInputsImpl()`


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We use `Gather` in LeelaChessZero.

---------

Co-authored-by: borg323 <borg323@users.noreply.github.com>
…oft#26696)

### Description
When multiple devices are provided in `AppendExecutionProviders_V2`,
default to the NPU device, instead of picking the last device in the
list.
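The selection policy can be sketched as follows; the device records are illustrative stand-ins, not the actual `AppendExecutionProviders_V2` device structures:

```python
# Sketch of the default-device policy described above: prefer an NPU
# entry when multiple devices are provided, otherwise fall back to the
# last device in the list (device records are illustrative).
def pick_default_device(devices):
    for d in devices:
        if d["type"] == "NPU":
            return d
    return devices[-1]

mixed = [{"type": "GPU"}, {"type": "NPU"}, {"type": "CPU"}]
no_npu = [{"type": "GPU"}, {"type": "CPU"}]
```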
### Description
<!-- Describe your changes. -->

NOTE: Need microsoft#26597

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This pull-request removes the:

- onnxruntime_providers_cuda.dll
- onnxruntime_providers_qnn.dll
- All the QNN binaries

from the FL nuget package.
### Description
- Support pre-opset11 `Pad` in the QNN op builder.
- Add GPU backend tests for `Pad`.

### Motivation and Context
- Enables `Pad` translation in models using older opsets.
### Description
- Transposes are inserted for Softmax with axis != output_rank-1 for the
HTP backend.
- The GPU backend also has this requirement on the axis param, so this
change enables the layout transformation for the GPU as well.

### Motivation and Context
- Enables more models with GPU backend.
### Description
Enable CUDA Graph support by default.



### Motivation and Context
We wanted CUDA Graph support on by default for the NV TRT-RTX EP.
Added the use of RTX graph capture and removed the external-access
checks for the same.
Made the compute capability kCURRENT by default.

This improves performance and benchmarking; based on the current state,
most use cases today build and run on the same device or same SM.
cases today to be build and run on same device or same SM.
- Updated stale issue policy to include 'no stale' label checks and
modified reply message.
- Do not reopen closed stale issues automatically, as it is just noisy.
Instead, we could encourage issue owners to reopen manually or create
new issues.
### Description
 - Add standalone RMSNorm op translation in QNN EP
 - Add unit tests



### Motivation and Context
- This fixes the CPU fallback of the ONNX RMSNormalization operator when
running inference using QNN EP.
### Description
<!-- Describe your changes. -->

`js/` linting is already done in the Web CI pipeline, so this removes the
redundant check as it is signaling false positives.

Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
### Description

Update Qnn default version to 2.42.0.251225

Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
### Description
- Create a new class to handle HTP power config updates.
- Only update if there are changes to power config settings.
- On dynamic HTP perf mode updates, disable DSPQ polling if the perf mode
is not burst.

### Motivation and Context
Currently, if a session has set the performance mode to burst then
changed the performance mode to anything else, DSPQ polling will be
enabled and never disabled. This change is to allow disabling of DSPQ
polling when the performance mode is not burst, even on updates.
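The two rules, skip redundant updates and tie polling to burst mode, can be modeled with a small class. The class and method names here are made up for the sketch, not the actual QNN EP code:

```python
# Illustrative model of the power-config rules above: apply an update only
# when the setting changed, and keep DSPQ polling enabled only while the
# perf mode is burst (names are made up for this sketch).
class HtpPowerConfig:
    def __init__(self):
        self.perf_mode = None
        self.dspq_polling = False

    def update(self, perf_mode):
        if perf_mode == self.perf_mode:
            return False                       # no change: skip the update
        self.perf_mode = perf_mode
        self.dspq_polling = (perf_mode == "burst")
        return True
```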

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
microsoft#26622)

### Description
- Adds comprehensive logging to the QNN EP that displays detailed
information about all the graph inputs, outputs and initializers.
- Information is dumped into a JSON file during graph composition, only
if we are dumping the JSON QNN graph.


### Motivation and Context
- Useful for debugging and understanding if we are hitting peak memory
during inference
Fix VS2026 build.

### Description
<!-- Describe your changes. -->
Fix some correctness issues.
Fix VS 2026 build.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ft#26995)

### Description
Remove the limitations on using onnxruntime_USE_KLEIDIAI in a Windows on
Arm environment.

### Motivation and Context
Historically the KleidiAI build had difficulties with the Microsoft
compiler (MSVC) for Arm environments. As a result, a hard exclusion of
onnxruntime_USE_KLEIDIAI under MSVC was added and subsequently
consolidated into cmake/CMakeLists.txt by
[this](microsoft@2e8a45a)
commit.

The problems in KleidiAI were resolved in their v1.14.0 release. v1.15.0
was introduced via
[this](microsoft@8fe4804)
commit. This PR removes the limitation, allowing MSVC to be used to
compile with onnxruntime_USE_KLEIDIAI enabled in a Windows on Arm
environment.

In addition, there were legacy restrictions in CMakeLists.txt relating to
the DOTPROD and I8MM CPU features. These are already handled by the
KleidiAI build.
### Verification
Following the Windows build instructions
[here](https://onnxruntime.ai/docs/build/inferencing.html#windows)
KleidiAI and its associated logic in MLAS will be built when ARM64 is
detected.

**Note**: As these build instructions make clear, MSVC must include
support for ARM64, and both Python and CMake must be native ARM64.

Signed-off-by: Colm Donelan <colm.donelan@arm.com>
@Jaswanth51 Jaswanth51 requested a review from ankitm3k January 14, 2026 03:23
@ankitm3k ankitm3k merged commit cd20d5d into ovep-develop Jan 14, 2026
6 of 7 checks passed
@ankitm3k ankitm3k deleted the sync_msft_14012026 branch January 14, 2026 05:07