Sync with Microsoft ONNX Runtime - 14/01/2026 (#902)
Merged
ankitm3k merged 27 commits into ovep-develop on Jan 14, 2026
### Description
Version update for security fixes.

### Motivation and Context
Version update for security fixes.
(microsoft#25865)
### Description
- The Case-2 LPBQ pattern omits the QuantizeLinear node in the LPBQ packing pattern.
- Modify the LPBQ fusion logic in the QNN EP implemented for Gemm and MatMul nodes to gracefully handle the optional QuantizeLinear node in the LPBQ packing pattern.
- Add unit tests to verify Case-2 LPBQ pattern fusion for Gemm and MatMul nodes.

### Motivation and Context
- The QuantizeLinear node in the LowPowerBlockQuantization encoding packing pattern can be optional, as omitting it keeps the weights in an INT datatype and further helps reduce the size of the model.

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
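To illustrate the optional-node handling described above, here is a minimal sketch of matching a pattern with an optional trailing QuantizeLinear. This is not the actual QNN EP fusion code; the two-node `DequantizeLinear` prefix is a hypothetical simplification of the real LPBQ packing pattern.

```python
# Illustrative sketch (NOT the actual QNN EP fusion code): matching an LPBQ
# weight-packing pattern where the trailing QuantizeLinear node is optional.
def match_lpbq_pattern(ops):
    """Return (matched, has_quantize) for a simplified op-type sequence."""
    # Hypothetical simplification: the packing pattern is represented here
    # as two DequantizeLinear nodes (per-block scales + weights).
    if ops[:2] != ["DequantizeLinear", "DequantizeLinear"]:
        return (False, False)
    # Case-1 keeps a trailing QuantizeLinear; Case-2 omits it, leaving the
    # weights in an INT datatype to reduce model size.
    has_quantize = len(ops) > 2 and ops[2] == "QuantizeLinear"
    return (True, has_quantize)

case1 = ["DequantizeLinear", "DequantizeLinear", "QuantizeLinear"]  # Case-1
case2 = ["DequantizeLinear", "DequantizeLinear"]                    # Case-2
```

The key point is that the matcher treats the QuantizeLinear as optional rather than failing the whole fusion when it is absent.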
### Description
The CudaPinned allocator's memory type is hardcoded to DeviceAllocator. This PR allows choosing the memory type for the CudaPinned allocator between DeviceAllocator and ArenaAllocator.

### Motivation and Context
Fixes issue microsoft#26887

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(microsoft#26920)
## Summary
This PR significantly improves GroupQueryAttention (GQA) performance on CUDA by fusing multiple kernel launches, improving memory access patterns, and cleaning up sequence length semantics.

## Key Changes

### 1. Fused Kernels for Reduced Launch Overhead

| New Kernel | Operations Fused | Kernels Saved |
|------------|------------------|---------------|
| `UnpackQKVWithRoPEAndAppendKV` | Unpack packed QKV + RoPE Q/K + KV cache append | 4-5 |
| `ConcatNewToPastKVFused` | K append + V append (separate buffer mode) | 1 |
| `ConcatKVInPlaceFused` | K append + V append (shared buffer mode) | 1 |

### 2. New `RotaryDispatcher` Template (`rotary_common.cuh`)
Reusable RoPE implementation for fused kernels supporting:
- `float`, `half`, `BFloat16` element types
- `float2`, `float4` vector types
- Interleaved and half-split rotation modes

### 3. Sequence Length Semantics Cleanup
**Before:** Confusing `seqlens_k` / `seqlens_k_buff` with overloaded meanings.
**After:** Clear separation:
- `past_seq_lens` - offset where new tokens are appended
- `total_seq_lens` - total valid tokens after append
- `padded_seq_lens` - padded length for first prompt masking

### 4. FlashAttention Fast Decode Path
New optimized path for token generation (`sequence_length == 1`, shared buffer):
- Bypasses the `GetSequenceLengths` kernel
- Passes `past_seq_lens` directly to Flash Attention
- Controlled by the `ORT_DISABLE_FLASH_DECODE` env var

### 5. Integer Overflow Prevention
All KV cache index calculations use `int64_t` to handle large `batch * heads * seq * head_size` products.

### 6. BFloat16 Vectorization
Added a `float4` (8 elements) vectorized path for BFloat16 in `ConcatTensorToTensor`.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ORT_DISABLE_FLASH_DECODE` | `false` | Disable fast decode optimization |
| `ORT_DISABLE_FUSED_KV` | `false` | Use unfused K/V append kernels |

## Test Changes

### Improved Test Coverage Strategy
Restructured `gqa_cuda_prompt_test_cases()` and `gqa_cuda_past_test_cases()` to explicitly iterate over kernel code path parameters:

```python
# NEW: Primary iteration over kernel code paths
for h in h_sizes_to_test:
    for packed in packed_opts:
        for rotary, rotary_interleaved in rotary_opts:
            for share_buffer in share_buffer_opts:
                # Secondary params (batch, seq, heads) rotate via modulo
```

| Mode | Before | After |
|------|--------|-------|
| Pipeline | 16 tests, 4/12 combos | 42 tests, 8/12 combos |
| Comprehensive | 81 tests, 4/12 combos | 178 tests, 12/12 combos |

### New Test Parameters
- Added `seqs = [(1, 1)]` for edge case testing
- Added `heads = [(3, 1)]` for non-standard GQA ratios
- Added `h_sizes = [40]` for non-power-of-2 head sizes (tests rotary skip logic)

### New Test Configurations
- `share_buffer` config option (tests both buffer modes)
- `has_position_ids` testing on CUDA
- Padding prompt parity test
- Fused vs unfused kernel parity tests (`TestFusedKernelParity`)
- Decoding from empty cache test case `(1, 1)`

## Files Changed
**Core:**
- `group_query_attention_impl.cu` - Main implementation refactoring
- `attention_kv_cache.cu` - Fused append kernels
- `flash_api.cc` - Packed QKV stride handling

**New:**
- `rotary_common.cuh` - Reusable RoPE dispatcher

**Tests:**
- `test_gqa.py` - Extended test coverage

## Performance
For decoding or subsequent prompts, we still use the original flash attention kernel, so performance is almost the same as the baseline. Here we only show results for the first prompt. Below are results of benchmark_gqa.py on an H200 GPU. Note that the latency is measured from an ONNX model of a GQA node, so the latency includes extra cost.
The kernel speedup can be larger (see profiling results below).

### prompt-sm90-Llama3-8B-b1-h32_8x128-float16
**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32, kv_heads=8, head_size=128, dtype=float16, gpu=H200`

Dense means Q, K and V are separate inputs. Packed means Q, K and V are packed into one input.

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.470 | 0.277 | **1.70x** | 0.468 | 0.320 | **1.46x** |
| 2048 | 1.001 | 0.517 | **1.94x** | 0.990 | 0.590 | **1.68x** |
| 4096 | 2.691 | 1.174 | **2.29x** | 1.504 | 1.242 | **1.21x** |
| 8192 | 7.780 | 2.292 | **3.39x** | 7.933 | 4.004 | **1.98x** |

### prompt-sm90-Llama3-8B-b1-h32_8x128-bfloat16
**Configuration**: `batch=1, prompt (past_seq=0), num_heads=32, kv_heads=8, head_size=128, dtype=bfloat16, gpu=H200`

| Sequence Length | Dense Base (ms) | Dense Treat (ms) | **Dense Speedup** | Packed Base (ms) | Packed Treat (ms) | **Packed Speedup** |
| --------------: | --------------: | ---------------: | :---------------- | ---------------: | ----------------: | :----------------- |
| 1024 | 0.477 | 0.274 | **1.74x** | 0.486 | 0.332 | **1.46x** |
| 2048 | 1.078 | 0.500 | **2.16x** | 1.087 | 0.601 | **1.81x** |
| 4096 | 2.633 | 1.144 | **2.30x** | 3.017 | 1.282 | **2.35x** |
| 8192 | 7.933 | 2.712 | **2.93x** | 7.933 | 4.003 | **1.98x** |

# Profiling Comparison (Prompt Phase)
**Summary**: Switching from `flash_fwd_splitkv_kernel` to the standard `flash_fwd_kernel` for the prompt phase (SeqLen=2048) results in a **~3x reduction in attention kernel latency** and a **~2x improvement in total operator latency**.

## 1. Packed QKV
**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32, kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **639.3 us** | **287.0 us** | **2.23x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.10 us | `flash_fwd_kernel`<br>187.70 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.71 us | `UnpackQKVWithRoPEAndAppendKV`: 32.44 us<br>`GetSequenceLengths`: 1.63 us | *Fused ops added* |

> **Note**: The Treatment implementation introduces a fused `UnpackQKVWithRoPEAndAppendKV` kernel which performs necessary pre-processing. Despite this added cost (~29 us), the massive gain from using the efficient `flash_fwd_kernel` instead of `flash_fwd_splitkv_kernel` yields a significant net speedup.

## 2. Dense (Separated QKV)
**Configuration**: `batch=1, seq_len=2048, past_seq=0, num_heads=32, kv_heads=8, head_size=128`

| Metric | Baseline | Treatment | Delta |
| :--- | :--- | :--- | :--- |
| **Total Latency** | **0.6468 ms** | **0.3226 ms** | **2.00x Speedup** |
| **Attention Kernel** | `flash_fwd_splitkv_kernel`<br>567.25 us | `flash_fwd_kernel`<br>184.29 us | **3.08x Speedup** |
| **Helper Kernels** | `ConcatNewToPastKV`: 4.68 us | `RotaryEmbeddingBSNH`: 48.94 us<br>`ConcatNewToPastKVFused`: 13.04 us<br>`GetSequenceLengths`: 1.52 us | *See below* |

> **Note**: Similar to the Packed case, the switch to the standard Flash Attention forward kernel drives the performance improvement. The pre-processing is handled by `RotaryEmbeddingBSNH` and `ConcatNewToPastKVFused` in the treatment.
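A minimal sketch of how the environment variables above might be toggled for A/B benchmarking. The variables must be set before the CUDA session (and its kernels) is created; the accepted values ("1"/"0" vs. "true"/"false") are an assumption here, not confirmed by the PR text.

```python
import os

# Sketch: toggle the GQA kernel-path env vars described above for A/B runs.
# Assumption: "1"/"0" are accepted truthy/falsy values; set these before
# creating the ONNX Runtime CUDA session (session setup omitted).
def set_gqa_flags(disable_flash_decode=False, disable_fused_kv=False):
    os.environ["ORT_DISABLE_FLASH_DECODE"] = "1" if disable_flash_decode else "0"
    os.environ["ORT_DISABLE_FUSED_KV"] = "1" if disable_fused_kv else "0"

set_gqa_flags(disable_fused_kv=True)  # force the unfused K/V append path
```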
(microsoft#26927)
### Description
Adds APIs to enable plugin EPs to create and register kernels for control flow operators (If, Loop, and Scan). The implementation provides ORT-managed kernel implementations that handle subgraph execution while allowing EPs to provide device-specific helper functions for operations like tensor concatenation and transposition.

Key changes:
- Adds four EP API functions: `CreateIfKernel`, `CreateLoopKernel`, `CreateScanKernel`, and `ReleaseKernelImpl` for creating control flow kernel implementations
- Introduces public helper structures (`OrtLoopKernelHelper`, `OrtScanKernelHelper`) that EPs implement to provide device-specific operations
- Updates the example kernel-based EP with kernel registrations for all control flow operators and adds corresponding test models.
## Summary
This PR updates the Flash Attention implementation in ONNX Runtime, syncing with newer kernel sources in https://github.com/Dao-AILab/flash-attention, and extending the internal API to support additional features required for advanced caching scenarios. It also aligns specific kernels with the official implementation.

## Changes
- **Flash Attention Kernels**: Updated/added Flash Attention forward kernels and headers in `onnxruntime/contrib_ops/cuda/bert/flash_attention/`.
- **API Extension**: Updated `mha_fwd` and `mha_fwd_kvcache` in `flash_api.h` and `flash_api.cc` to accept two new optional parameters:
  - `cache_batch_idx`: Indices to index into the KV cache (support for non-contiguous batch indices).
  - `leftpad_k`: Support for left-padding in the key sequence.
- **Alignment & Fixes**:
  - **Cleanup**: Removed redundant `kInfinity` definition in `flash_fwd_kernel.h`.
  - **Includes**: Added missing `<core/providers/cuda/shared_inc/cuda_call.h>` in `flash_fwd_launch_template.h`.
  - **Integration**: Updated `group_query_attention_impl.cu` to align with the new `mha_fwd_kvcache` signature.
- **Build Configuration**: Adjusted `onnxruntime_providers_cpu.cmake` to update the exclusion list for Flash Attention kernels in quick build mode.

## Implementation Details
- The `run_mha_fwd` helper now checks if `cache_batch_idx` is provided alongside `k_new` to determine if the split kernel should be forced.
- New parameters are propagated through the call stack to the underlying Flash Attention kernels.
### Description
1. platform.cpp was missing the inclusion of sys/auxv.h (for elf_aux_info) and machine/cpu.h (for PPC_FEATURE2_ARCH_3_00). I missed that in my previous commit.
2. Same as on AIX, __vector int32_t is not defined and __vector int needs to be used.

### Motivation and Context
Fixes the build on the FreeBSD / powerpc64le platform.
(microsoft#26303)
The original check enforces that both the present_key and the past_key must be present. But with IO binding there may be an issue: the past_key can be nullptr even when present_key is allocated. In reality, the kernel should just do the computation when it has the data, or when the output is requested.

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
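The relaxed condition described above can be sketched as follows. This is an illustration of the logic, not the actual kernel code; the function name is hypothetical.

```python
# Illustrative sketch of the relaxed check: run the KV computation when the
# data is available or the output is requested, instead of requiring both
# past_key and present_key (which breaks under IO binding, where past_key
# may be None/nullptr while present_key is allocated).
def should_run_kv_path(past_key, present_key):
    # Old check (too strict): past_key is not None and present_key is not None
    return past_key is not None or present_key is not None

assert should_run_kv_path(None, "present")  # IO binding: past is nullptr
assert not should_run_kv_path(None, None)   # nothing to do
```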
### Description
The CoreML `Gather` MLProgram operator supports fp16, but the check was missing in `GatherOpBuilder::HasSupportedInputsImpl()`.

### Motivation and Context
We use `Gather` in LeelaChessZero.

Co-authored-by: borg323 <borg323@users.noreply.github.com>
### Description
Fixes microsoft#26865
(microsoft#26696)
### Description
When multiple devices are provided in `AppendExecutionProviders_V2`, default to the NPU device instead of picking the last device in the list.
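The selection policy above can be sketched as below. The device records and function name are hypothetical illustrations, not the actual ORT API.

```python
# Sketch of the device-selection policy (device dicts are hypothetical):
# prefer the NPU when multiple devices are supplied, falling back to the
# previous behavior of taking the last entry in the list.
def pick_default_device(devices):
    for d in devices:
        if d["type"] == "NPU":
            return d
    return devices[-1]  # old behavior: last device in the list

devices = [{"type": "GPU", "id": 0}, {"type": "NPU", "id": 1}, {"type": "CPU", "id": 2}]
```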
### Description
NOTE: Needs microsoft#26597
This pull request removes the following from the FL nuget package:
- onnxruntime_providers_cuda.dll
- onnxruntime_providers_qnn.dll
- all the QNN binaries
### Description
- Support pre-opset11 `Pad` in the QNN op builder.
- Add GPU backend tests for `Pad`.

### Motivation and Context
- Enables `Pad` translation in models using older opsets.
### Description
- Transposes are inserted for Softmax with axis != output_rank-1 for the HTP backend.
- The GPU backend also has this requirement on the axis param, so this change enables the layout transformation for the GPU as well.

### Motivation and Context
- Enables more models with the GPU backend.
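The transpose insertion works because Softmax over an arbitrary axis is equivalent to moving that axis last, applying Softmax over the last axis (the only form these backends support), and moving it back. A NumPy sketch of the equivalence:

```python
import numpy as np

# Softmax over axis k == (move axis k last) -> softmax(last axis) -> (move back)
def softmax_last_axis(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def softmax_via_transpose(x, axis):
    moved = np.moveaxis(x, axis, -1)   # bring the target axis to the last position
    return np.moveaxis(softmax_last_axis(moved), -1, axis)

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
direct = np.exp(x - x.max(axis=1, keepdims=True))
direct = direct / direct.sum(axis=1, keepdims=True)  # reference softmax on axis 1
```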
### Description
Enable CUDA Graph support by default.

### Motivation and Context
We wanted CUDA Graph support on by default for the NV TRT-RTX EP. Added the use of RTX graph capture and removed the external access checks for the same.
Made compute capability kCURRENT by default. This gives better performance for benchmarking, and based on the current state, most use cases today are built and run on the same device or same SM.
- Updated stale issue policy to include 'no stale' label checks and modified reply message. - Do not reopen closed stale issues automatically, as it is just noisy. Instead, we could encourage issue owners to reopen manually or create new issues.
### Description
- Add standalone RMSNorm op translation in QNN EP
- Add unit tests

### Motivation and Context
- This fixes the CPU fallback of the ONNX RMSNormalization operator when running inference using QNN EP
### Description
`js/` linting is already done in the Web CI pipeline, so this removes the redundant check, as it is signaling false positives.

Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
### Description
Update QNN default version to 2.42.0.251225

Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
### Description
- Create a new class to handle HTP power config updates.
- Only update if there are changes to the power config settings.
- On dynamic HTP perf mode updates, disable DSPQ polling if the perf mode is not burst.

### Motivation and Context
Currently, if a session has set the performance mode to burst and then changed the performance mode to anything else, DSPQ polling will be enabled and never disabled. This change allows disabling of DSPQ polling when the performance mode is not burst, even on updates.

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
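The update-tracking behavior above can be sketched as a small state machine. The class and attribute names are hypothetical illustrations, not the actual ORT QNN EP code.

```python
# Illustrative state tracker (names hypothetical): only apply a power-config
# update when the perf mode actually changes, and keep DSPQ polling enabled
# only while the mode is "burst".
class HtpPowerConfigTracker:
    def __init__(self):
        self.perf_mode = None
        self.dspq_polling = False

    def update(self, perf_mode):
        if perf_mode == self.perf_mode:
            return False  # no change in settings -> no update applied
        self.perf_mode = perf_mode
        # Previously, polling was enabled on burst but never disabled again.
        self.dspq_polling = (perf_mode == "burst")
        return True

t = HtpPowerConfigTracker()
t.update("burst")      # polling enabled
t.update("balanced")   # polling now correctly disabled
```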
(microsoft#26622)
### Description
- Adds comprehensive logging to the QNN EP that displays detailed information about all the graph inputs, outputs, and initializers.
- Information is dumped into a JSON file during graph composition, only if we are dumping the JSON QNN graph.

### Motivation and Context
- Useful for debugging and understanding whether we are hitting peak memory during inference.
### Description
Fix some correctness issues. Fix the VS 2026 build.
(microsoft#26995)
### Description
Remove the limitations on using onnxruntime_USE_KLEIDIAI in a Windows on Arm environment.

### Motivation and Context
Historically, the KleidiAI build had difficulties with the Microsoft compiler for Arm environments (MSVC). As a result, a hard exclusion of onnxruntime_USE_KLEIDIAI with MSVC was added and subsequently consolidated into cmake/CMakeLists.txt by [this](microsoft@2e8a45a) commit. The problems in KleidiAI were resolved in their v1.14.0 release. v1.15.0 was introduced via [this](microsoft@8fe4804) commit. This PR removes the limitation, allowing MSVC to be used to compile with onnxruntime_USE_KLEIDIAI enabled in a Windows on Arm environment.

In addition, there were legacy restrictions in CMakeLists.txt relating to the DOTPROD and I8MM CPU features. These are already handled in the KleidiAI build.

### Verification
Following the Windows build instructions [here](https://onnxruntime.ai/docs/build/inferencing.html#windows), KleidiAI and its associated logic in MLAS will be built when ARM64 is detected.

**Note**: As is made clear in these build instructions, MSVC must include support for ARM64. Both Python and CMake must be native ARM64.

Signed-off-by: Colm Donelan <colm.donelan@arm.com>
ankitm3k approved these changes on Jan 14, 2026
Description
Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.