Merge 'rel-1.22.0' into 'win-ort-main' @ 2abab8d39e (#24481) #24536
Merged
ashrit-ms merged 77 commits into win-ort-main from rel-1.22.0 on Apr 24, 2025
Conversation
### Description
Exclude the zero-dim input test case for the WebGPU EP.
1. Split build.py into two files, because the file is currently over 3000 lines. This PR moves 900 of them to a new file.
2. Put the build args into groups. This makes it more explicit that the "--x86", "--arm", "--arm64" and "--arm64ec" args are for Windows only.
3. Remove the "--use_avx512" and "--gen-api-doc" build args, as they are not referenced anywhere. "--gen-api-doc" was for generating documents for the PyTorch frontend.
4. Remove MPI-related build flags.
5. Delete tools/ci_build/github/pai/orttraining-ci.yml.
6. Remove --use_preinstalled_eigen and --eigen_path. We now have a more unified approach for all of ORT's dependencies (not just Eigen). See https://onnxruntime.ai/docs/build/dependencies.html for more information.
7. Windows-specific build options no longer show up on non-Windows platforms, and likewise for macOS.
### Description
This PR is one of a series of changes for optimization of Dawn API usage. See #24281.
Optimize the code for workgroup dispatch in the `WebGpuContext` class. The updated code prefers the C API over the C++ API for WebGPU, because the C++ API uses the class `wgpu::Buffer`, which causes a significant number of calls to `wgpuBufferAddRef` and `wgpuBufferRelease` to ensure the buffer's lifecycle is managed correctly. For this specific use case in ONNX Runtime (launching a compute shader program), the C API is more efficient.
There is an issue with the 0.46.0 `wheel` version as reported in pypa/wheel#660. We are currently seeing this on the Python packaging pipelines, for example in [this run](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=740997&view=logs&j=4864752d-f1c3-57c0-06eb-25cee39385a7&s=3fc0883b-27ef-5aa3-1052-0a269c26624c&t=fa95d49e-17f6-501e-c36c-b2949c11fc4a&l=13).
### Description
1. Added a 'ProcessInt64Tensors' method in BaseOpBuilder to handle common input processing for the graph.
2. Added logic in ProcessOutputs to handle common Cast addition at the output.
3. Adds a Cast op at the input to convert graph inputs to int32.
4. Initializers and activation inputs are handled by casting int64_t data to int32_t for QNN compatibility, by resizing and copying the data.
5. Modified `TransposeOpBuilder` and `GatherOpBuilder` to handle processing outputs.
6. Added a unit test for a Reshape op running with int64 inputs.
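Step 4 above can be sketched with numpy. Narrowing int64 initializer data to int32 is only safe when every value fits in the int32 range, so a hedged sketch includes a range check (the function name and the check are illustrative, not the actual QNN EP code):

```python
import numpy as np

def narrow_int64_initializer(data: np.ndarray) -> np.ndarray:
    """Narrow int64 initializer data to int32 for QNN, which lacks int64 support."""
    info = np.iinfo(np.int32)
    # Values outside the int32 range cannot be represented and are rejected.
    if data.min() < info.min or data.max() > info.max:
        raise ValueError("int64 initializer has values outside the int32 range")
    return data.astype(np.int32)

narrowed = narrow_int64_initializer(np.array([1, -2, 3], dtype=np.int64))
```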
Support the WebGPU build for Android and iOS.
- Add Java API for the Android test
- Patch dawn to reduce warnings for UNSAFE_BUFFER_USAGE
### Description
Resolves #24343. Also added a test case to avoid breaking the module resolution of TypeScript in the future.
The MatMulNBits op can simply be emulated by DequantizeLinear + Transpose + MatMul, and currently only 4-bit quantization is supported. Thus the B and zero_points (if present) inputs must be known as initializers with data type 'uint8', and we need to register them as 'uint4' WebNN constants. Typically, all initializers are registered as WebNN constants in one step via `ModelBuilder::RegisterInitializers` before constructing the WebNN graph. However, because WebNN doesn't support casting to 'uint4', we need to defer the registration of these two inputs until `MatMulNBitsBuilder::AddToModelBuilderImpl` is invoked.
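A minimal numpy sketch of the DequantizeLinear + Transpose + MatMul emulation for 4-bit weights. The nibble order, per-column scales, and function name are assumptions for illustration only; the real WebNN builder operates on packed ONNX initializers with a block-wise layout:

```python
import numpy as np

def matmul_nbits_emulated(a, b_packed, scales, zero_points, K, N):
    """Emulate MatMulNBits (4-bit) as DequantizeLinear + Transpose + MatMul."""
    # Unpack two 4-bit values from each uint8 byte (low nibble first -- an
    # assumed packing order for this sketch).
    nibbles = np.stack([b_packed & 0x0F, b_packed >> 4], axis=-1).reshape(-1)
    b_q = nibbles[: K * N].reshape(N, K)
    # DequantizeLinear: (q - zero_point) * scale, one scale per output column.
    b = (b_q.astype(np.float32) - zero_points[:, None]) * scales[:, None]
    # B is stored as (N, K), so Transpose + MatMul is a @ b.T.
    return a @ b.T

# Two bytes hold the four 4-bit weights [1, 2, 3, 4] -> B = [[1, 2], [3, 4]].
a = np.array([[1.0, 1.0]], dtype=np.float32)
out = matmul_nbits_emulated(a,
                            np.array([0x21, 0x43], dtype=np.uint8),
                            scales=np.array([1.0, 0.5], dtype=np.float32),
                            zero_points=np.array([0.0, 0.0], dtype=np.float32),
                            K=2, N=2)
```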
### Description
This script can upload a local perf log/csv to a DB, which can be used as an EP Perf Dashboard external data source. (Make sure the csv/log-parsing logic matches the target DB table's schema.)
#### Usage:
* To post a csv to the db: `python parse_post_perf.py --kusto-table="<table_name>" --kusto-conn="<db_link>" --kusto-db="<dashboard_xyz>" --upload-csv="<path\to\data.csv>"`
* To parse a mobile perf log and post it to the db: `python parse_post_perf.py --kusto-table="<table_name>" --kusto-conn="<db_link>" --kusto-db="<dashboard_xyz>" --parse_mobile_perf --log-file="<path/to/mobile_model_benchmark.log>" --model="<model_name>" --device-id="<device_name>" --commit-id="<ort_commit_id>" --ep="<test_backend>"`
### Description
WebGPU, VitisAI, and DML are missing from the list.
### Motivation and Context
If users misspell a provider name, this error should show them the full set of possibilities; leaving one out will lead to confusion. I noticed when testing new providers in GenAI that the error message was not up to date.
### Description
Adds support for GroupQueryAttention via WebNN matmul, transpose,
reshape, and other operations that follow the logic in the GQA subgraph
below.
```
Abbreviations: B is batch_size, S is sequence_length, W is hidden_size, P is past_sequence_length
N is number of attention heads, H is head size, and W=N*H, h=Sqrt(H), G is group size.
GQA inputs: query, key, value, past_key, past_value, seqlens_k, total_sequence_length
Notes: If the datatype of the inputs (qkv and past kv) is float16, we cast them to float32 to ensure data precision.
query key value
| | |
Reshape Reshape Reshape (B,S,H,N) seqlens_k
| | | / |
| | past_value | (scatter_indices*) |
q_Transpose | \ | / |
(0,2,1,3) | past_key ScatterND-----------------------|------> present_value
\ | / | |
present_key<--\----ScatterND Expand(G) (attention_bias, one/finfo_min mask*)
\ | | /
| Expand(G) | /
| | | /
| k_Transpose | /
| (0,1,3,2) | /
| | | /
+---------------------------------------+
| ScaledDotProductAttention |
+---------------------------------------+
|
output
```
The ScaledDotProductAttention logic is:
```
ScaledDotProductAttention Subgraph: The basis for MultiHeadAttention and GroupQueryAttention
inputs: query, key, value, scale, attention mask, and reshape_output_shape (for reshape)
Abbreviations: B is batch_size, S is query sequence_length, kv_S is key/value sequence length,
N is number of attention heads, H is head size, W is hidden_size
query key
| |
+---matmul---+ scale
| |
+-----div-----+ attn_mask
| |
+-----add-----+ value
| |
+------matmul-----+
|
(0,2,1,3) transpose B,H,S,N -> B,S,H,N
|
Reshape B,S,H,N -> B,S,W
|
output
```
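The subgraph above can be sketched in numpy. This is a hedged illustration: the softmax between the add and the final matmul is part of standard scaled dot-product attention even though the ASCII diagram elides it, and shapes follow the abbreviations above:

```python
import numpy as np

def sdpa(query, key, value, scale, attn_mask):
    """query: (B,N,S,H); key/value: (B,N,kv_S,H); returns (B,S,N*H)."""
    scores = (query @ key.transpose(0, 1, 3, 2)) / scale  # matmul + div
    scores = scores + attn_mask                           # add attention bias
    scores -= scores.max(axis=-1, keepdims=True)          # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ value                                 # matmul -> (B,N,S,H)
    B, N, S, H = out.shape
    # transpose (0,2,1,3) then reshape to (B,S,W) with W = N*H
    return out.transpose(0, 2, 1, 3).reshape(B, S, N * H)

q = np.ones((1, 2, 3, 4), dtype=np.float32)
k = np.ones((1, 2, 3, 4), dtype=np.float32)
v = np.full((1, 2, 3, 4), 2.0, dtype=np.float32)
out = sdpa(q, k, v, scale=2.0, attn_mask=np.zeros((1, 2, 3, 3), dtype=np.float32))
```

With a constant value tensor the softmax weights sum to 1, so the output equals that constant regardless of the scores.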
scatter_indices's calculation:
```
if_prefill (0/1 constant)
|
scatter_indices_left_constant scatter_indices_right_constant 0 ---> Where <--- Cast <---seqlens_k
| | |
| Add <--------------------------- scatter_pos*
| |
+--------------------+---------------------+
|
scatter_indices
```
attention_bias's calculation:
```
ones_array (shape=B,N,S,P) range_of_qkv_sequence_length_constant (0,1,2,...) (shape=S)
| |
CumSum (axis=3, exclusive=true, reversed=false) Add <--- scatter_pos
| |
| Expand (shape=P,S)
| |
+-------------------------------> Lesser <------------------------------Transpose (1,0)
|
1 ---> Where <--- finfo_min (minimum value of FP32)
|
attention_bias
```
*Note: currently we only support `past_sequence_length ==
total_sequence_length` for GQA.*
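The attention_bias graph above can be sketched in numpy. This is a hedged sketch: it assumes scatter_pos counts the key positions already valid for the first query row, and the exact off-by-one convention is an assumption of this sketch rather than the WebNN EP's actual code:

```python
import numpy as np

FINFO_MIN = np.finfo(np.float32).min  # minimum value of FP32

def attention_bias(B, N, S, P, scatter_pos):
    # CumSum(axis=3, exclusive=true) over a ones array yields each key
    # position index 0..P-1.
    key_pos = np.cumsum(np.ones((B, N, S, P), np.float32), axis=3) - 1.0
    # range(S) + scatter_pos gives, per query row, how many key positions
    # are valid; Expand to (P, S) then Transpose(1, 0) to (S, P).
    limit = np.arange(S, dtype=np.float32) + scatter_pos
    limit = np.broadcast_to(limit, (P, S)).T
    # Lesser, then Where(allowed, 1, finfo_min).
    allowed = key_pos < limit
    return np.where(allowed, 1.0, FINFO_MIN).astype(np.float32)

bias = attention_bias(B=1, N=1, S=2, P=4, scatter_pos=1)
```

Each query row in the result keeps 1.0 for attended key positions and finfo_min elsewhere, giving a causal pattern.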
The Azure DevOps pipeline template [/nuget/templates/dml-vs-2022.yml](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/azure-pipelines/nuget/templates/dml-vs-2022.yml) is used to build the ONNX Runtime DirectML (DML) components. It historically contained two potential mechanisms for creating NuGet packages:
1. Invoking `python tools/ci_build/build.py` with the `--build_nuget` flag.
2. Executing a specific `NuPackScript` (usually calling `msbuild /t:CreatePackage`).

This redundancy created a significant problem during release builds (when the pipeline parameter IsReleaseBuild is set to true). Here's why:
- Duplicate package creation: both packaging methods would execute. `build.py --build_nuget` created a package with a development/pre-release version suffix (e.g., Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg), while the NuPackScript's msbuild call, influenced by IsReleaseBuild=true, created the clean release version package (e.g., Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg).
- `ren` command failure: for the x86 and arm64 builds, the NuPackScript contains a command like:
  ```Bash
  ren Microsoft.ML.OnnxRuntime.DirectML.* win-dml-x86.zip
  ```
  This command fails when two files match the pattern Microsoft.ML.OnnxRuntime.DirectML.* (the dev package and the release package), as ren requires a single source file when using wildcards for renaming.
- Result: this caused build failures specifically when attempting to create release candidates or final release builds for x86 and arm64 DML components. The issue did not typically occur in regular nightly builds (IsReleaseBuild: false) because only one package (the dev version) was produced, allowing the ren command to succeed. We therefore only found the problem when doing a patch release for ONNX Runtime 1.21. (@amarin16, the release manager of ONNX Runtime 1.21, found the issue and explained to us why the pipeline was not working.)

The change is relatively simple. This PR removes the `--build_nuget` flag from the `python tools/ci_build/build.py` command within the dml-vs-2022.yml template. By removing the redundant packaging step from build.py, only the NuPackScript's msbuild command generates a package file. This ensures only one file matches the Microsoft.ML.OnnxRuntime.DirectML.* pattern, allowing the subsequent ren command in the x86 and arm64 scripts to execute successfully during release builds.

# Background (how the DML packaging pipeline works)
The build has two stages:
1. Individual architecture builds (using dml-vs-2022.yml): each stage (x64, x86, arm64) runs, now reliably using only its specific NuPackScript to generate its artifact without the risk of the ren command failing during release. x64 produces Microsoft.ML.OnnxRuntime.DirectML.[version].nupkg, x86 produces win-dml-x86.zip, and arm64 produces win-dml-arm64.zip (arm32 is not built/included).
2. Final packaging stage (e.g., stages/nuget_dml_packaging_stage.yml): downloads these artifacts and combines them by unpacking the base x64 .nupkg, injecting the contents of the .zip files into the appropriate runtimes/ directories (e.g., runtimes/win-x86/native/, runtimes/win-arm64/native/), and re-packing the final, multi-architecture Microsoft.ML.OnnxRuntime.DirectML.nupkg.

In stage 1 only x64 produces a NuGet package. Therefore the MSBuild parameter `/p:IsReleaseBuild=${{ parameters.IsReleaseBuild }}` is passed to all architectures' MSBuild calls, while `/p:CurrentData=$(BuildDate) /p:CurrentTime=$(BuildTime)` are passed only in the x64 script. Incidentally, the property "CurrentData" appears to be a typo; it should be `CurrentDate`.
### Description
Make the test `CApiTest.RequestLoadCancellation` deterministic by removing the `terminator` thread.
### Motivation and Context
The test contributes to CI failures.
### Description
This change allows NPM tests to run the Node.js binding for WebGPU. This
makes test failures much easier to debug, because WebAssembly is
generally very difficult to debug.
Steps to debug:
1. Build:
   - {ORT_ROOT}> build --config Debug --use_webgpu --build_nodejs
   - {ORT_ROOT}\js\web> npm ci
   - {ORT_ROOT}\js\web> npm run pull:wasm
2. Run `npm test -- <args> -b=webgpu -e=node` once. (This command
   generates the necessary .js files and `testdata-config.json`.)
3. Use a native debugger to debug:
```
C:\Program Files\nodejs\node.exe
{ORT_ROOT}\js\node_modules\mocha\bin\_mocha --timeout 999999 --colors -r
{ORT_ROOT}\js/web/dist/ort.node.min.js {ORT_ROOT}\js/web/test/test-main
```
### Description
`MlasTranspose` was previously running single-threaded, which resulted in suboptimal performance on multi-threaded CPUs. To address this, I have modified it to utilize multi-threading.
### Motivation and Context
We encountered this issue while running [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large), converted to ONNX format and executed on a multi-core CPU (Xeon 6338). Below are the performance metrics before and after the modification:

| | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE | Duration time [sec] |
| ------ | ----------------- | ----------------- | ------------ | ---------- | ------------------ |
| BEFORE | 1 | 16 | 512 | 4 | 1.24 |
| AFTER | 1 | 16 | 512 | 4 | 1.09 |

Conditions: FP32, CPUExecutionProvider. This change resulted in a performance improvement of approximately 14%. Stand-alone `MlasTranspose` performance improvements are as follows:

| | INTRA_NUM_THREADS | BEFORE | AFTER |
| -------------------- | ----------------- | ----------- | ---------- |
| MlasTranspose [msec] | 16 | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is roughly 15-16x faster.
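The idea behind the multi-threaded transpose can be illustrated in Python (a sketch only; the actual change is in the C++ MLAS code): split the source rows into disjoint slabs and let each thread fill its slab of the destination.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def transpose_mt(src: np.ndarray, num_threads: int = 4) -> np.ndarray:
    """Transpose a 2-D matrix by splitting source rows across threads."""
    rows, cols = src.shape
    dst = np.empty((cols, rows), dtype=src.dtype)

    def worker(r0: int, r1: int) -> None:
        dst[:, r0:r1] = src[r0:r1, :].T  # each thread owns a disjoint slab

    chunk = max(1, (rows + num_threads - 1) // num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for r0 in range(0, rows, chunk):
            pool.submit(worker, r0, min(r0 + chunk, rows))
    # Exiting the with-block waits for all submitted slabs to finish.
    return dst

result = transpose_mt(np.arange(12).reshape(3, 4), num_threads=2)
```

Because each worker writes a disjoint column slab of `dst`, no synchronization beyond the final join is needed.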
On Qualcomm Adreno X1 GPUs, the previous implementation of the FlashAttentionProgram shader in the WebGPU backend was causing high register pressure, leading to performance degradation. This PR uses workgroup memory to reduce the register pressure and improve performance. TTFT for phi4 with 1K inputs improves from 40s to 10s on the Qualcomm Adreno X1 GPU.
### Description
1. Transform the INT64 shape of the Expand op to an INT32 shape.
2. Add a unit test to check the INT64-to-INT32 shape conversion by the QNN EP.
### Motivation and Context
QNN doesn't support INT64 shapes for the Expand op. This commit allows Expand ops with INT64 shapes to be delegated to the QNN EP, which improves inference time.
### Description
- Fix a bug in ConvTranspose. This bug causes `input_channels_per_group_int` to be `-3` for a test case, and later causes a loop of `4294967293` iterations (`uint32_t(-3)`), causing a timeout.
- Fix the cache hint of Conv2dMMProgram. After fixing the bug in ConvTranspose, more cache hint inconsistencies are revealed. This change fixes channel_last missing in the cache hint of Conv2dMMProgram.
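The 4294967293 figure follows from C's unsigned conversion rule, which reduces the signed value modulo 2**32. A quick check:

```python
# uint32_t(-3) in C is (-3) mod 2**32 -- a huge loop bound instead of an error.
wrapped = (-3) % 2**32
print(wrapped)  # 4294967293, i.e. 2**32 - 3
```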
1. Migrate the OpenVINO pipeline to GitHub Actions.
2. Update the OpenVINO pipeline's Dockerfile to use AlmaLinux 8 instead of Ubuntu, to align with the other Linux CI pipelines. (We cannot pull images from Docker Hub because it requires a paid account.)
### Description
Add the InstanceNormalization operator to the WebGPU EP.
…ite-default (#24396) Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.2.5 to 6.2.6. The 6.2.6 release fixes rejection of requests with `#` in the request-target ([vitejs/vite#19830](https://redirect.github.com/vitejs/vite/issues/19830)); see the [CHANGELOG](https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md) for details. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description Update protobuf-java to 3.25.5 ### Motivation and Context To fix the [CG issue](https://aiinfra.visualstudio.com/Lotus/_componentGovernance/218239/alert/12112143?typeId=29309793&pipelinesTrackingFilter=0). Change file links - [x] java_linux_final_test.sh -> java-cuda-packaging-stage.yml (Jar_Packaging_GPU stage from Zip-Nuget) - [ ] final-jar-testing.yml (Final_Jar_Testing_$ stages)
### Description
- Adds C/C++ API functionality to compile a model (i.e., generate a
model with EPContext nodes) using explicit APIs.
- Adds support for compiling when input or output models are in memory
(not just files).
- Allows specifying the threshold for when initializers are stored in an
external file.
- Allows file paths of arbitrary lengths (session_options key/value
configs limit string lengths to 2048).
List of C API functions:
```C++
ORT_API(const OrtCompileApi*, GetCompileApi);
ORT_API(void, ReleaseModelCompilationOptions, _Frees_ptr_opt_ OrtModelCompilationOptions*);
ORT_API2_STATUS(CreateModelCompilationOptionsFromSessionOptions, _In_ const OrtEnv* env,
_In_ const OrtSessionOptions* session_options, _Outptr_ OrtModelCompilationOptions** out);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
_In_ const ORTCHAR_T* input_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelFromBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
_In_ const void* input_model_data, size_t input_model_data_size);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
_In_ const ORTCHAR_T* output_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelExternalInitializersFile,
_In_ OrtModelCompilationOptions* model_compile_options,
_In_ const ORTCHAR_T* external_initializers_file_path,
size_t external_initializer_size_threshold);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
_Inout_ OrtAllocator* allocator, void** output_model_buffer_ptr, size_t* output_model_buffer_size_ptr);
ORT_API2_STATUS(ModelCompilationOptions_SetEpContextEmbedMode, _In_ OrtModelCompilationOptions* model_compile_options,
bool embed_ep_context_in_model);
ORT_API2_STATUS(CompileModel, _In_ const OrtEnv* env, _In_ const OrtModelCompilationOptions* model_options);
```
Example (see unit tests for others):
```C++
#include "onnxruntime_cxx_api.h"
// Test using the CompileModel() API with settings:
// - input model from buffer
// - output model file
// - EPContext nodes in output model use embedded binary blobs.
TEST_F(QnnHTPBackendTests, CompileApi_FromSessionOptions_InputModelAsBuffer_Embedded) {
const ORTCHAR_T* output_model_file = ORT_TSTR("./qnn_context_binary_multi_partition_test.onnx");
std::filesystem::remove(output_model_file);
// Initialize session options with QNN EP
Ort::SessionOptions session_options;
ProviderOptions provider_options;
#if defined(_WIN32)
provider_options["backend_path"] = "QnnHtp.dll";
#else
provider_options["backend_path"] = "libQnnHtp.so";
#endif
provider_options["offload_graph_io_quantization"] = "0";
session_options.AppendExecutionProvider("QNN", provider_options);
// Create model compilation options from the session options.
Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
compile_options.SetInputModelFromBuffer(reinterpret_cast<const void*>(model_data.data()), model_data.size());
compile_options.SetOutputModelPath(output_model_file);
compile_options.SetEpContextEmbedMode(true);
// Compile the model.
Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
ASSERT_TRUE(status.IsOK());
// Make sure the compiled model was generated and has the expected number of EPContext nodes.
ASSERT_TRUE(std::filesystem::exists(output_model_file));
CheckEpContextNodeCounts(output_model_file, 2, 2);
}
```
### Motivation and Context
Improve compilation workflow and add new capabilities.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Add 8-bit support to the MatMulNBits quantizer. matmul_4bits_quantizer can now quantize a const B input of a MatMul to an 8-bit initializer.
### Motivation and Context
MatMul4Bits has an accuracy issue for the phi-4 model used for Foundry Local. An early prototype showed that >= 6 bits can fix the issue. To mitigate the issue as soon as possible, add 8-bit support to MatMulNBits.
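A hedged numpy sketch of per-block symmetric 8-bit weight quantization, the general scheme this quantizer family uses. Function names, the (K, N) layout, and the symmetric scheme are illustrative assumptions; the actual MatMulNBits packing differs:

```python
import numpy as np

def quantize_blockwise_8bit(w: np.ndarray, block_size: int = 32):
    """Symmetric per-block 8-bit quantization of a (K, N) weight matrix."""
    K, N = w.shape
    pad = (-K) % block_size
    wp = np.pad(w, ((0, pad), (0, 0))).reshape(-1, block_size, N)  # (blocks, bs, N)
    scales = np.abs(wp).max(axis=1) / 127.0                        # one scale per block/column
    scales = np.where(scales == 0, 1.0, scales)                    # avoid divide-by-zero
    q = np.clip(np.round(wp / scales[:, None, :]), -127, 127).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales, K):
    w = q.astype(np.float32) * scales[:, None, :]
    return w.reshape(-1, scales.shape[-1])[:K]

w = np.linspace(-1.0, 1.0, 64, dtype=np.float32).reshape(32, 2)
q, scales = quantize_blockwise_8bit(w, block_size=32)
err = np.max(np.abs(dequantize_blockwise(q, scales, K=32) - w))
```

With 8 bits the round-trip error is bounded by half the block scale, which is why >= 6 bits recovers accuracy that 4 bits loses.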
### Description
There are 2 benefits to this change:
- The comments contain "Σ", a Unicode character that causes `std::wclog` to fail
and stop outputting any further logs in a Windows native app, unless UTF-8 is
explicitly enabled via `std::wclog.imbue(std::locale(".UTF-8"));`. Moving
the comments out of the WGSL code resolves the problem.
- It makes the WGSL code slightly shorter.
### Description
Replace use of gsl::narrow with narrow to build for XNNPACK with exceptions disabled. @snnn
### Motivation and Context
Address issue #24383.
Unblocks nomic-embed model.
### Description
Support mixed precision in quantization for RTN
### Motivation and Context
More flexible for quantization
Usage:
```
customized_weight_config = {}
for i in layers_to_exclude:
    customized_weight_config["/model/layers." + str(i) + "/MatMul"] = {"bits": 8}
algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)
quant = MatMul4BitsQuantizer(
    model=onnx_model,
    block_size=32,
    is_symmetric=False,
    accuracy_level=4,
    nodes_to_exclude=nodes_to_exclude,
    algo_config=algo_config,
)
```
…4385) ### Description
This PR adds support for the Resize operator in cubic mode without antialiasing (antialias=0). It supports scaling constraints of the form [1, scale_h, scale_w, 1], where scale_h ≥ 1 and scale_w ≥ 1.
### Motivation and Context
The ARM64 Conv supports FP16, and we have an NhwcTransformer that transforms FP16 Conv to FP16 NhwcFusedConv. As a result, the subsequent Resize op also uses the NHWC format.
### Description
Update the N-API version to 6.
- NAPI v6 is required for `napi_set_instance_data` and `napi_get_instance_data`, as used by #24366.
- Add the "binary" field in package.json for CMake-js to work correctly (it was unintentionally removed in #24418).
### Description
Fix a compilation issue (undeclared identifier) in the Azure EP unit test.
### Motivation and Context
A previous PR (#24433) caused a compilation issue in the Azure EP unit test. Our PR CI pipelines did not catch it; it was caught by our post-merge packaging pipelines.
```shell
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(28,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(29,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(30,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
```
### Description
If it would improve performance, this patch moves outputs to MLTensor-backed Tensors.
### Motivation and Context
We are currently performing an extra copy on output tensors located on the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)-> JS). This patch removes this copy by moving the readback to JS instead of wasm. As an extra benefit, we can also start the readbacks and wait for them in parallel. This change is similar to #23073.
### Description
Fix the Node.js binding build for Linux.
### Description
MatmulTransposeFusion does not work correctly when inputs A and B of a `MatMul` node are the same.  Fixes #24341
### Description
zeros_ memory buffer was uninitialized, but it must be initialized to
zero.
### Motivation and Context
A memory allocator change in GenAI started crashing in FlashAttention
and this was eventually tracked down to be the cause. The allocator
change was innocent. I'm not sure how this didn't fail previously, or
whether it did and we simply weren't getting reports about it.
Co-authored-by: Ryan Hill <{ID}+{username}@users.noreply.github.com>
### Description
Map ORT verbose logging back to QnnGpu debug logging.
### Motivation and Context
As of now this change is required for the QnnGpu backend to run models correctly. Its necessity is mentioned in commit b4b5a79. It temporarily reverts commit 9d45b9a, for the GPU case only, due to loss of functionality.
### Description
Update the Node.js binding documentation for the 1.22 release.
### Description
Handle empty input cases in the native reduce kernel.
### Description
A TensorProto may have external data in an existing memory buffer. For such TensorProtos, the 'location' field of the external data info is set to a special marker `*/_ORT_MEM_ADDR_/*`, and the 'offset' field contains the address of the memory buffer. This PR allows the DirectML EP to recognize in-memory external-data TensorProtos and use the address of the existing memory buffer containing the external data.
### Motivation and Context
Applications using the ModelEditor API may create initializers with existing buffers to save memory, such as WebNN. This fix allows the DirectML EP to be used by those applications.
---------
Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
### Description
Update the packaging pipeline to include the corresponding NuGet version info for the Node.js binding.
…am (#24390) ### Description Supports batch and zero points in MatMulNBits WideTileProgram ### Motivation and Context See above
### Description
Add validation to the path where the user calls CreateSessionFromArray: if ep.context_enable is set, then ep.context_file_path is expected; otherwise report an error, because ORT doesn't know where to generate the _ctx.onnx file.
…ith the design doc (#24461) ### Description Update the generated Qnn context binary file name to align with the EPContext design doc https://onnxruntime.ai/docs/execution-providers/EP-Context-Design.html
…view) (#24457) ### Description
This PR introduces a new provider option called `enable_causallm` for OVEP. This provider option will serve as an entry gate towards enabling inference using ORT GenAI integration with OVEP in an upcoming PR.
### Description
Upgrade Transformers to 4.48.0 for llama2. This version deprecated the old format of past_key_value; the current format is DynamicCache, so we need to add patches to the dynamo exporter in llama2. Thanks to @xadupre, who made the changes to add the patches to the dynamo exporter and implemented patches for transformers 4.48.0 for models that don't export, converting dynamic_axes into dynamic shapes.
---------
Co-authored-by: xadupre <xadupre@microsoft.com> Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…24416) ### Description
Adds a session config option (`"session.disable_model_compile"`) that disables model compilation during session initialization. If this option is set to "1", inference session creation will fail with error code ORT_MODEL_REQUIRES_COMPILATION if compilation is required to run the model on any Execution Provider added to the session. Only the following kinds of models are valid when this option is set to "1":
- Pre-compiled models that have EPContext nodes for the compiling Execution Providers in the session.
- Non-compiled models that run only on non-compiling Execution Providers, like CPU EP.

### Example usage
The following example (taken from a unit test) tries to load a model that requires compilation with a session that disables compilation. The session creation fails with error code `ORT_MODEL_REQUIRES_COMPILATION`. Then, the example compiles the model and loads the compiled model successfully.
```C++
// Taken from a unit test
...
// Initialize session options with QNN EP
Ort::SessionOptions session_options;
ProviderOptions provider_options;
provider_options["backend_type"] = "htp";
provider_options["offload_graph_io_quantization"] = "0";

session_options.AppendExecutionProvider("QNN", provider_options);
session_options.AddConfigEntry(kOrtSessionOptionsDisableEpCompile, "1");  // Disable model compilation!

// Create an inference session that fails with error ORT_MODEL_REQUIRES_COMPILATION
try {
  Ort::Session session(*ort_env, input_model_file, session_options);
  FAIL() << "Expected Session creation to fail but it succeeded";  // Should not get here!
} catch (const Ort::Exception& excpt) {
  OrtErrorCode error_code = excpt.GetOrtErrorCode();
  std::string_view error_msg = excpt.what();
  ASSERT_EQ(error_code, ORT_MODEL_REQUIRES_COMPILATION);
  ASSERT_THAT(error_msg, testing::HasSubstr(kQnnExecutionProvider));
}

// Session creation failed because the model was not pre-compiled.
// Try to compile it now.

// Create model compilation options from the session options.
Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
compile_options.SetInputModelPath(input_model_file);
compile_options.SetOutputModelPath(output_model_file);

// Compile the model.
Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
ASSERT_TRUE(status.IsOK()) << status.GetErrorMessage();

// Should be able to create a session with the compiled model and the original session options.
Ort::Session session(*ort_env, output_model_file, session_options);
```
### Motivation and Context
Compiling models can take a very long time. We want a session option that requires input models that do not need compilation.
…#24463)

### Description
Re-enables (and fixes) generation of compiled EpContext models with **both** input and output models stored in buffers.

### Motivation and Context
Previous PR #24176 inadvertently added a check that disabled storing both input and output models in buffers. However, we need this functionality. This was actually a fortunate scenario, as it led to the discovery of a bug.
### Description
* Rename the filename and class name, since the quantizer supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8-bit support for the default weight-only quantizer.
…24474)

### Description
Use a pimpl-esque approach so that the winml OrtModel type doesn't conflict with the model editing API OrtModel.

### Motivation and Context
Fix a crash caused by the linker calling the incorrect destructor when there are two different OrtModel types in the global namespace.
…h to int32 (#24425)

Some WebNN backends support limited data types for the input and output of a WebNN graph, but support more data types for intermediate nodes. To address this limitation, we implement a data type fallback mechanism. (Note: currently we only support fallback to int32 for certain integer data types.) If a data type is not supported for a graph's input or output but is supported for intermediate nodes, we will:
1. Save the input MLTensor with the 'int32' data type,
2. Convert the input data from ORT to int32,
3. Insert a cast operation into the WebNN graph to convert the input back to its original data type,
4. Insert a cast operation into the WebNN graph to convert the output to 'int32',
5. Convert the output data from int32 back to its original data type.
### Description
Add infrastructure to enable automatic EP selection, with device discovery for CPU/GPU/NPU on Windows. Currently supports internal (CPU/DML/WebGPU) and provider bridge (CUDA) EPs. The infrastructure will be used with plugin EPs next. The selection policy implementation will also be added next, so in the interim there is a temporary function with manually specified selection so unit tests can cover the end-to-end flow.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
### Description
WebNN doesn't support AveragePool with count_include_pad == 1.

### Motivation and Context
Support it by adding a Pad and calling averagePool2d with pads set to 0's.
### Description
Fix some issues:
- Use the adapter number instead of the bus number. The bus number doesn't work as expected on VMs.
- Disable for the XBOX build; it needs different handling for adapter lookup.
- Use the adapter number as device_id when creating the DML OrtEpDevice.
- Fix some issues with the metadata.
### Description
This PR merges the 'rel-1.22.0' branch into 'win-ort-main' at commit 2abab8d39e.