
Merge 'rel-1.22.0' into 'win-ort-main' @ 2abab8d39e (#24481)#24536

Merged
ashrit-ms merged 77 commits into win-ort-main from
ashritms/update-to-rel-1.22.0
Apr 24, 2025

Conversation

@ashrit-ms
Contributor

Description

This PR merges the 'rel-1.22.0' branch into 'win-ort-main' at commit 2abab8d39e

satyajandhyala and others added 30 commits April 24, 2025 11:01
### Description
Exclude zero-dim input testcase for WebGPU EP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. Split build.py to two files, because currently the file is over 3000
lines. This PR moves 900 of them to a new file.
2. Put the build args into groups. It makes more explicit that "--x86",
"--arm", "--arm64" and "--arm64ec" args are for Windows only.
3. Remove the "--use_avx512" and "--gen-api-doc" build args, as they are
not referenced anywhere. "--gen-api-doc" was for generating documents
for pytorch frontend.
4. Remove MPI related build flags.
5. Delete tools/ci_build/github/pai/orttraining-ci.yml
6. Remove --use_preinstalled_eigen and --eigen_path. Now we have a more
unified approach for all ORT's dependencies (not just eigen). See
https://onnxruntime.ai/docs/build/dependencies.html for more
information.
7. Windows specific build options won't show up on non-Windows
platforms. The same for macOS.
### Description

This PR is one of a series of changes for optimization of Dawn API
usage. See #24281

Optimize the code for workgroup dispatch in the `WebGpuContext` class.

The updated code prefers the C API over the C++ API for WebGPU. The C++
API uses the class `wgpu::Buffer`, which generates a significant number
of calls to `wgpuBufferAddRef` and `wgpuBufferRelease` to manage the
buffer's lifecycle correctly. For this specific use case in ONNX Runtime
(launching a compute shader program), the C API is more efficient.
### Description

1. Added a 'ProcessInt64Tensors' method in BaseOpBuilder to handle common input processing for the graph.
2. Added logic in ProcessOutputs to handle the common Cast addition at the output.
3. Added a Cast op at the input to convert graph inputs to int32.
4. Handled initializers and activation inputs by casting int64_t data to int32_t for QNN compatibility (resizing and copying the data).
5. Modified `TransposeOpBuilder` and `GatherOpBuilder` to handle processing outputs.
6. Added a unit test for a Reshape op with int64 inputs.
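Step 4 above (casting int64 initializer data to int32) can be sketched roughly as follows. This is a hedged NumPy illustration, not the actual QNN EP code; the explicit range check is an assumption about how out-of-range values should be treated:

```python
import numpy as np

def narrow_int64_to_int32(data: np.ndarray) -> np.ndarray:
    """Narrow int64 tensor data to int32, verifying every value fits."""
    info = np.iinfo(np.int32)
    if data.min() < info.min or data.max() > info.max:
        raise ValueError("int64 value out of int32 range; cannot narrow safely")
    # Resize/copy the data at the smaller width, as described above.
    return data.astype(np.int32)

indices = np.array([0, 2, 5, -1], dtype=np.int64)
narrowed = narrow_int64_to_int32(indices)
print(narrowed.dtype)  # int32
```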
Support the WebGPU build for Android and iOS.
- Add a Java API for the Android test
- Patch Dawn to reduce warnings for UNSAFE_BUFFER_USAGE
### Description

Resolves #24343. Also
added a test case to avoid breaking the module resolution of TypeScript
in the future.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The MatMulNBits op can be simply emulated by DequantizeLinear + Transpose +
MatMul; currently only 4-bit quantization is supported.

Thus the B and zero_points (if present) inputs must be known as
initializers with data type 'uint8', and we need to register them as
'uint4' WebNN constants.

Typically, all initializers are registered as WebNN constants in one
step via `ModelBuilder::RegisterInitializers` before constructing the
WebNN graph. However, because WebNN doesn't support casting to 'uint4', we
need to defer the registration of these two inputs until
`MatMulNBitsBuilder::AddToModelBuilderImpl` is invoked.
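A rough NumPy sketch of that emulation; the nibble order, zero-point handling, and per-row scale layout here are illustrative assumptions, not WebNN's exact layout:

```python
import numpy as np

def emulate_matmul_4bits(a, b_packed, scales, zero_point=8):
    """MatMulNBits as DequantizeLinear + Transpose + MatMul, for 4-bit weights."""
    # Unpack two 4-bit values per uint8 byte (low nibble first: an assumption).
    low = (b_packed & 0x0F).astype(np.float32)
    high = (b_packed >> 4).astype(np.float32)
    q = np.stack([low, high], axis=-1).reshape(b_packed.shape[0], -1)
    b = (q - zero_point) * scales[:, None]  # DequantizeLinear, per-row scale
    return a @ b.T                          # Transpose + MatMul

a = np.random.rand(2, 4).astype(np.float32)
b_packed = np.random.randint(0, 256, size=(3, 2), dtype=np.uint8)  # 3x4 weights, packed
scales = np.full(3, 0.1, dtype=np.float32)
out = emulate_matmul_4bits(a, b_packed, scales)
print(out.shape)  # (2, 3)
```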
### Description
<!-- Describe your changes. -->
This script can upload a local perf log/CSV to the DB, which can be used
as the EP Perf Dashboard's external data source.
(Make sure the CSV/log-parsing logic matches the target DB table's
schema.)

#### Usage:
* To post csv to db:
`python parse_post_perf.py --kusto-table="<table_name>"
--kusto-conn="<db_link>" --kusto-db="<dashboard_xyz>"
--upload-csv="<path\to\data.csv>"
`
* To parse log from mobile perf log and post to db:
`python parse_post_perf.py --kusto-table="<table_name>"
--kusto-conn="<db_link>" --kusto-db="<dashboard_xyz>"
--parse_mobile_perf --log-file="<path/to/mobile_model_benchmark.log>"
--model="<model_name>" --device-id="<device_name>"
--commit-id="<ort_commit_id>" --ep="<test_backend>"`

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
WebGPU, VitisAI, and DML are missing from the list.

### Motivation and Context
If users misspell a provider name, this error message should show them the
full list of possibilities. Leaving one out will lead to confusion.

I noticed while testing new providers in GenAI that the error message
was not up to date.
### Description
<!-- Describe your changes. -->
Adds support for GroupQueryAttention via WebNN matmul, transpose,
reshape, and other operations that follow the logic in the GQA subgraph
below.

```
 Abbreviations: B is batch_size, S is sequence_length, W is hidden_size, P is past_sequence_length
                N is number of attention heads, H is head size, and W=N*H, h=Sqrt(H), G is group size.
    GQA inputs: query, key, value, past_key, past_value, seqlens_k, total_sequence_length
    Notes: If the datatype of the inputs (qkv and past kv) is float16, we cast them to float32 to ensure data precision.

          query      key               value
            |         |                  |
         Reshape   Reshape            Reshape (B,S,H,N)     seqlens_k
            |         |                  |                  /       |
            |         |       past_value |   (scatter_indices*)     |
        q_Transpose   |              \   |   /                      |
        (0,2,1,3)     | past_key    ScatterND-----------------------|------> present_value
             \        |  /              |                           |
present_key<--\----ScatterND         Expand(G)      (attention_bias, one/finfo_min mask*)
               \      |                 |              /
               |   Expand(G)            |             /
               |      |                 |            /
               |  k_Transpose           |           /
               |   (0,1,3,2)            |          /
               |      |                 |         /
            +---------------------------------------+
            |        ScaledDotProductAttention      |
            +---------------------------------------+
                             |
                           output

```
The ScaledDotProductAttention logic is:
```
    ScaledDotProductAttention Subgraph: The basis for MultiHeadAttention and GroupQueryAttention
    inputs: query, key, value, scale, attention mask, and reshape_output_shape (for reshape)
    Abbreviations: B is batch_size, S is query sequence_length, kv_S is key/value sequence length,
                  N is number of attention heads, H is head size, W is hidden_size

  query         key
    |            |
    +---matmul---+    scale
          |             |
          +-----div-----+   attn_mask
                 |             |
                 +-----add-----+        value
                        |                 |
                        +------matmul-----+
                                 |
                   (0,2,1,3) transpose B,H,S,N -> B,S,H,N
                                 |
                              Reshape B,S,H,N -> B,S,W
                                 |
                               output
```
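The subgraph above can be sketched numerically. This is a minimal NumPy sketch, assuming (B,N,S,H) layouts, and it includes the softmax that the diagram elides between the add and the second matmul:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, scale, attn_mask):
    # q: (B,N,S,H), k/v: (B,N,kv_S,H), attn_mask: (B,N,S,kv_S)
    scores = (q @ k.transpose(0, 1, 3, 2)) / scale  # matmul + div by scale
    scores = scores + attn_mask                     # add attention bias
    # Softmax over the key axis (not drawn in the subgraph sketch above).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v                               # matmul with value
    B, N, S, H = out.shape
    # Transpose (0,2,1,3) then reshape: (B,N,S,H) -> (B,S,N*H) == (B,S,W)
    return out.transpose(0, 2, 1, 3).reshape(B, S, N * H)

B, N, S, H = 1, 2, 3, 4
q = np.random.rand(B, N, S, H).astype(np.float32)
k = np.random.rand(B, N, S, H).astype(np.float32)
v = np.random.rand(B, N, S, H).astype(np.float32)
mask = np.zeros((B, N, S, S), dtype=np.float32)
out = scaled_dot_product_attention(q, k, v, np.sqrt(H), mask)
print(out.shape)  # (1, 3, 8)
```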
scatter_indices's calculation:
```
                                                                                               if_prefill (0/1 constant)
                                                                                                    |
        scatter_indices_left_constant             scatter_indices_right_constant           0 ---> Where <--- Cast <---seqlens_k
                      |                                          |                                  |
                      |                                         Add <--------------------------- scatter_pos*
                      |                                          |
                      +--------------------+---------------------+
                                           |
                                      scatter_indices
```

attention_bias's calculation:
```
                  ones_array (shape=B,N,S,P)                                  range_of_qkv_sequence_length_constant (0,1,2,...) (shape=S)
                      |                                                                          |
                   CumSum (axis=3, exclusive=true, reversed=false)                              Add <--- scatter_pos
                      |                                                                          |
                      |                                                                        Expand (shape=P,S)
                      |                                                                          |
                      +-------------------------------> Lesser <------------------------------Transpose (1,0)
                                                           |
                                                  1 ---> Where <--- finfo_min (minimum value of FP32)
                                                           |
                                                      attention_bias
```

*Note: currently only `past_sequence_length ==
total_sequence_length` is supported for GQA.*

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The Azure DevOps pipeline template
[/nuget/templates/dml-vs-2022.yml](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/azure-pipelines/nuget/templates/dml-vs-2022.yml)
is used to build the ONNX Runtime DirectML (DML) components. It
historically contained two potential mechanisms for creating NuGet
packages:

1. Invoking `python tools/ci_build/build.py` with the `--build_nuget`
flag.
2. Executing a specific `NuPackScript` (usually calling `msbuild
/t:CreatePackage`).

This redundancy created a significant problem during release builds
(when the pipeline parameter IsReleaseBuild is set to true). Here's why:
- Duplicate Package Creation: Both packaging methods would execute.
- build.py --build_nuget created a package with a
development/pre-release version suffix (e.g.,
Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg).
- The NuPackScript's msbuild call, influenced by IsReleaseBuild=true,
created the clean release version package (e.g.,
Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg).
- ren Command Failure: For the x86 and arm64 builds, the NuPackScript
contains a command like:
    ```Bash
    ren Microsoft.ML.OnnxRuntime.DirectML.* win-dml-x86.zip
    ``` 
This command fails when two files match the pattern
Microsoft.ML.OnnxRuntime.DirectML.* (the dev package and the release
package), as ren requires a single source file when using wildcards for
renaming.
- Result: This caused build failures specifically when attempting to
create release candidates or final release builds for x86 and arm64 DML
components. This issue did not typically occur in regular nightly builds
(IsReleaseBuild: false) because only one package (the dev version) was
likely produced, allowing the ren command to succeed. Therefore we only
found the problem when doing a patch release for ONNX Runtime 1.21.
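The wildcard failure mode can be reproduced in miniature. A hedged sketch (the package names follow the examples above; the directory is temporary and illustrative):

```python
import glob
import os
import tempfile

# Recreate the release-build situation: two files match the wildcard that
# `ren` is given, so a single-target rename has no unambiguous source file.
with tempfile.TemporaryDirectory() as d:
    for name in (
        "Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg",
        "Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg",
    ):
        open(os.path.join(d, name), "w").close()
    matches = glob.glob(os.path.join(d, "Microsoft.ML.OnnxRuntime.DirectML.*"))

print(len(matches))  # 2 -> this is where `ren ... win-dml-x86.zip` would fail
```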

(@amarin16, the release manager of ONNX Runtime 1.21, found the issue
and explained to us why the pipeline was not working.)

The change is relatively simple. This PR removes the `--build_nuget`
flag from the `python tools/ci_build/build.py` command within the
dml-vs-2022.yml template. By removing the redundant packaging step from
build.py, only the NuPackScript's msbuild command generates a package
file. This ensures only one file matches the
Microsoft.ML.OnnxRuntime.DirectML.* pattern, allowing the subsequent ren
command in the x86 and arm64 scripts to execute successfully during
release builds.

# Background (how the DML packaging pipeline works)

The build has two stages:

1. Individual Architecture Builds (Using dml-vs-2022.yml): Each stage
(x64, x86, arm64) runs, now reliably using only its specific
NuPackScript to generate its artifact without the risk of the ren
command failing during release.
x64 produces: Microsoft.ML.OnnxRuntime.DirectML.[version].nupkg
x86 produces: win-dml-x86.zip
arm64 produces: win-dml-arm64.zip
(arm32 is not built/included).
2. Final Packaging Stage (e.g., stages/nuget_dml_packaging_stage.yml):
Downloads these artifacts and combines them by unpacking the base x64
.nupkg, injecting the contents of the .zip files into the appropriate
runtimes/ directories (e.g., runtimes/win-x86/native/,
runtimes/win-arm64/native/), and re-packing the final,
multi-architecture Microsoft.ML.OnnxRuntime.DirectML.nupkg.

In stage 1 only x64 produces a NuGet package. Therefore the MSBuild
parameter `/p:IsReleaseBuild=${{ parameters.IsReleaseBuild }}` is
passed to all architectures' MSBuild calls, while
`/p:CurrentData=$(BuildDate) /p:CurrentTime=$(BuildTime)` are passed
only in the x64 script. (The property name "CurrentData" is apparently a
typo; it should be `CurrentDate`.)
### Description
<!-- Describe your changes. -->
Make test `CApiTest.RequestLoadCancellation` deterministic by removing
the `terminator` thread.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The test contributes to CI failures
### Description

This change allows NPM tests to run the Node.js binding for WebGPU. This
makes test failures much easier to debug, because WebAssembly is
generally very difficult to debug.

Steps to debug:

1. Build:
   - {ORT_ROOT}> build --config Debug --use_webgpu --build_nodejs
   - {ORT_ROOT}\js\web> npm ci
   - {ORT_ROOT}\js\web> npm run pull:wasm
2. Run `npm test -- <args> -b=webgpu -e=node` once. (This command
generates the necessary .js files and `testdata-config.json`.)
3. Use a native debugger to debug:
   ```
C:\Program Files\nodejs\node.exe
{ORT_ROOT}\js\node_modules\mocha\bin\_mocha --timeout 999999 --colors -r
{ORT_ROOT}\js/web/dist/ort.node.min.js {ORT_ROOT}\js/web/test/test-main
   ```
### Description

`MlasTranspose` was previously running single-threaded, which
resulted in suboptimal performance on multi-threaded CPUs. To address
this, I have modified it to utilize multi-threading.
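The parallelization idea can be sketched outside of MLAS. A minimal Python illustration, assuming a simple row-band partitioning (the actual MLAS work-splitting heuristics differ):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def transpose_mt(src: np.ndarray, num_threads: int = 4) -> np.ndarray:
    """Transpose a 2-D array by giving each thread a disjoint band of rows."""
    rows, cols = src.shape
    dst = np.empty((cols, rows), dtype=src.dtype)

    def work(r0: int, r1: int) -> None:
        # Each thread writes a disjoint column band of dst, so no locking needed.
        dst[:, r0:r1] = src[r0:r1, :].T

    step = (rows + num_threads - 1) // num_threads
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(work, r0, min(r0 + step, rows))
                   for r0 in range(0, rows, step)]
        for f in futures:
            f.result()  # propagate any worker exception
    return dst

x = np.arange(12, dtype=np.float32).reshape(3, 4)
assert np.array_equal(transpose_mt(x), x.T)
```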

### Motivation and Context

We encountered this issue while running the
[multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large),
which was converted to ONNX format and executed on a multi-core CPU
(Xeon 6338). Below are the performance metrics before and after the
modification:

| | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE | Duration time [sec] |
| ------ | ----------------- | ----------------- | ------------ | ---------- | ------------------- |
| BEFORE | 1 | 16 | 512 | 4 | 1.24 |
| AFTER | 1 | 16 | 512 | 4 | 1.09 |

Condition
- FP32
- CPUExecutionProvider

This change resulted in a performance improvement of approximately 14%.
Stand-alone `MlasTranspose` performance improvements are as follows:

| | INTRA_NUM_THREADS | BEFORE | AFTER |
| -------------------- | ----------------- | ----------- | ---------- |
| MlasTranspose [msec] | 16 | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is about 15-16x faster.
On Qualcomm Adreno X1 GPUs, the previous implementation of the
FlashAttentionProgram shader in the WebGPU backend was causing high
register pressure, leading to performance degradation. This PR uses
workgroup memory to reduce the register pressure and improve
performance.

TTFT for phi4 with 1K inputs improves from 40s to 10s on a Qualcomm
Adreno X1 GPU.
### Description
1. Transform INT64 shape of Expand Op to INT32 shape.
2. Add Unit test to check INT64 Shape conversion to INT32 by QNN EP.


### Motivation and Context
QNN doesn't support an INT64 shape for the Expand op. This commit lets Expand ops
with an INT64 shape run on the QNN EP, which improves inference time.
### Description

- fix a bug in ConvTranspose

This bug caused `input_channels_per_group_int` to be `-3` for a test
case, which later caused a loop of `4294967293` iterations (`uint32_t(-3)`),
resulting in a timeout.
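The wraparound is easy to demonstrate. A minimal sketch of the same reinterpretation using NumPy's fixed-width integers:

```python
import numpy as np

# When a signed per-group channel count comes out as -3 and is then used
# as a uint32_t loop bound, the value wraps around to 2**32 - 3.
signed_bound = np.int32(-3)
unsigned_bound = signed_bound.astype(np.uint32)

print(int(unsigned_bound))  # 4294967293
```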

- fix cache hint of Conv2dMMProgram

After fixing the bug in ConvTranspose, more cache hint inconsistencies
are revealed. This change fixes channel_last missing in the cache hint
of Conv2dMMProgram.
1. Migrate the OpenVINO pipeline to GitHub Actions.
2. Update the OpenVINO pipeline's Dockerfile to use AlmaLinux 8 instead
of Ubuntu, to align with the other Linux CI pipelines. (We cannot
pull images from Docker Hub because that requires a paid account.)
### Description
Add InstanceNormalization operator to WebGPU EP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ite-default (#24396)

Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite)
from 6.2.5 to 6.2.6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/releases">vite's
releases</a>.</em></p>
<blockquote>
<h2>v6.2.6</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md">vite's
changelog</a>.</em></p>
<blockquote>
<h2><!-- raw HTML omitted -->6.2.6 (2025-04-10)<!-- raw HTML omitted
--></h2>
<ul>
<li>fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)
(<a
href="https://github.com/vitejs/vite/commit/3bb0883d22d59cfd901ff18f338e8b4bf11395f7">3bb0883</a>),
closes <a
href="https://redirect.github.com/vitejs/vite/issues/19830">#19830</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vitejs/vite/commit/d3dbf25fd5e21448f9ea6cec8fb5ac45d220037b"><code>d3dbf25</code></a>
release: v6.2.6</li>
<li><a
href="https://github.com/vitejs/vite/commit/3bb0883d22d59cfd901ff18f338e8b4bf11395f7"><code>3bb0883</code></a>
fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)</li>
<li>See full diff in <a
href="https://github.com/vitejs/vite/commits/v6.2.6/packages/vite">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=vite&package-manager=npm_and_yarn&previous-version=6.2.5&new-version=6.2.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Update protobuf-java to 3.25.5



### Motivation and Context
To fix the [CG
issue](https://aiinfra.visualstudio.com/Lotus/_componentGovernance/218239/alert/12112143?typeId=29309793&pipelinesTrackingFilter=0).

Change file links

- [x] java_linux_final_test.sh -> java-cuda-packaging-stage.yml
(Jar_Packaging_GPU stage from Zip-Nuget)
- [ ] final-jar-testing.yml (Final_Jar_Testing_$ stages)
### Description
- Adds C/C++ API functionality to compile a model (i.e., generate a
model with EPContext nodes) using explicit APIs.
- Adds support for compiling when input or output models are in memory
(not just files).
- Allows specifying the threshold for when initializers are stored in an
external file.
- Allows file paths of arbitrary length (session_options key/value
configs limit string length to 2048).

List of C API functions:
```C++
ORT_API(const OrtCompileApi*, GetCompileApi);

ORT_API(void, ReleaseModelCompilationOptions, _Frees_ptr_opt_ OrtModelCompilationOptions*);
ORT_API2_STATUS(CreateModelCompilationOptionsFromSessionOptions, _In_ const OrtEnv* env,
                _In_ const OrtSessionOptions* session_options, _Outptr_ OrtModelCompilationOptions** out);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* input_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelFromBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const void* input_model_data, size_t input_model_data_size);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* output_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelExternalInitializersFile,
                _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* external_initializers_file_path,
                size_t external_initializer_size_threshold);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
                _Inout_ OrtAllocator* allocator, void** output_model_buffer_ptr, size_t* output_model_buffer_size_ptr);
ORT_API2_STATUS(ModelCompilationOptions_SetEpContextEmbedMode, _In_ OrtModelCompilationOptions* model_compile_options,
                bool embed_ep_context_in_model);
ORT_API2_STATUS(CompileModel, _In_ const OrtEnv* env, _In_ const OrtModelCompilationOptions* model_options);
```

Example (see unit tests for others):
```C++
#include "onnxruntime_cxx_api.h"

// Test using the CompileModel() API with settings:
//   - input model from buffer
//   - output model file
//   - EPContext nodes in output model use embedded binary blobs.
TEST_F(QnnHTPBackendTests, CompileApi_FromSessionOptions_InputModelAsBuffer_Embedded) {
  const ORTCHAR_T* output_model_file = ORT_TSTR("./qnn_context_binary_multi_partition_test.onnx");
  std::filesystem::remove(output_model_file);

  // Initialize session options with QNN EP
  Ort::SessionOptions session_options;
  ProviderOptions provider_options;
#if defined(_WIN32)
  provider_options["backend_path"] = "QnnHtp.dll";
#else
  provider_options["backend_path"] = "libQnnHtp.so";
#endif
  provider_options["offload_graph_io_quantization"] = "0";
  session_options.AppendExecutionProvider("QNN", provider_options);

  // Create model compilation options from the session options.
  Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
  compile_options.SetInputModelFromBuffer(reinterpret_cast<const void*>(model_data.data()), model_data.size());
  compile_options.SetOutputModelPath(output_model_file);
  compile_options.SetEpContextEmbedMode(true);

  // Compile the model.
  Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
  ASSERT_TRUE(status.IsOK());

  // Make sure the compiled model was generated and has the expected number of EPContext nodes.
  ASSERT_TRUE(std::filesystem::exists(output_model_file));
  CheckEpContextNodeCounts(output_model_file, 2, 2);
}
```


### Motivation and Context
Improve compilation workflow and add new capabilities.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Add 8-bit support to the MatMulNBits quantizer. matmul_4bits_quantizer
can now quantize a constant B input of a MatMul to an 8-bit initializer.

### Motivation and Context
MatMul4Bits has an accuracy issue for the phi-4 model used for Foundry
Local. An early prototype showed that >= 6 bits can fix the issue.
To mitigate the issue as soon as possible, add 8-bit support to
MatMulNBits.
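A minimal sketch of what blockwise 8-bit weight quantization looks like, assuming symmetric quantization with per-block scales (the real MatMulNBits quantizer's packing, parameters, and zero-point handling differ):

```python
import numpy as np

def quantize_weights_8bit(w: np.ndarray, block_size: int = 32):
    """Blockwise symmetric 8-bit quantization of a weight matrix (per-row blocks)."""
    rows, cols = w.shape
    assert cols % block_size == 0, "cols must be a multiple of block_size"
    blocks = w.reshape(rows, cols // block_size, block_size)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_weights_8bit(w)
err = np.abs(dequantize(q, s) - w).max()
assert err < 0.05  # rounding error is bounded by half a quantization step
```

With 8 bits the worst-case rounding error per block is half a step (max|w| / 254), which is why >= 6 bits can recover accuracy that 4-bit quantization loses.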
### Description

There are two benefits to this change:
- The comments contain "Σ", a Unicode character that causes `std::wclog`
to fail and stop emitting subsequent logs in a Windows native app unless
UTF-8 is explicitly enabled via `std::wclog.imbue(std::locale(".UTF-8"));`.
Moving the comments out resolves the problem.
- It makes the WGSL code slightly shorter.
### Description
<!-- Describe your changes. -->
Replace use of gsl::narrow with narrow to build for xnnpack with
exceptions disabled. @snnn


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address issue #24383
### Description
Support mixed precision in quantization for RTN



### Motivation and Context
More flexible for quantization
Usage:
```
customized_weight_config = {}

for i in layers_to_exclude:
    customized_weight_config["/model/layers."+str(i)+"/MatMul"] = {"bits": 8}

algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)
quant = MatMul4BitsQuantizer(
    model=onnx_model,
    block_size=32,
    is_symmetric=False,
    accuracy_level=4,
    nodes_to_exclude=nodes_to_exclude,
    algo_config=algo_config,
)
```
…4385)

### Description
<!-- Describe your changes. -->

This PR adds support for the Resize operator in cubic mode without
antialiasing (antialias=0). It supports scaling constraints of the form
[1, scale_h, scale_w, 1], where scale_h ≥ 1 and scale_w ≥ 1.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The ARM64 Conv supports FP16, and we have an NhwcTransformer that
transforms FP16 Conv to FP16 NhwcFusedConv. As a result, the subsequent
Resize op also uses the NHWC format.
fs-eire and others added 26 commits April 24, 2025 11:01
### Description

Update N-API version to 6.

- NAPI v6 is required for `napi_set_instance_data` and
`napi_get_instance_data`, as used by #24366
- Added the "binary" field in package.json so CMake-js works
correctly (it was unintentionally removed in #24418)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fix compilation issue (undeclared identifier) in Azure EP unit test.



### Motivation and Context
A previous PR caused a compilation issue in the Azure EP unit test:
#24433

Our PR CI pipelines did not catch it. It was caught by our post-merge
packaging pipelines.

```shell
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(28,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(29,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(30,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
```
### Description
If it would improve performance, this patch moves outputs to MLTensor
backed Tensors.

### Motivation and Context
We are currently performing an extra copy on output tensors located in
the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)->
JS). This patch removes this copy by moving the readback to JS instead
of wasm. As an extra benefit, we can also start the readbacks and wait
for them in parallel.

This change is similar to #23073
### Description

Fix Nodejs binding build for Linux.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description

MatmulTransposeFusion does not work correctly when inputs A and B of a
`MatMul` node are the same.


![image](https://github.com/user-attachments/assets/48a6afd8-13d0-48d4-b86f-53a866c47803)

Fixes #24341

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The `zeros_` memory buffer was uninitialized, but it must be initialized
to zero.


### Motivation and Context
A memory allocator change in GenAI started crashing in FlashAttention,
and this was eventually tracked down to be the cause. The allocator
change was innocent. I'm not sure how this didn't fail previously, or
whether it did and we simply weren't getting reports about it.

Co-authored-by: Ryan Hill <{ID}+{username}@users.noreply.github.com>
### Description
Mapping ORT verbose logging back to QnnGpu Debug logging.

### Motivation and Context
As of now, this change is required for the QnnGpu backend to run models
correctly. Its necessity is mentioned in commit b4b5a79. It temporarily
reverts commit 9d45b9a, for the GPU case only, due to loss of
functionality.
### Description

Update the Node.js binding documentation for the 1.22 release.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Handle empty input cases in the native reduce kernel.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
… EP (#24406)

### Description
A new overload of CreateProvider() was added to the OpenVINO EP to
handle the extraction of EP options from the session option
configurations.


### Motivation and Context
Allows use of new Compile API.
Refer to #24207
### Description
A TensorProto may have external data in an existing memory buffer. For
those TensorProtos, the 'location' field of the external data info is set
to a special marker `*/_ORT_MEM_ADDR_/*`, and the 'offset' field contains
the address of the memory buffer.

This PR allows the DirectML EP to recognize in-memory external data
TensorProtos and use the address of the existing memory buffer containing the
external data.
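The recognition logic can be sketched as follows. The marker string comes from the description above; `ExternalDataInfo` is a hypothetical stand-in for the parsed proto fields, not the actual ORT type:

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch only: parsed external-data fields of a TensorProto.
struct ExternalDataInfo {
  std::string location;  // file path, or the in-memory marker
  uint64_t offset;       // file offset, or the buffer address for in-memory data
};

constexpr const char* kMemAddrMarker = "*/_ORT_MEM_ADDR_/*";

// Returns the existing buffer when the tensor's data already lives in memory,
// so an EP can use it directly instead of reading from a file.
inline const void* TryGetInMemoryData(const ExternalDataInfo& info) {
  if (info.location != kMemAddrMarker) return nullptr;
  return reinterpret_cast<const void*>(static_cast<uintptr_t>(info.offset));
}
```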

### Motivation and Context
Applications using ModelEditor API may create initializers with existing
buffer to save memory, such as WebNN. This fix allows the DirectML EP to be
used by those applications.

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
### Description

Update the packaging pipeline to include the corresponding Nuget version
info for Node.js binding.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…am (#24390)

### Description
Supports batch and zero points in MatMulNBits WideTileProgram



### Motivation and Context
See above
### Description
Add validation to the CreateSessionFromArray path: if ep.context_enable is set,
ep.context_file_path is required; otherwise report an error, because ORT
doesn't know where to generate the _ctx.onnx file.
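A sketch of the described check (the function and parameter names are illustrative, not the ORT internals): when the model comes from a byte array there is no source path from which to derive the `_ctx.onnx` output name, so the path must be given explicitly.

```cpp
#include <string>

// Returns an error message, or an empty string when the options are valid.
inline std::string ValidateEpContextOptions(bool loaded_from_array,
                                            bool context_enable,
                                            const std::string& context_file_path) {
  if (loaded_from_array && context_enable && context_file_path.empty()) {
    return "ep.context_file_path must be set when ep.context_enable is used "
           "with CreateSessionFromArray";
  }
  return {};
}
```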
…ith the design doc (#24461)

### Description
Update the generated Qnn context binary file name to align with the EPContext design doc https://onnxruntime.ai/docs/execution-providers/EP-Context-Design.html
…view) (#24457)

### Description
This PR introduces a new provider option called `enable_causallm` for
OVEP.

This provider option will serve as an entry gate for enabling inference
through the ORT GenAI integration with OVEP in an upcoming PR.
…ession options (#24445)

### Description
A new overload of CreateProvider() was added to handle the extraction of EP
options from the session option configurations.

### Motivation and Context
Allows use of new Compile API.
Refer to #24207
### Description
Upgrade Transformers to 4.48.0 for llama2. This version deprecates the old
past_key_value format in favor of DynamicCache, so we need to add patches to
the dynamo exporter for llama2.

Thanks to @xadupre, who made the changes to add the patches to the dynamo
exporter, implemented patches for transformers 4.48.0 that don't export, and
converted dynamic_axes into dynamic shapes.

---------

Co-authored-by: xadupre <xadupre@microsoft.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…24416)

### Description
Adds session config option (`"session.disable_model_compile"`) that
disables model compilation during session initialization.

If this option is set to "1", inference session creation will fail with
error code ORT_MODEL_REQUIRES_COMPILATION if compilation is required to
run the model on any Execution Provider added to the session. Only the
following kinds of models are valid when this option is set to "1":
- Pre-compiled models that have EPContext nodes for the compiling
Execution Providers in the session.
- Non-compiled models that run only on non-compiling Execution
Providers, like CPU EP.

### Example usage
The following example (taken from a unit test) tries to load a model
that requires compilation with a session that disables compilation. The
session creation fails with error code `ORT_MODEL_REQUIRES_COMPILATION`.
Then, the example compiles the model and loads the compiled model
successfully.

```C++
  // Taken from a unit test ...

  // Initialize session options with QNN EP
  Ort::SessionOptions session_options;
  ProviderOptions provider_options;
  provider_options["backend_type"] = "htp";
  provider_options["offload_graph_io_quantization"] = "0";

  session_options.AppendExecutionProvider("QNN", provider_options);
  session_options.AddConfigEntry(kOrtSessionOptionsDisableEpCompile, "1");  // Disable model compilation!

  // Create an inference session that fails with error ORT_MODEL_REQUIRES_COMPILATION
  try {
    Ort::Session session(*ort_env, input_model_file, session_options);
    FAIL() << "Expected Session creation to fail but it succeeded";  // Should not get here!
  } catch (const Ort::Exception& excpt) {
    OrtErrorCode error_code = excpt.GetOrtErrorCode();
    std::string_view error_msg = excpt.what();
    ASSERT_EQ(error_code, ORT_MODEL_REQUIRES_COMPILATION);
    ASSERT_THAT(error_msg, testing::HasSubstr(kQnnExecutionProvider));
  }

  // Session creation failed because the model was not pre-compiled.
  // Try to compile it now.

  // Create model compilation options from the session options.
  Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
  compile_options.SetInputModelPath(input_model_file);
  compile_options.SetOutputModelPath(output_model_file);

  // Compile the model.
  Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
  ASSERT_TRUE(status.IsOK()) << status.GetErrorMessage();

  // Should be able to create a session with the compiled model and the original session options.
  Ort::Session session(*ort_env, output_model_file, session_options);
```

### Motivation and Context
Compiling models can take a very long time. We want a session option that
requires input models that do not need to be compiled.
…#24463)

### Description
Re-enables (and fixes) generation of compiled EpContext models with
**both** input and output models stored in buffers.

### Motivation and Context
Previous PR #24176 inadvertently added a check that disabled storing
both input and output models in buffers. However, we need this
functionality. This was actually a fortunate scenario, as it led to the
discovery of a bug.
### Description

* Rename  filename and class name since it supports 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8 bits support
for the default weight only quantizer.
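To illustrate why a single quantizer class can serve both bit widths, here is a hedged sketch of symmetric n-bit weight quantization (n = 4 or 8). This is not the actual HQQ quantizer from the PR, just the underlying arithmetic:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize one weight to an unsigned n-bit code with a mid-range zero point.
inline uint8_t QuantizeSymmetric(float w, float scale, int bits) {
  const int qmax = (1 << bits) - 1;        // 15 for 4-bit, 255 for 8-bit
  const int zero_point = 1 << (bits - 1);  // 8 for 4-bit, 128 for 8-bit
  int q = static_cast<int>(std::lround(w / scale)) + zero_point;
  return static_cast<uint8_t>(std::clamp(q, 0, qmax));
}

inline float DequantizeSymmetric(uint8_t q, float scale, int bits) {
  const int zero_point = 1 << (bits - 1);
  return (static_cast<int>(q) - zero_point) * scale;
}
```

Only `qmax` and `zero_point` depend on the bit width, which is why supporting 8 bits on top of 4 is mostly a parameterization change.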
…24474)

### Description
<!-- Describe your changes. -->
Use a pimpl-esque approach so that the winml OrtModel type doesn't
conflict with the model editing API OrtModel.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix crash due to linker calling the incorrect destructor when there are
two different OrtModel types in the global namespace.
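A minimal pimpl sketch of the described fix (names here are illustrative, not the winml code): the public type keeps only an opaque pointer, so its layout and destructor no longer depend on which `OrtModel` definition the linker happens to pick.

```cpp
#include <memory>
#include <string>

class WinmlModel {
 public:
  explicit WinmlModel(std::string name);
  ~WinmlModel();  // defined out of line, where Impl is complete
  const std::string& Name() const;

 private:
  struct Impl;                  // would hold the real (conflicting) type
  std::unique_ptr<Impl> impl_;
};

struct WinmlModel::Impl {
  std::string name;
};

WinmlModel::WinmlModel(std::string name) : impl_(new Impl{std::move(name)}) {}
WinmlModel::~WinmlModel() = default;
const std::string& WinmlModel::Name() const { return impl_->name; }
```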
…h to int32 (#24425)

Some WebNN backends support limited data types for the input and output
of a WebNN graph. However, they can support more data types for
intermediate nodes. To address this limitation, we implement a data type
fallback mechanism. (Note: Currently, we only support fallback to int32
for certain integer data types.)

If a data type is not supported for a graph's input or output but is
supported for intermediate nodes, we will:
1. Save the input MLTensor with 'int32' data type,
2. Convert the input data from ORT to int32,
3. Insert a cast operation into the WebNN graph to convert the input back to
its original data type,
4. Insert a cast operation into the WebNN graph to convert the output to
'int32',
5. Convert the output data from int32 back to its original data type.
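The host-side conversions in steps 2 and 5 can be sketched as below. This is a hedged illustration with assumed helper names; the casts inserted into the WebNN graph itself (steps 3 and 4) are not shown:

```cpp
#include <cstdint>
#include <vector>

// Step 2: widen a narrow integer input to int32 before handing it to the graph.
template <typename Narrow>
std::vector<int32_t> WidenToInt32(const std::vector<Narrow>& src) {
  return std::vector<int32_t>(src.begin(), src.end());
}

// Step 5: narrow the int32 output back to the tensor's original data type.
template <typename Narrow>
std::vector<Narrow> NarrowFromInt32(const std::vector<int32_t>& src) {
  std::vector<Narrow> out;
  out.reserve(src.size());
  for (int32_t v : src) out.push_back(static_cast<Narrow>(v));
  return out;
}
```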
### Description
<!-- Describe your changes. -->
Add infrastructure to enable auto EP selection.

Device discovery for CPU/GPU/NPU on Windows.
Supports internal (CPU/DML/WebGPU) and provider bridge (CUDA) EPs
currently.
Infrastructure will be used with plugin EPs next.

Selection policy implementation will be added next, so in the interim
there's a temporary function with manually specified selection so unit
tests can cover the end-to-end.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
### Description
WebNN doesn't support AveragePool with count_include_pad == 1.



### Motivation and Context
Support it by adding an explicit Pad and calling averagePool2D with pads set
to 0.
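The decomposition works because averaging with `count_include_pad == 1` over a window of size k with padding p equals explicitly zero-padding the input first, then average-pooling with no padding, so every window divides by k. A hedged 1-D sketch (the real op is 2-D):

```cpp
#include <cstddef>
#include <vector>

// Stride-1 1-D average pool emulating count_include_pad == 1 via explicit pad.
inline std::vector<float> AvgPoolCountIncludePad1D(const std::vector<float>& x,
                                                   size_t k, size_t p) {
  std::vector<float> padded(x.size() + 2 * p, 0.0f);  // explicit zero pad
  for (size_t i = 0; i < x.size(); ++i) padded[p + i] = x[i];

  std::vector<float> out;
  for (size_t i = 0; i + k <= padded.size(); ++i) {   // pool with pads = 0
    float sum = 0.0f;
    for (size_t j = 0; j < k; ++j) sum += padded[i + j];
    out.push_back(sum / static_cast<float>(k));       // divisor includes pad
  }
  return out;
}
```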
### Description
<!-- Describe your changes. -->
Fix some issues.
Use adapter number instead of bus number. Bus number doesn't work as
expected on VMs.
Disable for XBOX build. Needs different handling for adapter lookup. 
Use adapter number as device_id when creating DML OrtEpDevice.
Fix some issues with the metadata. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
@ashrit-ms ashrit-ms requested review from a team as code owners April 24, 2025 18:19
@ashrit-ms ashrit-ms merged commit dafcb6a into win-ort-main Apr 24, 2025
16 of 18 checks passed
@ashrit-ms ashrit-ms deleted the ashritms/update-to-rel-1.22.0 branch April 24, 2025 18:20