Backmerging with Msft commits #675

Merged
ankitm3k merged 37 commits into ovep-develop from sync_msft_24_4_25
Apr 24, 2025
Conversation

@jatinwadhwa921

Backmerging with Msft commits

adrianlizarraga and others added 30 commits April 17, 2025 08:23
… EP (microsoft#24406)

### Description
A new overload of CreateProvider() was added to the OpenVINO EP to
handle the extraction of EP options from the session option
configurations.


### Motivation and Context
Allows use of new Compile API.
Refer to microsoft#24207
### Description
TensorProto may have external data in an existing memory buffer. For those TensorProtos, the 'location' field of the external data info is set to a special marker `*/_ORT_MEM_ADDR_/*`, and the 'offset' field contains the address of the memory buffer.

This PR allows the DirectML EP to recognize in-memory external data TensorProtos and use the address of the existing memory buffer containing the external data.

### Motivation and Context
Applications using the ModelEditor API, such as WebNN, may create initializers with an existing buffer to save memory. This fix allows the DirectML EP to be used by those applications.
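The marker-based lookup described above can be sketched roughly as follows. This is an illustrative sketch, not the actual ORT internals: the struct and function names are hypothetical; only the marker string comes from the PR text.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Special 'location' value (from the PR text) signaling that 'offset' holds
// a raw pointer to an existing memory buffer rather than a file offset.
constexpr const char* kOrtMemAddrMarker = "*/_ORT_MEM_ADDR_/*";

// Hypothetical, simplified view of a TensorProto's external-data info.
struct ExternalDataInfo {
  std::string location;  // file path, or the in-memory marker
  uint64_t offset;       // file offset, or a pointer value when in-memory
};

inline bool IsInMemoryExternalData(const ExternalDataInfo& info) {
  return info.location == kOrtMemAddrMarker;
}

// When the marker is present, reinterpret 'offset' as the buffer address.
inline const void* GetInMemoryAddress(const ExternalDataInfo& info) {
  return reinterpret_cast<const void*>(static_cast<uintptr_t>(info.offset));
}
```

An EP that supports this convention checks the marker first and, when present, reads the weights directly from the existing buffer instead of loading them from a file.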

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
…t#24450)

### Description

Update the packaging pipeline to include the corresponding Nuget version
info for Node.js binding.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…am (microsoft#24390)

### Description
Supports batch and zero points in MatMulNBits WideTileProgram



### Motivation and Context
See above
…soft#24176)

### Description
Add validation to the CreateSessionFromArray path: if ep.context_enable is set, then ep.context_file_path is required; otherwise an error is reported because ORT doesn't know where to generate the _ctx.onnx file.
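The check can be sketched as below. The option keys come from the PR text; the function name and error type are illustrative, not the actual ORT implementation.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Illustrative sketch of the described validation: when a session is created
// from a memory buffer, ORT has no source model path from which to derive the
// _ctx.onnx location, so ep.context_file_path must accompany ep.context_enable.
inline void ValidateEpContextOptions(
    const std::map<std::string, std::string>& session_configs) {
  auto enable_it = session_configs.find("ep.context_enable");
  if (enable_it == session_configs.end() || enable_it->second != "1") {
    return;  // EPContext generation not requested; nothing to validate.
  }
  auto path_it = session_configs.find("ep.context_file_path");
  if (path_it == session_configs.end() || path_it->second.empty()) {
    throw std::invalid_argument(
        "ep.context_file_path is required when ep.context_enable is set "
        "and the session is created from a memory buffer");
  }
}
```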
…ith the design doc (microsoft#24461)

### Description
Update the generated Qnn context binary file name to align with the EPContext design doc https://onnxruntime.ai/docs/execution-providers/EP-Context-Design.html
…view) (microsoft#24457)

### Description
This PR introduces a new provider option called `enable_causallm` for
OVEP.

This provider option serves as an entry gate for enabling inference via the ORT GenAI integration with OVEP in an upcoming PR.
…ession options (microsoft#24445)

### Description
A new overload of CreateProvider() was added to handle the extraction of EP options from the session option configurations.

### Motivation and Context
Allows use of new Compile API.
Refer to microsoft#24207
### Description
Upgrade Transformers to 4.48.0 for llama2. This version deprecates the old past_key_value format in favor of DynamicCache, so we need to add patches to the dynamo exporter for llama2.

Thanks to @xadupre, who made the changes to add the patches to the dynamo exporter and implemented patches for transformers 4.48.0 that convert dynamic_axes into dynamic shapes.

---------

Co-authored-by: xadupre <xadupre@microsoft.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…icrosoft#24416)

### Description
Adds session config option (`"session.disable_model_compile"`) that
disables model compilation during session initialization.

If this option is set to "1", inference session creation will fail with
error code ORT_MODEL_REQUIRES_COMPILATION if compilation is required to
run the model on any Execution Provider added to the session. Only the
following kinds of models are valid when this option is set to "1":
- Pre-compiled models that have EPContext nodes for the compiling
Execution Providers in the session.
- Non-compiled models that run only on non-compiling Execution
Providers, like CPU EP.

### Example usage
The following example (taken from a unit test) tries to load a model
that requires compilation with a session that disables compilation. The
session creation fails with error code `ORT_MODEL_REQUIRES_COMPILATION`.
Then, the example compiles the model and loads the compiled model
successfully.

```C++
  // Taken from a unit test ...

  // Initialize session options with QNN EP
  Ort::SessionOptions session_options;
  ProviderOptions provider_options;
  provider_options["backend_type"] = "htp";
  provider_options["offload_graph_io_quantization"] = "0";

  session_options.AppendExecutionProvider("QNN", provider_options);
  session_options.AddConfigEntry(kOrtSessionOptionsDisableEpCompile, "1");  // Disable model compilation!

  // Create an inference session that fails with error ORT_MODEL_REQUIRES_COMPILATION
  try {
    Ort::Session session(*ort_env, input_model_file, session_options);
    FAIL() << "Expected Session creation to fail but it succeeded";  // Should not get here!
  } catch (const Ort::Exception& excpt) {
    OrtErrorCode error_code = excpt.GetOrtErrorCode();
    std::string_view error_msg = excpt.what();
    ASSERT_EQ(error_code, ORT_MODEL_REQUIRES_COMPILATION);
    ASSERT_THAT(error_msg, testing::HasSubstr(kQnnExecutionProvider));
  }

  // Session creation failed because the model was not pre-compiled.
  // Try to compile it now.

  // Create model compilation options from the session options.
  Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
  compile_options.SetInputModelPath(input_model_file);
  compile_options.SetOutputModelPath(output_model_file);

  // Compile the model.
  Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
  ASSERT_TRUE(status.IsOK()) << status.GetErrorMessage();

  // Should be able to create a session with the compiled model and the original session options.
  Ort::Session session(*ort_env, output_model_file, session_options);
```

### Motivation and Context
Compiling models can take a very long time. We want a session option that requires input models that do not need to be compiled.
…microsoft#24463)

### Description
Re-enables (and fixes) generation of compiled EpContext models with
**both** input and output models stored in buffers.

### Motivation and Context
Previous PR microsoft#24176 inadvertently added a check that disabled storing
both input and output models in buffers. However, we need this
functionality. This was actually a fortunate scenario, as it led to the
discovery of a bug.
…oft#24472)

### Description

* Rename  filename and class name since it supports 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
microsoft#24384 added 8 bits support
for the default weight only quantizer.
…icrosoft#24474)

### Description
<!-- Describe your changes. -->
Use a pimpl-esque approach so that the winml OrtModel type doesn't
conflict with the model editing API OrtModel.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix crash due to linker calling the incorrect destructor when there are
two different OrtModel types in the global namespace.
…h to int32 (microsoft#24425)

Some WebNN backends support limited data types for the input and output
of a WebNN graph. However, they can support more data types for
intermediate nodes. To address this limitation, we implement a data type
fallback mechanism. (Note: Currently, we only support fallback to int32
for certain integer data types.)

If a data type is not supported for a graph's input or output but is
supported for intermediate nodes, we will:
1. Save the input MLTensor with 'int32' data type,
2. Convert the input data from ORT to int32,
3. Insert a cast operation into the WebNN graph to convert the input back to its original data type,
4. Insert a cast operation into the WebNN graph to convert the output to 'int32',
5. Convert the output data from int32 back to its original data type.
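The data-side conversions (steps 2 and 5 above) amount to a lossless round-trip through int32. A minimal sketch, assuming a uint8 tensor as an example of a type unsupported at the graph boundary (helper names are illustrative, not the WebNN EP's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Step 2: widen the ORT input data to int32 before handing it to WebNN.
// Every uint8 value is exactly representable in int32, so no data is lost.
inline std::vector<int32_t> WidenToInt32(const std::vector<uint8_t>& input) {
  return std::vector<int32_t>(input.begin(), input.end());
}

// Step 5: narrow the int32 output data back to the tensor's original type.
// Values fit by construction, since the graph-side Cast produced them from
// the original type's range.
inline std::vector<uint8_t> NarrowFromInt32(const std::vector<int32_t>& output) {
  std::vector<uint8_t> result;
  result.reserve(output.size());
  for (int32_t v : output) {
    result.push_back(static_cast<uint8_t>(v));
  }
  return result;
}
```

Steps 3 and 4 are the graph-side mirror of these conversions: Cast nodes inserted into the WebNN graph itself.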
### Description
<!-- Describe your changes. -->
Add infrastructure to enable auto EP selection.

Device discovery for CPU/GPU/NPU on Windows.
Supports internal (CPU/DML/WebGPU) and provider bridge (CUDA) EPs
currently.
Infrastructure will be used with plugin EPs next.

Selection policy implementation will be added next, so in the interim
there's a temporary function with manually specified selection so unit
tests can cover the end-to-end.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
)

### Description
WebNN doesn't support AveragePool with count_include_pad == 1.



### Motivation and Context
Support it by adding a pad and calling averagePool2D with pads as 0's.
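The equivalence behind this workaround can be checked numerically. Zero-padding explicitly and then pooling with pads of 0 matches count_include_pad == 1, because the padded zeros add nothing to the sum while the divisor stays the full window size. A 1-D sketch (the function name is illustrative; real AveragePool is N-D with strides):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average pooling with count_include_pad == 1 semantics, implemented via the
// workaround: an explicit zero Pad followed by plain pooling with no pads
// (stride 1, valid windows only).
inline std::vector<double> AvgPool1DIncludePad(const std::vector<double>& x,
                                               size_t window, size_t pad) {
  // The inserted Pad op: zeros on both ends.
  std::vector<double> padded(pad, 0.0);
  padded.insert(padded.end(), x.begin(), x.end());
  padded.insert(padded.end(), pad, 0.0);
  // averagePool with pads as 0's: the divisor is always the full window
  // size, which is exactly what count_include_pad == 1 requires.
  std::vector<double> out;
  for (size_t i = 0; i + window <= padded.size(); ++i) {
    double sum = 0.0;
    for (size_t j = 0; j < window; ++j) sum += padded[i + j];
    out.push_back(sum / window);
  }
  return out;
}
```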
### Description
<!-- Describe your changes. -->
Fix some issues.
Use adapter number instead of bus number. Bus number doesn't work as
expected on VMs.
Disable for XBOX build. Needs different handling for adapter lookup. 
Use adapter number as device_id when creating DML OrtEpDevice.
Fix some issues with the metadata. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…#24411)

If LD_LIBRARY_PATH is not defined, a blank "-L" is added to the link command. This causes the next object to be linked to be treated as if it were a search path, causing a link failure.

Signed-off-by: Andrew Davis <afd@ti.com>
Signed-off-by: Clément Péron <peron.clem@gmail.com>
Co-authored-by: Andrew Davis <afd@ti.com>
### Description
<!-- Describe your changes. -->
For QNN-EP, rewrite FP-to-Bool Cast as NotEqual.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
HTP currently does not support FP-to-Bool Cast due to some limitations.
To unblock CLIP models, replace such Cast with NotEqual to achieve the
same functionality.
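The two ops agree on every float input: Cast(float -> bool) maps any nonzero value (including NaN and infinities) to true, and NotEqual(x, 0) computes the same predicate, since NaN != 0 is also true and -0.0f compares equal to 0.0f. A small sketch of the equivalence (function names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Reference Cast(float -> bool) semantics, spelled out explicitly:
// zero (including -0.0f) is false; everything else, NaN included, is true.
inline bool CastFloatToBool(float x) {
  if (std::isnan(x)) return true;  // NaN is nonzero, hence true
  return x != 0.0f;                // -0.0f == 0.0f, hence false
}

// The replacement: a NotEqual node comparing against a zero constant.
inline bool NotEqualZero(float x) { return x != 0.0f; }
```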

Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
### Description
<!-- Describe your changes. -->
Add script to make it easier to manually trigger the workflows. Similar
to run_CIs_for_branch.py. Uses `gh` command line tool.

A workflow can be triggered if it has `workflow_dispatch` enabled.
Currently only these workflows can be triggered:

```
Android CI
iOS_CI_on_Mac
Linux OpenVINO CI
MacOS CI Pipeline
Update C/C++ API Docs
Update C# API Docs
Publish site
Update Java API Docs
Update JS API Docs
Update Objective-C API Docs
Update Python API Docs
Web CI Pipeline
ONNX Runtime CUDA Builds
ONNX Runtime DirectML Builds
Windows OpenVINO CI Pipeline
Windows GPU TensorRT CI Pipeline
ONNX Runtime WebGPU Builds
```

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Excludes QnnGpu.dll from Windows x64 NuGet package because it is not
available for that architecture.



### Motivation and Context
Fix failure in QNN packaging pipeline:
```shell
CreateNativePackage:
  Generating nuspec for the native Microsoft.ML.OnnxRuntime.QNN nuget package...
  python ..\tools\nuget\generate_nuspec_for_native_nuget.py --package_version 1.22.0-dev-20250421-0439-2abab8d --package_name Microsoft.ML.OnnxRuntime.QNN --target_architecture x64 --build_config RelWithDebInfo --native_build_path D:\a\_work\1\b\RelWithDebInfo\RelWithDebInfo --packages_path D:\a\_work\1\b\packages --ort_build_path D:\a\_work\1\b --sources_path D:\a\_work\1\s --commit_id 2abab8d --is_release_build False --execution_provider None --nuspec_name NativeNuget.nuspec
          1 file(s) copied.
          1 file(s) copied.
  nuspec_name: NativeNuget.nuspec
  Bundling native shared library artifacts into Microsoft.ML.OnnxRuntime nuget package...
  nuget pack NativeNuget.nuspec
  Attempting to build package from 'NativeNuget.nuspec'.
##[error]EXEC(0,0): Error NU5019: File not found: 'D:\a\_work\1\b\RelWithDebInfo\RelWithDebInfo\QnnGpu.dll'.
EXEC : error NU5019: File not found: 'D:\a\_work\1\b\RelWithDebInfo\RelWithDebInfo\QnnGpu.dll'. [D:\a\_work\1\s\csharp\OnnxRuntime.CSharp.proj]
##[error]csharp\OnnxRuntime.CSharp.proj(109,5): Error MSB3073: The command "nuget pack NativeNuget.nuspec" exited with code 1.
```

Introduced by this PR:
microsoft#24435
…#24493)

### Description
Removes unnecessary std::move on an r-value expression. This caused a
compiler warning/error in the Linux Android QNN pipeline.



### Motivation and Context
Introduced by PR: microsoft#24466
`onnx.mapping` was deprecated and is being removed. This PR removes the deprecated usage.

@MaanavD would be good if this can make it into 1.22.0 for forward ONNX
release (1.19+) compatibility.
Increases operator coverage for WebGPU EP.
…on Maven) (microsoft#24494)

### Description
Updates the Android QNN package to use QNN SDK 2.33.0, which is
available on Maven. QNN SDK 2.33.2 is not available yet on Maven:
https://mvnrepository.com/artifact/com.qualcomm.qti/qnn-runtime


### Motivation and Context
Previous PR that updated QNN SDK version:
microsoft#24440
### Description
<!-- Describe your changes. -->

It looks like the test case `test_convtranspose_autopad_same` in ONNX
opset14(v1.9) is generated incorrectly:
onnx/onnx#3440

The problem is already fixed in opset15+(v1.10).

So, this PR fixes 2 problems:
- disable the test case for opset14 in onnxruntime-web tests (still run the same test for opset15 and opset17)
- re-enable the test case by removing it from "current_failing_tests" in onnxruntime\test\testdata\onnx_backend_test_series_filters.jsonc (which should have been done when upgrading to onnx v1.10)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
… them manually. (microsoft#24505)

### Description
<!-- Describe your changes. -->

Add `workflow_dispatch` event to GitHub Actions workflows so we can run
them manually.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Make it easier to run workflows during development.
…cpkg (microsoft#24012)

1. Patch ONNX to support minimal build
2. Improve ort-web's vcpkg build scripts. 

The generate_vcpkg_triplets_for_emscripten function in
tools\python\util\vcpkg_helpers.py didn't process the enable_rtti
condition. So, when it is true, we should add -fno-rtti to cxxflags

Fix an issue related to DISABLE_EXCEPTION_CATCHING. Make it clear that
there are three modes:

1. No EH (-fno-exceptions, -sDISABLE_EXCEPTION_CATCHING=1):
Set enable_minimal_onnx_build=True, enable_wasm_exception_catching=False
2. Full EH (-fexceptions, -sDISABLE_EXCEPTION_CATCHING=0):
Set enable_minimal_onnx_build=False, enable_wasm_exception_catching=True
3. Throw Only EH (-fexceptions, -sDISABLE_EXCEPTION_CATCHING=1):
Set enable_minimal_onnx_build=False,
enable_wasm_exception_catching=False

Debug builds should only use the second mode.
In release builds, emscripten by default disables catching C++ exceptions (specifically, emitting catch blocks). That's the second case. In a normal release build (what we ship):
- Usually enable_wasm_api_exception_catching is set to true
- So disable_wasm_exception_catching is also true
- So onnxruntime_ENABLE_WEBASSEMBLY_EXCEPTION_CATCHING is false
- So the flag DISABLE_EXCEPTION_CATCHING should not be set, because by default it is true; we should not have "-sDISABLE_EXCEPTION_CATCHING=0"

But we do not want to rely on the default value, so the vcpkg_helpers.py script still explicitly sets DISABLE_EXCEPTION_CATCHING to 1.

In onnxruntime_webassembly.cmake currently we have 
```cmake
  if (NOT onnxruntime_ENABLE_WEBASSEMBLY_MEMORY64)
    target_link_options(onnxruntime_webassembly PRIVATE "SHELL:-s DISABLE_EXCEPTION_THROWING=0")
  endif()
```
But I think we need to set DISABLE_EXCEPTION_THROWING to 1 when the build is in the first mode (No EH). This PR also resolves microsoft#24279, because vcpkg has native support for cross-compiling, so users do not need to specify a custom protoc path.
Reverts microsoft#24372

The above PR removes the `build-nuget` command-line argument from the
`dml-vs-2022.yml` file. This PR reverts that change and adds the
`build-nuget` back to the file.


The `--build_nuget` option creates the
`csharp\src\Microsoft.ML.OnnxRuntime\bin\RelWithDebInfo` directory
structure and stores binaries in there. There's a subsequent task in the
yaml file that tries to sign DLLs in the
`csharp\src\Microsoft.ML.OnnxRuntime\bin\RelWithDebInfo` directory; however, this task fails because the directory structure is now never created (due to the removal of `--build_nuget`).
…32 (microsoft#24437)

Decomposed [Skip]SimplifiedLayerNormalization loses precision in FP16, so we add Cast (to: fp32) ops around it in the WebNN EP to preserve its precision rather than manually adding cast nodes in each model file.
benjamin-hodgson and others added 7 commits April 22, 2025 19:11
…hat generates Microsoft.ML.OnnxRuntime.nuspec (microsoft#24131)

### Description
As detailed in the issue (microsoft#24130), the Microsoft.ML.OnnxRuntime nuspec
needs an additional dependency group for the `native` TFM, to allow it
to be referenced via `PackageReference` in vcxproj projects.

I took a peek at the code after filing the issue, since it seemed like
it ought to be a simple fix. It looks like the nuspec file for the
`Microsoft.ML.OnnxRuntime` package is generated by the python code in
`generate_nuspec_for_native_nuget.py`, so I just added a line of code
there.

However I'm not sure how the build system invokes this generator so I
haven't been able to test it in situ. Let me know if this isn't the
right fix!


### Motivation and Context

See detailed description in microsoft#24130
…4510)

### Description
<!-- Describe your changes. -->

Add `python_version < "3.13"` for `onnxscript` dependency in
tools/ci_build/github/linux/python/requirements.txt.

`onnxscript` has `onnx` as a dependency. Building the `onnx` wheel fails
with Python 3.13.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix pipeline build failures.
### Description
Fix ConvTranspose indexing errors and other minor changes.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Necessary to get Kokoro working in WebGPU (except Conv).
### Description
<!-- Describe your changes. -->
Adds support for MultiHeadAttention via WebNN matmul, transpose,
reshape, and other operations that follow the logic in the MHA subgraph
below

```
 Abbreviations: B is batch_size, S is sequence_length, W is hidden_size, P is past_sequence_length
               N is number of attention heads, H is head size, and W=N*H, h=Sqrt(H)
    Notes: If the datatype of the inputs (qkv and past kv) is float16, we cast them to float32 to ensure data precision.

                 query     key     value
                   |        |        |
           q_Reshape   k_Reshape   v_Reshape  (shape=B,S,H,N)
                   |        |        |
          q_Transpose  k_Transpose v_Transpose (perm=0,2,1,3)
             \           /           |
              \         /            |
present_key<---\----Concat <---------|----past_key
               |      |              |
               |  opt_k_transpose    |
               \  (0,1,3,2)          |
                \    /               |  past_value
                qk_MatMul            |     /
                     |  scale        |    /
                     |   /           |   /
                  qk_Div           Concat------> present_value
                      |              |
                      |              /
                     Add <----------/---------------attention_bias
                      |            /
                    Softmax       /
                       \         /
                        \       /
                      qkv_MatMul
                             |
                          Transpose (perm=0,2,1,3)
                             |
                          Reshape---(shape=B,P,W)
                             |
                           output
```
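A minimal single-head sketch of the computation in the diagram above (ignoring batching, the KV cache, and the Reshape/Transpose head-splitting steps, which would simply repeat this per head): out = Softmax(Q K^T / sqrt(H) + bias) V. Names and the scalar bias are illustrative simplifications.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // S rows x H columns

// Single-head scaled dot-product attention, following the subgraph:
// qk_MatMul -> qk_Div (by sqrt(H)) -> Add (attention_bias) -> Softmax
// -> qkv_MatMul.
inline Matrix Attention(const Matrix& Q, const Matrix& K, const Matrix& V,
                        float bias = 0.0f) {
  const std::size_t S = Q.size(), H = Q[0].size();
  const float scale = 1.0f / std::sqrt(static_cast<float>(H));
  Matrix out(S, std::vector<float>(H, 0.0f));
  for (std::size_t i = 0; i < S; ++i) {
    // qk_MatMul + qk_Div + Add, one row of scores at a time.
    std::vector<float> scores(S);
    float max_score = -1e30f;
    for (std::size_t j = 0; j < S; ++j) {
      float dot = 0.0f;
      for (std::size_t h = 0; h < H; ++h) dot += Q[i][h] * K[j][h];
      scores[j] = dot * scale + bias;
      max_score = std::max(max_score, scores[j]);
    }
    // Softmax (subtract the row max for numerical stability).
    float sum = 0.0f;
    for (float& s : scores) { s = std::exp(s - max_score); sum += s; }
    for (float& s : scores) s /= sum;
    // qkv_MatMul.
    for (std::size_t j = 0; j < S; ++j)
      for (std::size_t h = 0; h < H; ++h) out[i][h] += scores[j] * V[j][h];
  }
  return out;
}
```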


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The [1.22 release
branch](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0) has
been cut, so we need to update the version in main from 1.22.0 to
1.23.0.
New EP - currently based on existing TensorRT EP but meant to be used on
RTX GPUs with a lean version of TensorRT.

### Description
Adding a new EP based on TensorRT EP. This is going to use a special
version of TensorRT optimized for RTX GPUs. In the future we plan to
make changes to the EP to streamline it further (e.g, get rid of
dependency on CUDA EP completely).

### Motivation and Context
The new TensorRT for RTX is going to have:
1. Much smaller footprint 
2. Much faster model compile/load times. 
3. Better usability in terms of use of cached models across multiple RTX
GPUs.

This effort is also targeting WCR ML workflows.

---------

Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k April 24, 2025 05:47
@ankitm3k ankitm3k merged commit e59c069 into ovep-develop Apr 24, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_24_4_25 branch April 24, 2025 07:00