Backmerging with Msft commits #675
Merged
ankitm3k merged 37 commits into ovep-develop on Apr 24, 2025
Conversation
… EP (microsoft#24406) ### Description A new overload of CreateProvider() was added to the OpenVINO EP to handle the extraction of EP options from the session option configurations. ### Motivation and Context Allows use of the new Compile API. Refer to microsoft#24207
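The option-extraction step described above can be sketched in a few lines. This is a minimal stand-in, assuming a key convention of the form `ep.<lowercase_ep_name>.<option>` for provider options stored in the session config entries; the exact prefix and key names here are illustrative, not the EP's actual schema.

```python
def extract_ep_options(config_entries, ep_name):
    # Collect provider options stored under an "ep.<name>." key prefix
    # (prefix convention assumed for illustration).
    prefix = "ep." + ep_name.lower() + "."
    return {k[len(prefix):]: v
            for k, v in config_entries.items()
            if k.startswith(prefix)}

# Hypothetical session config entries mixing EP options and session options.
entries = {
    "ep.openvino.device_type": "CPU",
    "ep.openvino.num_of_threads": "4",
    "session.disable_model_compile": "1",
}
print(extract_ep_options(entries, "OpenVINO"))
```

Keys without the matching prefix (plain session options) are left untouched, which is what lets one session-options object carry both kinds of configuration.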
### Description TensorProto may have external data in an existing memory buffer. For those TensorProto, the 'location' field of the external data info is set to a special marker `*/_ORT_MEM_ADDR_/*`, and the 'offset' field contains the address of the memory buffer. This PR allows the DirectML EP to recognize in-memory external data TensorProto and use the address of the existing memory buffer containing the external data. ### Motivation and Context Applications using the ModelEditor API may create initializers from an existing buffer to save memory, such as WebNN. This fix allows the DirectML EP to be used by those applications. --------- Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
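The marker-based dispatch can be sketched as follows. This is a simplified model: `memory_pool` stands in for dereferencing a raw pointer, which is what the real EP does with the address stored in 'offset'.

```python
ORT_MEM_ADDR_MARKER = "*/_ORT_MEM_ADDR_/*"

def resolve_external_data(location, offset, memory_pool):
    # In-memory external data: 'location' holds the special marker and
    # 'offset' holds the buffer address rather than a file offset.
    if location == ORT_MEM_ADDR_MARKER:
        return memory_pool[offset]
    # Otherwise 'location' names an external data file on disk.
    raise FileNotFoundError("would read external file: " + location)

# Hypothetical address -> buffer mapping standing in for process memory.
pool = {0x1000: b"\x01\x02\x03\x04"}
print(resolve_external_data(ORT_MEM_ADDR_MARKER, 0x1000, pool))
```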
…t#24450) ### Description Update the packaging pipeline to include the corresponding Nuget version info for Node.js binding. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
…am (microsoft#24390) ### Description Supports batch and zero points in MatMulNBits WideTileProgram ### Motivation and Context See above
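For context, the arithmetic behind MatMulNBits-style zero points is blockwise dequantization: each block of quantized values shares one scale and one zero point, and the real weight is recovered as `(q - zero_point) * scale`. The sketch below shows that formula only; the actual WideTileProgram layout, batching, and packing are not modeled here.

```python
def dequant_nbits(qvals, scales, zero_points, block_size):
    # Blockwise n-bit dequantization: w = (q - zero_point) * scale,
    # with one scale/zero_point per block (layout illustrative only).
    out = []
    for i, q in enumerate(qvals):
        b = i // block_size
        out.append((q - zero_points[b]) * scales[b])
    return out

# 4-bit values (0..15), two blocks of two, zero point 8 = "unsigned midpoint".
print(dequant_nbits([0, 8, 15, 7], [0.5, 0.25], [8, 8], 2))
```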
…soft#24176) ### Description Add validation to the CreateSessionFromArray path: if ep.context_enable is set, then ep.context_file_path is required; otherwise report an error, because ORT does not know where to generate the _ctx.onnx file.
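The check itself is simple; a sketch of the described logic (function name and dict shape are illustrative, not ORT's actual API):

```python
def validate_ep_context_options(options):
    # When a model is loaded from a bytes array there is no model path to
    # derive a context file name from, so the path must be given explicitly.
    if options.get("ep.context_enable") == "1" and not options.get("ep.context_file_path"):
        raise ValueError("ep.context_file_path is required when "
                         "ep.context_enable is set for CreateSessionFromArray")

validate_ep_context_options({"ep.context_enable": "1",
                             "ep.context_file_path": "model_ctx.onnx"})  # ok
```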
…ith the design doc (microsoft#24461) ### Description Update the generated Qnn context binary file name to align with the EPContext design doc https://onnxruntime.ai/docs/execution-providers/EP-Context-Design.html
…view) (microsoft#24457) ### Description This PR introduces a new provider option called `enable_causallm` for OVEP. This provider option will serve as an entry gate toward enabling inference using ORT GenAI integration with OVEP in an upcoming PR.
…ession options (microsoft#24445) ### Description A new overload of CreateProvider() was added to handle the extraction of EP options from the session option configurations. ### Motivation and Context Allows use of the new Compile API. Refer to microsoft#24207
### Description Upgrade Transformers to 4.48.0 for llama2. This version deprecated the old format of past_key_value; the current format is DynamicCache, so we need to add patches to the dynamo exporter in llama2. Thanks to @xadupre, who made the changes to add the patches to the dynamo exporter, implemented patches for transformers 4.48.0 that otherwise don't export, and converted dynamic_axes into dynamic shapes. --------- Co-authored-by: xadupre <xadupre@microsoft.com> Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…icrosoft#24416) ### Description Adds session config option (`"session.disable_model_compile"`) that disables model compilation during session initialization. If this option is set to "1", inference session creation will fail with error code ORT_MODEL_REQUIRES_COMPILATION if compilation is required to run the model on any Execution Provider added to the session. Only the following kinds of models are valid when this option is set to "1": - Pre-compiled models that have EPContext nodes for the compiling Execution Providers in the session. - Non-compiled models that run only on non-compiling Execution Providers, like CPU EP. ### Example usage The following example (taken from a unit test) tries to load a model that requires compilation with a session that disables compilation. The session creation fails with error code `ORT_MODEL_REQUIRES_COMPILATION`. Then, the example compiles the model and loads the compiled model successfully. ```C++ // Taken from a unit test ... // Initialize session options with QNN EP Ort::SessionOptions session_options; ProviderOptions provider_options; provider_options["backend_type"] = "htp"; provider_options["offload_graph_io_quantization"] = "0"; session_options.AppendExecutionProvider("QNN", provider_options); session_options.AddConfigEntry(kOrtSessionOptionsDisableEpCompile, "1"); // Disable model compilation! // Create an inference session that fails with error ORT_MODEL_REQUIRES_COMPILATION try { Ort::Session session(*ort_env, input_model_file, session_options); FAIL() << "Expected Session creation to fail but it succeeded"; // Should not get here! } catch (const Ort::Exception& excpt) { OrtErrorCode error_code = excpt.GetOrtErrorCode(); std::string_view error_msg = excpt.what(); ASSERT_EQ(error_code, ORT_MODEL_REQUIRES_COMPILATION); ASSERT_THAT(error_msg, testing::HasSubstr(kQnnExecutionProvider)); } // Session creation failed because the model was not pre-compiled. // Try to compile it now. 
// Create model compilation options from the session options. Ort::ModelCompilationOptions compile_options(*ort_env, session_options); compile_options.SetInputModelPath(input_model_file); compile_options.SetOutputModelPath(output_model_file); // Compile the model. Ort::Status status = Ort::CompileModel(*ort_env, compile_options); ASSERT_TRUE(status.IsOK()) << status.GetErrorMessage(); // Should be able to create a session with the compiled model and the original session options. Ort::Session session(*ort_env, output_model_file, session_options); ``` ### Motivation and Context Compiling models can take a very long time. Want to have a session option that requires input models that do not need to be compiled.
…microsoft#24463) ### Description Re-enables (and fixes) generation of compiled EpContext models with **both** input and output models stored in buffers. ### Motivation and Context Previous PR microsoft#24176 inadvertently added a check that disabled storing both input and output models in buffers. However, we need this functionality. This was actually a fortunate scenario, as it led to the discovery of a bug.
…oft#24472) ### Description * Rename filename and class name since it supports 4 and 8 bits. * Update HQQWeightOnlyQuantizer to support 8 bits. * Update some comments. ### Motivation and Context microsoft#24384 added 8 bits support for the default weight only quantizer.
…icrosoft#24474) ### Description <!-- Describe your changes. --> Use a pimpl-esque approach so that the winml OrtModel type doesn't conflict with the model editing API OrtModel. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix crash due to linker calling the incorrect destructor when there are two different OrtModel types in the global namespace.
…h to int32 (microsoft#24425) Some WebNN backends support limited data types for the input and output of a WebNN graph. However, they can support more data types for intermediate nodes. To address this limitation, we implement a data type fallback mechanism. (Note: currently, we only support fallback to int32 for certain integer data types.) If a data type is not supported for a graph's input or output but is supported for intermediate nodes, we will:
1. Save the input MLTensor as 'int32' data type,
2. Convert the input data from ORT to int32,
3. Insert a cast operation into the WebNN graph to convert the input back to its original data type,
4. Insert a cast operation into the WebNN graph to convert the output to 'int32',
5. Convert the output data from int32 to its original data type.
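The steps above can be simulated end to end for a uint8 graph input. This is a pure-Python stand-in (no MLTensor or WebNN ops); the `& 0xFF` masks play the role of the inserted int32-to-uint8 casts.

```python
def run_with_int32_fallback(graph_fn, input_u8):
    # Step 2: convert the input data from ORT's uint8 to int32.
    as_i32 = [int(v) for v in input_u8]
    # Step 3: inserted cast converts int32 back to uint8 inside the graph.
    as_u8 = [v & 0xFF for v in as_i32]
    out_u8 = graph_fn(as_u8)           # intermediate ops run in uint8
    # Step 4: inserted cast converts the uint8 output to int32.
    out_i32 = [int(v) for v in out_u8]
    # Step 5: convert the int32 output data back to uint8 on the ORT side.
    return [v & 0xFF for v in out_i32]

# A toy "graph" that increments each element with uint8 wraparound.
print(run_with_int32_fallback(lambda xs: [(x + 1) & 0xFF for x in xs], [254, 255]))
```

The round trip is lossless for values that fit the original type, which is why int32 works as a carrier for the narrower integer types.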
### Description <!-- Describe your changes. --> Add infrastructure to enable auto EP selection. Device discovery for CPU/GPU/NPU on Windows. Supports internal (CPU/DML/WebGPU) and provider bridge (CUDA) EPs currently. Infrastructure will be used with plugin EPs next. Selection policy implementation will be added next, so in the interim there's a temporary function with manually specified selection so unit tests can cover the end-to-end. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
### Description <!-- Describe your changes. --> Fix some issues. Use adapter number instead of bus number. Bus number doesn't work as expected on VMs. Disable for XBOX build. Needs different handling for adapter lookup. Use adapter number as device_id when creating DML OrtEpDevice. Fix some issues with the metadata. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
…#24411) If LD_LIBRARY_PATH is not defined, a blank "-L" is added to the link command. This causes the next object on the link command to be treated as if it were a search path, which causes a link failure. Signed-off-by: Andrew Davis <afd@ti.com> Signed-off-by: Clément Péron <peron.clem@gmail.com> Co-authored-by: Andrew Davis <afd@ti.com>
### Description <!-- Describe your changes. --> For QNN-EP, build FP-to-Bool Cast into NotEqual. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> HTP currently does not support FP-to-Bool Cast due to some limitations. To unblock CLIP models, replace such Cast with NotEqual to achieve the same functionality. Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
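The substitution works because the two ops agree on every float input: Cast(float to bool) maps nonzero to true and zero to false, which is exactly NotEqual(x, 0). A quick check of that equivalence (the semantics stated in the comment are my reading of the ONNX Cast/NotEqual specs, shown here in plain Python rather than QNN ops):

```python
def cast_float_to_bool(x):
    # Cast(float -> bool): nonzero (including NaN) -> True, +/-0.0 -> False.
    # NotEqual(x, 0) yields the same result for every float.
    return x != 0.0

vals = [0.0, -0.0, 1.5, -2.0, float("nan")]
print([cast_float_to_bool(v) for v in vals])
```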
### Description <!-- Describe your changes. --> Add script to make it easier to manually trigger the workflows. Similar to run_CIs_for_branch.py. Uses `gh` command line tool. A workflow can be triggered if it has `workflow_dispatch` enabled. Currently only these workflows can be triggered: ``` Android CI iOS_CI_on_Mac Linux OpenVINO CI MacOS CI Pipeline Update C/C++ API Docs Update C# API Docs Publish site Update Java API Docs Update JS API Docs Update Objective-C API Docs Update Python API Docs Web CI Pipeline ONNX Runtime CUDA Builds ONNX Runtime DirectML Builds Windows OpenVINO CI Pipeline Windows GPU TensorRT CI Pipeline ONNX Runtime WebGPU Builds ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
### Description Excludes QnnGpu.dll from Windows x64 NuGet package because it is not available for that architecture. ### Motivation and Context Fix failure in QNN packaging pipeline: ```shell CreateNativePackage: Generating nuspec for the native Microsoft.ML.OnnxRuntime.QNN nuget package... python ..\tools\nuget\generate_nuspec_for_native_nuget.py --package_version 1.22.0-dev-20250421-0439-2abab8d --package_name Microsoft.ML.OnnxRuntime.QNN --target_architecture x64 --build_config RelWithDebInfo --native_build_path D:\a\_work\1\b\RelWithDebInfo\RelWithDebInfo --packages_path D:\a\_work\1\b\packages --ort_build_path D:\a\_work\1\b --sources_path D:\a\_work\1\s --commit_id 2abab8d --is_release_build False --execution_provider None --nuspec_name NativeNuget.nuspec 1 file(s) copied. 1 file(s) copied. nuspec_name: NativeNuget.nuspec Bundling native shared library artifacts into Microsoft.ML.OnnxRuntime nuget package... nuget pack NativeNuget.nuspec Attempting to build package from 'NativeNuget.nuspec'. ##[error]EXEC(0,0): Error NU5019: File not found: 'D:\a\_work\1\b\RelWithDebInfo\RelWithDebInfo\QnnGpu.dll'. EXEC : error NU5019: File not found: 'D:\a\_work\1\b\RelWithDebInfo\RelWithDebInfo\QnnGpu.dll'. [D:\a\_work\1\s\csharp\OnnxRuntime.CSharp.proj] ##[error]csharp\OnnxRuntime.CSharp.proj(109,5): Error MSB3073: The command "nuget pack NativeNuget.nuspec" exited with code 1. ``` Introduced by this PR: microsoft#24435
…#24493) ### Description Removes unnecessary std::move on an r-value expression. This caused a compiler warning/error in the Linux Android QNN pipeline. ### Motivation and Context Introduced by PR: microsoft#24466
`onnx.mapping` was deprecated and is being removed. This PR removes the deprecated usage. @MaanavD it would be good if this can make it into 1.22.0 for forward ONNX release (1.19+) compatibility.
Increases operator coverage for WebGPU EP.
…on Maven) (microsoft#24494) ### Description Updates the Android QNN package to use QNN SDK 2.33.0, which is available on Maven. QNN SDK 2.33.2 is not available yet on Maven: https://mvnrepository.com/artifact/com.qualcomm.qti/qnn-runtime ### Motivation and Context Previous PR that updated QNN SDK version: microsoft#24440
### Description <!-- Describe your changes. --> It looks like the test case `test_convtranspose_autopad_same` in ONNX opset 14 (v1.9) was generated incorrectly: onnx/onnx#3440 The problem is already fixed in opset 15+ (v1.10). So, this PR fixes 2 problems: - disable the test case for opset 14 in onnxruntime-web tests (still run the same test for opset 15 and opset 17) - re-enable the test case by removing it from "current_failing_tests" in onnxruntime\test\testdata\onnx_backend_test_series_filters.jsonc (which should have been done when upgrading to onnx v1.10) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
… them manually. (microsoft#24505) ### Description <!-- Describe your changes. --> Add `workflow_dispatch` event to GitHub Actions workflows so we can run them manually. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Make it easier to run workflows during development.
…cpkg (microsoft#24012)
1. Patch ONNX to support minimal build.
2. Improve ort-web's vcpkg build scripts.

The generate_vcpkg_triplets_for_emscripten function in tools\python\util\vcpkg_helpers.py didn't process the enable_rtti condition, so when it is true we should add -fno-rtti to cxxflags.

Fix an issue related to DISABLE_EXCEPTION_CATCHING. Make it clear that there are three modes:

1. No EH (-fno-exceptions, -sDISABLE_EXCEPTION_CATCHING=1): set enable_minimal_onnx_build=True, enable_wasm_exception_catching=False
2. Full EH (-fexceptions, -sDISABLE_EXCEPTION_CATCHING=0): set enable_minimal_onnx_build=False, enable_wasm_exception_catching=True
3. Throw-only EH (-fexceptions, -sDISABLE_EXCEPTION_CATCHING=1): set enable_minimal_onnx_build=False, enable_wasm_exception_catching=False

Debug builds should only use the second mode. In release builds, by default emscripten disables catching C++ exceptions (specifically, emitting catch blocks). That's the second case. In a normal release build (what we ship):

- Usually enable_wasm_api_exception_catching is set to true
- So disable_wasm_exception_catching is also true
- So onnxruntime_ENABLE_WEBASSEMBLY_EXCEPTION_CATCHING is false
- So the flag DISABLE_EXCEPTION_CATCHING should not be set, because by default it is true; we should not have "-sDISABLE_EXCEPTION_CATCHING=0"

But we do not want to rely on what the default value is, so the vcpkg_helper.py script still explicitly sets DISABLE_EXCEPTION_CATCHING to 1.

In onnxruntime_webassembly.cmake we currently have:

```cmake
if (NOT onnxruntime_ENABLE_WEBASSEMBLY_MEMORY64)
  target_link_options(onnxruntime_webassembly PRIVATE "SHELL:-s DISABLE_EXCEPTION_THROWING=0")
endif()
```

But I think we need to set DISABLE_EXCEPTION_THROWING to 1 when the build is in the first mode (No EH).

This PR also resolves microsoft#24279, because vcpkg has native support for cross-compiling; users do not need to specify a custom protoc path.
Reverts microsoft#24372 The above PR removes the `build-nuget` command-line argument from the `dml-vs-2022.yml` file. This PR reverts that change and adds `build-nuget` back to the file. The `--build_nuget` option creates the `csharp\src\Microsoft.ML.OnnxRuntime\bin\RelWithDebInfo` directory structure and stores binaries in there. There's a subsequent task in the yaml file that tries to sign DLLs in `csharp\src\Microsoft.ML.OnnxRuntime\bin\RelWithDebInfo`; however, this task fails because the directory structure is never created (due to the removal of `--build_nuget`).
…32 (microsoft#24437) Decomposed [Skip]SimplifiedLayerNormalization loses precision in FP16, so we add cast (to: fp32) ops around it in the WebNN EP to ensure its precision, rather than manually adding cast nodes in each model file.
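The precision loss is easy to reproduce: rounding each intermediate of a sum-of-squares (the reduction at the heart of SimplifiedLayerNormalization) through fp16 drifts away from the wide-precision result. This toy accumulation uses Python's `struct` 'e' format to round through IEEE half precision; it illustrates why the EP casts to fp32 around the decomposed ops, not the EP's actual code.

```python
import struct

def as_fp16(x):
    # Round-trip a float through IEEE half precision (struct format 'e').
    return struct.unpack("<e", struct.pack("<e", x))[0]

xs = [10.1] * 64
acc32 = acc16 = 0.0
for x in xs:
    acc32 += x * x                                          # wide accumulation
    acc16 = as_fp16(acc16 + as_fp16(as_fp16(x) * as_fp16(x)))  # every step in fp16
print(acc32, acc16)  # the fp16 accumulator drifts from the wide result
```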
…hat generates Microsoft.ML.OnnxRuntime.nuspec (microsoft#24131) ### Description As detailed in the issue (microsoft#24130), the Microsoft.ML.OnnxRuntime nuspec needs an additional dependency group for the `native` TFM, to allow it to be referenced via `PackageReference` in vcxproj projects. I took a peek at the code after filing the issue, since it seemed like it ought to be a simple fix. It looks like the nuspec file for the `Microsoft.ML.OnnxRuntime` package is generated by the python code in `generate_nuspec_for_native_nuget.py`, so I just added a line of code there. However I'm not sure how the build system invokes this generator so I haven't been able to test it in situ. Let me know if this isn't the right fix! ### Motivation and Context See detailed description in microsoft#24130
…4510) ### Description <!-- Describe your changes. --> Add `python_version < "3.13"` for `onnxscript` dependency in tools/ci_build/github/linux/python/requirements.txt. `onnxscript` has `onnx` as a dependency. Building the `onnx` wheel fails with Python 3.13. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix pipeline build failures.
### Description Fix ConvTranspose indexing errors and other minor changes. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Necessary to get Kokoro working in WebGPU (except Conv).
### Description
<!-- Describe your changes. -->
Adds support for MultiHeadAttention via WebNN matmul, transpose,
reshape, and other operations that follow the logic in the MHA subgraph
below
```
Abbreviations: B is batch_size, S is sequence_length, W is hidden_size, P is past_sequence_length
N is number of attention heads, H is head size, and W=N*H, h=Sqrt(H)
Notes: If the datatype of the inputs (qkv and past kv) is float16, we cast them to float32 to ensure data precision.
query key value
| | |
q_Reshape k_Reshape v_Reshape (shape=B,S,H,N)
| | |
q_Transpose k_Transpose v_Transpose (perm=0,2,1,3)
\ / |
\ / |
present_key<---\----Concat <---------|----past_key
| | |
| opt_k_transpose |
\ (0,1,3,2) |
\ / | past_value
qk_MatMul | /
| scale | /
| / | /
qk_Div Concat------> present_value
| |
| /
Add <----------/---------------attention_bias
| /
Softmax /
\ /
\ /
qkv_MatMul
|
Transpose (perm=0,2,1,3)
|
Reshape---(shape=B,P,W)
|
output
```
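The subgraph above can be followed as pure shape bookkeeping. The sketch below uses the conventional (B, S, N, H) head split and (B, S, W) output; it is an illustrative trace of the dataflow, not the EP's implementation, and the axis order of the diagram's reshape annotations may differ.

```python
def mha_shapes(B, S, N, H, P=0):
    # Trace tensor shapes through the MHA subgraph; W = N*H.
    W = N * H
    return {
        "query": (B, S, W),
        "q_Reshape": (B, S, N, H),            # split hidden size into heads
        "q_Transpose": (B, N, S, H),          # perm=0,2,1,3
        "key_concat": (B, N, P + S, H),       # Concat with past_key
        "opt_k_transpose": (B, N, H, P + S),  # perm=0,1,3,2
        "qk_MatMul": (B, N, S, P + S),        # attention scores
        "qkv_MatMul": (B, N, S, H),           # scores @ value
        "output": (B, S, W),                  # Transpose + final Reshape
    }

print(mha_shapes(B=2, S=4, N=8, H=16, P=4)["qk_MatMul"])
```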
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description The [1.22 release branch](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0) has been cut, so we need to update the version in main from 1.22.0 to 1.23.0.
New EP - currently based on the existing TensorRT EP but meant to be used on RTX GPUs with a lean version of TensorRT. ### Description Adding a new EP based on the TensorRT EP. This is going to use a special version of TensorRT optimized for RTX GPUs. In the future we plan to make changes to the EP to streamline it further (e.g., get rid of the dependency on the CUDA EP completely). ### Motivation and Context The new TensorRT for RTX is going to have: 1. Much smaller footprint 2. Much faster model compile/load times 3. Better usability in terms of use of cached models across multiple RTX GPUs. This effort is also targeting WCR ML workflows. --------- Co-authored-by: Maximilian Müller <maximilianm@nvidia.com> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: iraut <iraut@nvidia.com> Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com> Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
ankitm3k
approved these changes
Apr 24, 2025