[ARM CPU] SVE support for Elementwise kernels #25238
hariharans29 merged 21 commits into microsoft:main
Conversation
@microsoft-github-policy-service agree company="Fujitsu Research of India Private Ltd"

Hi @edgchen1, @hariharans29, @snnn Kindly enable the CI pipeline.

Commenter does not have sufficient privileges for PR 25238 in repo microsoft/onnxruntime

Hi @yufenglee Please trigger the CI pipeline. Thanks
Pull Request Overview
This PR adds ARM SVE (Scalable Vector Extension) support for elementwise kernels to enhance ONNX Runtime performance on ARM platforms. The implementation includes SVE-optimized versions of the error function, logistic function, and softmax operations that can leverage variable-length vector processing.
- Creates a new SVE intrinsics wrapper header (`mlasi_sve.h`) with ARM SVE-specific function implementations
- Ports five core mathematical kernels to ARM SVE with runtime CPU feature detection and fallback
- Integrates SVE support into the build system with compiler feature detection and appropriate compilation flags
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `onnxruntime/core/mlas/lib/mlasi_sve.h` | New SVE intrinsics wrapper providing ARM SVE equivalents of existing SIMD operations |
| `onnxruntime/core/mlas/lib/mlasi.h` | Adds SVE intrinsics includes and CPU feature detection support |
| `onnxruntime/core/mlas/lib/logistic.cpp` | Implements SVE-optimized logistic kernel with runtime dispatch |
| `onnxruntime/core/mlas/lib/erf.cpp` | Implements SVE-optimized error function kernel with runtime dispatch |
| `onnxruntime/core/mlas/lib/compute.cpp` | Implements SVE-optimized softmax and exponential kernels with runtime dispatch |
| `onnxruntime/core/common/cpuid_info.h` | Adds `HasArmSVE()` method declaration for CPU feature detection |
| `onnxruntime/core/common/cpuid_info.cc` | Implements ARM SVE feature detection across Linux and Windows platforms |
| `cmake/onnxruntime_mlas.cmake` | Adds SVE compiler support detection and sets appropriate compilation flags |
Comments suppressed due to low confidence (1)
`onnxruntime/core/mlas/lib/mlasi_sve.h:510`
- This line duplicates the previous operation with the same coefficient (`poly_56`). The comment suggests uncertainty about this duplication. Verify whether this is correct or whether it should use a different coefficient such as `poly_6`.
`MlasSveReduceAddFloat32(MLAS_SVBOOL Pred, MLAS_SVFLOAT32 Vector)`
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
The ARM Linux CI is failing on some tests.
I'm currently working on fixing the errors. One of them is caused by an accuracy issue in the SVE implementation of the exp() function, and the other originates from the sigmoid function. The issue in the sigmoid function was also present in the NEON implementation prior to the last commit, where it was resolved by applying a clamp. I'm replicating the same fix in the SVE version.
Force-pushed c9f2c9c to 638eb10
Force-pushed 638eb10 to 5d77b73
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
There may be some pending comments to be resolved; they have "pings" on them. Could you please address them? Also, can you physically resolve (hit Resolve) the comments that have been addressed? This is needed to merge. Thanks.
@hariharans29 Updated and resolved the comments
Can you please fix the lint issues: https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting
@hariharans29 Sure, resolved in the latest commit
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Hi @sanketkaleoss: This PR crashes a test related to the 8-bit GEMMs. See here: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=952427&view=logs&j=d3350775-409a-5282-627a-f3b59b82cd3f&t=a42124b0-66aa-50d6-7780-383fc566e097&l=32556. Any chance you know why?
@hariharans29 I'm unable to open this link; it says I can't access the application.
Hi @sanketkaleoss - The relationship is unclear, but it definitely crashes this test. The test ARM machine SKU is Standard_D16pds_v5. More details here: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpdsv5-series?tabs=sizebasic. I think the Ampere Altra processor does not support SVE? Would you be able to try running the test on a Linux machine (since SVE support is only added for that platform in this PR) with a processor that lacks SVE, if you don't have access to an Ampere Altra? Just build onnxruntime, and run
@hariharans29 Getting "Bus error (core dumped)"
Can you please debug it? I don't see it in the pipeline with the commit before the SVE PR. I got access to a Graviton4 machine and I see the core dump there too.
Also when I build with |
Sure, thanks for the insight, working on fixing it |
### Description
The `MLAS_USE_SVE` macro was missing for some unit test/benchmark targets. In the original PR, it was scoped down to just the mlas target, and this resulted in different mlas platform struct definitions across targets.

### Motivation and Context
Fix pipeline crash and unblock the daily pipeline run. #25238 (change that introduced the issue)
### Description
Ports the `MlasErfKernel`, `MlasLogisticKernel` and `MlasComputeSoftmax` kernels to the ARM SVE backend. Specifically, the following functions have been ported:
- `MlasErfKernel` (lib/erf.cpp)
- `MlasLogisticKernel` (lib/logistic.cpp)
- `MlasComputeSumExpF32Kernel` (lib/compute.cpp)
- `MlasReduceMaximumF32Kernel` (lib/compute.cpp)
- `MlasComputeSoftmaxOutputF32Kernel` (lib/compute.cpp)
- `MlasComputeSoftmaxThreaded` (lib/compute.cpp)

This PR uses the following design structure: it adds new wrapper implementations of SVE functions in `lib/mlasi_sve.h`, similar to `mlasi.h`, and calls these wrapper functions in each kernel's implementation.

### Motivation and Context
This work is a step toward making ONNX Runtime more performant and architecture-aware on ARM platforms.

### Performance Analysis


- Observed up to 1.4x speedup at the operator level
- Performance is tested on AWS Graviton3E

This PR is a joint contribution by:
- @NishantPrabhuFujitsu
- @sanketkaleoss

Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>