[ARM CPU] SVE support for Elementwise kernels #25238

Merged
hariharans29 merged 21 commits into microsoft:main from sanketkaleoss:arm_sve_enablement
Sep 22, 2025

@sanketkaleoss
Contributor

@sanketkaleoss sanketkaleoss commented Jul 1, 2025

Description

Ports the MlasErfKernel, MlasLogisticKernel and MlasComputeSoftmax kernels to the ARM SVE backend. Specifically, the following functions have been ported:

  • MlasErfKernel (lib/erf.cpp)
  • MlasLogisticKernel (lib/logistic.cpp)
  • MlasComputeSumExpF32Kernel (lib/compute.cpp)
  • MlasReduceMaximumF32Kernel (lib/compute.cpp)
  • MlasComputeSoftmaxOutputF32Kernel (lib/compute.cpp)
  • MlasComputeSoftmaxThreaded (lib/compute.cpp)
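
The three `compute.cpp` kernels in this list line up with the passes of a numerically stable softmax: reduce the maximum, sum the shifted exponentials, then normalize. A minimal scalar sketch of that decomposition follows, assuming the usual max-subtraction formulation. The function names and signatures are hypothetical stand-ins, not the MLAS ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pass 1: find the row maximum (the role of a reduce-maximum kernel).
float ReduceMaximumSketch(const float* x, std::size_t n) {
    assert(n > 0);
    float m = x[0];
    for (std::size_t i = 1; i < n; ++i) m = std::max(m, x[i]);
    return m;
}

// Pass 2: write exp(x[i] - max) to out and return the sum of those values.
float SumExpSketch(const float* x, float* out, std::size_t n, float max_value) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(x[i] - max_value);
        sum += out[i];
    }
    return sum;
}

// Pass 3: scale every element by 1/sum so the row adds up to one.
void SoftmaxOutputSketch(float* out, std::size_t n, float sum) {
    const float scale = 1.0f / sum;
    for (std::size_t i = 0; i < n; ++i) out[i] *= scale;
}

// Drives the three passes over one row.
std::vector<float> SoftmaxSketch(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    if (x.empty()) return out;
    const float m = ReduceMaximumSketch(x.data(), x.size());
    const float sum = SumExpSketch(x.data(), out.data(), x.size(), m);
    SoftmaxOutputSketch(out.data(), out.size(), sum);
    return out;
}
```

Subtracting the maximum first keeps every exponent at or below zero, so large inputs such as 1000.0f do not overflow.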

Design: new wrapper implementations of the SVE intrinsics are added in lib/mlasi_sve.h, mirroring the existing wrappers in mlasi.h, and each kernel's SVE implementation calls these wrappers.
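
As a rough illustration of the wrapper approach, the sketch below puts thin inline wrappers around ACLE SVE intrinsics and uses them in a predicated, vector-length-agnostic reduce-maximum loop. The `SketchSve*` names are hypothetical, not the contents of `mlasi_sve.h`. The SVE path is only compiled when the compiler defines `__ARM_FEATURE_SVE`; otherwise a scalar loop is used so the block builds on any target.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// Hypothetical wrapper types and functions over the ACLE intrinsics.
typedef svfloat32_t SketchSveFloat32;
typedef svbool_t SketchSveBool;

inline SketchSveFloat32 SketchSveLoad(SketchSveBool pred, const float* p) {
    return svld1_f32(pred, p);  // inactive lanes load as zero
}

inline SketchSveFloat32 SketchSveMaximum(SketchSveBool pred, SketchSveFloat32 a,
                                         SketchSveFloat32 b) {
    return svmax_f32_m(pred, a, b);  // inactive lanes keep the value of a
}

inline float SketchSveReduceMaximum(SketchSveBool pred, SketchSveFloat32 v) {
    return svmaxv_f32(pred, v);  // horizontal maximum over active lanes
}
#endif

float ReduceMaximumWithWrappers(const float* input, std::size_t n) {
    assert(n > 0);
#if defined(__ARM_FEATURE_SVE)
    const std::uint64_t lanes = svcntw();  // 32-bit lanes per vector, known at run time
    SketchSveFloat32 acc = svdup_n_f32(input[0]);
    for (std::uint64_t i = 0; i < n; i += lanes) {
        // The predicate masks off lanes past the end, so no scalar tail loop is needed.
        const SketchSveBool pred = svwhilelt_b32_u64(i, static_cast<std::uint64_t>(n));
        acc = SketchSveMaximum(pred, acc, SketchSveLoad(pred, input + i));
    }
    return SketchSveReduceMaximum(svptrue_b32(), acc);
#else
    // Scalar fallback for builds without SVE.
    float m = input[0];
    for (std::size_t i = 1; i < n; ++i) m = std::max(m, input[i]);
    return m;
#endif
}
```

Keeping the intrinsics behind wrappers lets kernel code read the same way across SIMD backends, which appears to be the intent of mirroring `mlasi.h`.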

Motivation and Context

This work is a step toward making ONNX Runtime more performant and architecture-aware on ARM platforms.

Performance Analysis

image

  • Observed up to 1.4x speedup at the operator level
  • Performance was tested on AWS Graviton3E

This PR is a joint contribution by:

  • @NishantPrabhuFujitsu
  • @sanketkaleoss

@sanketkaleoss
Contributor Author

sanketkaleoss commented Jul 1, 2025

@sanketkaleoss please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company=“Fujitsu Research of India Private Ltd”

@sanketkaleoss
Contributor Author

sanketkaleoss commented Jul 4, 2025

Hi @edgchen1, @hariharans29, @snnn, kindly enable the CI pipeline.

@azure-pipelines

Commenter does not have sufficient privileges for PR 25238 in repo microsoft/onnxruntime

@sanketkaleoss
Contributor Author

Hi @yufenglee, please trigger the CI pipeline. Thanks.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds ARM SVE (Scalable Vector Extension) support for elementwise kernels to enhance ONNX Runtime performance on ARM platforms. The implementation includes SVE-optimized versions of the error function, logistic function, and softmax operations that can leverage variable-length vector processing.

  • Creates a new SVE intrinsics wrapper header (mlasi_sve.h) with ARM SVE-specific function implementations
  • Ports five core mathematical kernels to ARM SVE with runtime CPU feature detection and fallback
  • Integrates SVE support into the build system with compiler feature detection and appropriate compilation flags
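
A minimal sketch of what runtime feature detection with fallback can look like, assuming Linux on AArch64 and the `getauxval` hardware-capability query. The function names are hypothetical, and the "SVE" kernel is only a placeholder that forwards to the portable one so the block compiles on any target. The detection code in `cpuid_info.cc` may work differently, and its Windows path is not modelled here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)  // bit assigned to SVE in the Linux arm64 hwcap ABI
#endif
#endif

// Returns true when the running CPU and kernel report SVE support.
bool CpuHasSveSketch() {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    return false;  // assumption: treat every other platform as "no SVE"
#endif
}

using LogisticKernelFn = void (*)(const float*, float*, std::size_t);

// Portable reference kernel, always available.
void LogisticPortable(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = 1.0f / (1.0f + std::exp(-in[i]));
}

// Placeholder for an SVE build of the kernel (hypothetical).
void LogisticSvePlaceholder(const float* in, float* out, std::size_t n) {
    LogisticPortable(in, out, n);
}

// Picks the kernel from the detected feature flag.
LogisticKernelFn SelectLogisticKernel(bool has_sve) {
    return has_sve ? &LogisticSvePlaceholder : &LogisticPortable;
}

// Resolves the kernel once, then dispatches through the function pointer.
void RunLogistic(const float* in, float* out, std::size_t n) {
    static const LogisticKernelFn kernel = SelectLogisticKernel(CpuHasSveSketch());
    assert(kernel != nullptr);
    kernel(in, out, n);
}
```

Resolving the pointer once keeps the feature check out of the hot path, and a binary built with SVE kernels still runs on cores without SVE because those kernels are never called there.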

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/mlas/lib/mlasi_sve.h | New SVE intrinsics wrapper providing ARM SVE equivalents of existing SIMD operations |
| onnxruntime/core/mlas/lib/mlasi.h | Adds SVE intrinsics includes and CPU feature detection support |
| onnxruntime/core/mlas/lib/logistic.cpp | Implements SVE-optimized logistic kernel with runtime dispatch |
| onnxruntime/core/mlas/lib/erf.cpp | Implements SVE-optimized error function kernel with runtime dispatch |
| onnxruntime/core/mlas/lib/compute.cpp | Implements SVE-optimized softmax and exponential kernels with runtime dispatch |
| onnxruntime/core/common/cpuid_info.h | Adds HasArmSVE() method declaration for CPU feature detection |
| onnxruntime/core/common/cpuid_info.cc | Implements ARM SVE feature detection across Linux and Windows platforms |
| cmake/onnxruntime_mlas.cmake | Adds SVE compiler support detection and sets appropriate compilation flags |

Comments suppressed due to low confidence (1)

onnxruntime/core/mlas/lib/mlasi_sve.h:510

  • This line duplicates the previous operation with the same coefficient (poly_56). The comment suggests uncertainty about this duplication. Verify if this is correct or if it should use a different coefficient like poly_6.
MlasSveReduceAddFloat32(MLAS_SVBOOL Pred, MLAS_SVFLOAT32 Vector)

@hariharans29
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

The ARM Linux CI is failing on some tests

@sanketkaleoss
Contributor Author

I’m currently working on fixing the errors. One of them is caused by the accuracy issue in the SVE implementation of the exp() function, and the other originates from the sigmoid function.

The issue in the sigmoid function was also present in the NEON implementation prior to the last commit, but it was resolved by applying a Clamp. I’m replicating the same fix in the SVE version.
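
The thread does not show the fix itself. As a sketch of the general idea only: an approximated sigmoid can drift outside the valid [0, 1] range, and a clamp pins it back. The linear approximation below is a deliberately crude, hypothetical stand-in for whatever approximation a kernel uses, and whether the real kernels clamp the input, the output, or both is not stated here.

```cpp
#include <algorithm>
#include <cassert>

// Crude stand-in for an approximated logistic function (hypothetical).
// It leaves the [0, 1] range for |x| > 2.
float ApproxLogisticSketch(float x) {
    return 0.5f + 0.25f * x;
}

// Applies the approximation, then clamps the result into [0, 1].
float ClampedLogisticSketch(float x) {
    const float y = ApproxLogisticSketch(x);
    const float clamped = std::min(1.0f, std::max(0.0f, y));
    assert(clamped >= 0.0f && clamped <= 1.0f);
    return clamped;
}
```

In a vector kernel the same step would be a lane-wise minimum and maximum applied before the store.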

@hariharans29
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29 hariharans29 requested a review from edgchen1 August 11, 2025 23:12
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

hariharans29 commented Sep 18, 2025

There may be some pending comments to be resolved; they have "pings" on them. Could you please address them? Also, can you mark the addressed comments as resolved (hit Resolve)? That is needed to merge. Thanks.

@sanketkaleoss
Contributor Author

sanketkaleoss commented Sep 19, 2025

@hariharans29 Updated and resolved the comments

edgchen1
edgchen1 previously approved these changes Sep 19, 2025
@hariharans29
Member

@sanketkaleoss
Contributor Author

sanketkaleoss commented Sep 20, 2025

Can you please fix the lint issues: https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting

@hariharans29 Sure, resolved in the latest commit

@hariharans29
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 merged commit 79a2d25 into microsoft:main Sep 22, 2025
90 checks passed
@hariharans29
Member

@sanketkaleoss
Contributor Author

sanketkaleoss commented Sep 26, 2025

Hi @sanketkaleoss: This PR crashes a test related to the 8-bit GEMMs. See here: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=952427&view=logs&j=d3350775-409a-5282-627a-f3b59b82cd3f&t=a42124b0-66aa-50d6-7780-383fc566e097&l=32556.

Any chance you know why?

@hariharans29 I'm unable to open this link; it says I can't access the application.
In any case, this PR doesn't modify any 8-bit ops; it only covers FP32 elementwise ops. I'm not sure why it would crash the 8-bit GEMM test. Can you provide more info on this?

@hariharans29
Member

hariharans29 commented Sep 26, 2025

Hi @sanketkaleoss - The relationship is unclear, but it definitely crashes this test:

count += MlasDirectShortExecuteTests<MlasSQ8BitQuantAKernelTest>::RegisterShortExecute();

The test ARM machine SKU is Standard_D16pds_v5. More details here: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpdsv5-series?tabs=sizebasic. I think the Ampere Altra processor does not support SVE?

If you don't have access to an Ampere Altra, would you be able to run the test on a Linux machine (SVE support in this PR is only added for that platform) whose processor does not support SVE?

Just build onnxruntime and run ./onnxruntime_mlas_test.exe --gtest_filter=SQ8BitQuantA.ShortExecute

@sanketkaleoss
Contributor Author

@hariharans29 Getting "Bus error (core dumped)" while running that test on both AWS Graviton3E (SVE machine) and Graviton2 (non-SVE machine).

@hariharans29
Member

hariharans29 commented Sep 26, 2025

Can you please debug it? I don't see it in the pipeline with the commit before the SVE PR. I got access to a Graviton4 machine and I see the core dump there too.

@hariharans29
Member

Also, when I build with --no_sve, the test passes.

@sanketkaleoss
Contributor Author

sanketkaleoss commented Sep 29, 2025

Sure, thanks for the insight, working on fixing it

hariharans29 added a commit that referenced this pull request Sep 30, 2025
### Description
The `MLAS_USE_SVE` macro was missing for some unittests/benchmark
targets. In the original PR, it was scoped down to just the mlas target
and this resulted in different mlas platform struct definitions across
targets.

### Motivation and Context
Fix pipeline crash and unblock daily pipeline run
#25238 (change that introduced the issue)
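
A small sketch of why a macro that is defined for only some targets can crash seemingly unrelated code. The two structs below model the same platform struct as seen by a translation unit compiled with the macro and by one compiled without it. The member names are hypothetical, chosen only to show how a conditionally compiled field shifts everything declared after it.

```cpp
#include <cassert>
#include <cstddef>

using KernelFn = void (*)();

// Layout seen by a target compiled WITH the SVE macro defined.
struct PlatformSeenWithMacro {
    KernelFn GemmKernel;
    KernelFn ErfKernelSve;  // stands in for a member guarded by the macro
    KernelFn QuantAKernel;
};

// Layout seen by a target compiled WITHOUT the macro: the guarded member is gone.
struct PlatformSeenWithoutMacro {
    KernelFn GemmKernel;
    KernelFn QuantAKernel;
};

// Returns how far apart the two views place the same later member.
std::size_t QuantKernelOffsetGap() {
    const std::size_t with_macro = offsetof(PlatformSeenWithMacro, QuantAKernel);
    const std::size_t without_macro = offsetof(PlatformSeenWithoutMacro, QuantAKernel);
    assert(with_macro >= without_macro);
    return with_macro - without_macro;
}
```

If the library fills in the struct using one layout and a test binary reads it using the other, the test picks up the wrong pointer for every member after the guarded one. That would be consistent with a crash surfacing in an 8-bit test even though the original PR only touched FP32 kernels.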
fs-eire pushed a commit that referenced this pull request Oct 24, 2025
### Description
Ports the `MlasErfKernel`, `MlasLogisticKernel` and `MlasComputeSoftmax`
kernels to the ARM SVE backend. Specifically, the following functions
have been ported.
- `MlasErfKernel` (lib/erf.cpp)
- `MlasLogisticKernel` (lib/logistic.cpp)
- `MlasComputeSumExpF32Kernel` (lib/compute.cpp)
- `MlasReduceMaximumF32Kernel` (lib/compute.cpp)
- `MlasComputeSoftmaxOutputF32Kernel` (lib/compute.cpp)
- `MlasComputeSoftmaxThreaded` (lib/compute.cpp)

This PR uses the following design structure: adds new wrapper
implementations of SVE functions in `lib/mlasi_sve.h` similar to
`mlasi.h` and calls these wrapper functions in each kernel's
implementation.

### Motivation and Context
This work is a step toward making ONNX Runtime more performant and
architecture-aware on ARM platforms.

### Performance Analysis

![image](https://github.com/user-attachments/assets/34120c33-0ead-4a03-9d84-e74b1dc61856)
- Observed up to 1.4x speedup at the operator level
- Performance was tested on AWS Graviton3E

This PR is a joint contribution by:
- @NishantPrabhuFujitsu
- @sanketkaleoss

---------

Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fs-eire pushed a commit that referenced this pull request Oct 24, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025