
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16#17031

Merged
snnn merged 6 commits into microsoft:main from snadampal:sbgemm_aarch64
Jan 22, 2024
Conversation

@snadampal
Contributor

@snadampal snadampal commented Aug 7, 2023

Description

This PR adds an SbgemmKernel for aarch64. It includes the Sbgemm kernel, which implements matrix multiplication with bfloat16 SIMD instructions (bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To enable the Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.

Motivation and Context

This is to improve MatMul performance on the aarch64 platform.
I ran the benchmarking script below (bert, roberta and gpt2 model inference) on an AWS Graviton3-based c7g.4xl instance and observed a 1.2x to 1.76x performance improvement compared to the sgemm (fp32) kernel.

cd onnxruntime/python/tools/transformers
python3 benchmark.py

The unit test precision results match the sgemm kernel results. Build command:
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync
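To get some intuition for why the bf16 fastmath results stay close to the fp32 sgemm results, here is a small NumPy sketch (not part of the PR) that models bfloat16 by truncating fp32 mantissas; this is a simplification, since the hardware conversion may round rather than truncate:

```python
import numpy as np

def to_bfloat16(x):
    # bfloat16 keeps fp32's 8-bit exponent but only 7 mantissa bits;
    # model it here by zeroing the low 16 bits of each fp32 value.
    return (x.astype(np.float32).view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

exact = a @ b                                 # fp32 "sgemm" reference
approx = to_bfloat16(a) @ to_bfloat16(b)      # bf16-input matmul model
rel_err = np.abs(exact - approx).max() / np.abs(exact).max()
print(f"max relative error: {rel_err:.4f}")
```

For typical activations and weights the relative error stays at the percent level or below, which is why the unit tests can compare against the sgemm path with loose tolerances.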

@snadampal
Contributor Author

I'd appreciate it if someone could review this PR.

@snadampal
Contributor Author

Hi @snnn, would you be able to review and provide feedback on this PR? I appreciate your time.

@snadampal
Contributor Author

Hi, I have rebased the PR to resolve the merge conflicts. I'm happy to address any feedback you may have. Thank you!

@milpuz01
Contributor

I have checked out the changes and run performance and accuracy tests, with and without the flag, using onnxruntime_perf_test (with the binary modified to dump outputs for comparison) on AWS Graviton3 instances, and everything looked fine.

@snadampal force-pushed the sbgemm_aarch64 branch 4 times, most recently from eb257ff to 83a6f6e on October 4, 2023 19:29
@snadampal
Contributor Author

Hi @chenfucn, @yufenglee, I have updated the PR (1) to move to the newer gemm interface and (2) to add session-option-based fastmath mode control. Please review and let me know your feedback.

@snadampal
Contributor Author

Hi @chenfucn, @yufenglee, I'd appreciate it if someone could trigger the CI for this PR. I have addressed all the feedback except the Windows testing, for which I'm waiting on the Windows CI results. Thank you!

Contributor

@chenfucn chenfucn left a comment


As we discussed, please add mlas unit tests that call the kernel directly with different shapes and other parameters.

@chenfucn
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline

@chenfucn
Contributor

/azp run ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@snadampal
Contributor Author

Thanks for the review, I will update the PR to address this and also add unit tests.

@snadampal
Contributor Author

snadampal commented Oct 25, 2023

I have updated the PR to address all the feedback so far, along with learnings from my other qgemm PR:
(1) added the feature for non-Apple targets only
(2) added mlas unit tests
(3) tested the Linux full build (both Release and RelWithDebInfo)
(4) tested the minimal build
(5) tested the Android build with cross-compilation on x86
(6) ran lintrunner and git-clang-format

Next, I will add ort optimizer and provider tests to exercise the fastmath session. Please review and let me know if you have any feedback on this version.

@snadampal
Contributor Author

Thank you, I see your point. bf16 and fp16 are the potential fastmath options, but on aarch64 I have so far seen interest in bf16 fastmath alone. I agree there are unlikely to be many of these across platforms, so I will go ahead with a simple config key:

static const char* const kOrtSessionOptionsMlasGemmFastMathArm64Bfloat16 = "mlas.enable_gemm_fastmath_arm64_bfloat16";

Added the SbgemmKernel assembly implementation with bfmmla instructions, plus sbgemm utility functions to prepack Matrix B along with conversion to bfloat16. The sbgemm kernel is invoked when fastmath mode is enabled and the hardware supports the bf16 instruction set. It is disabled by default; set the following session option to 1 to enable it:
"kOrtSessionOptionsMlasGemmFastMathArm64Bfloat16"
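The B-matrix prepacking described above can be sketched roughly as follows. This is a minimal NumPy model, not the actual MLAS code: the panel width of 4 and truncation-based fp32-to-bf16 conversion are assumptions for illustration; the real kernel's layout and rounding may differ.

```python
import numpy as np

def to_bf16_bits(x):
    # fp32 -> bf16 storage by dropping the low 16 mantissa bits
    # (a simple model; the real conversion may round to nearest).
    return (x.astype(np.float32).view(np.uint32) >> np.uint32(16)).astype(np.uint16)

def prepack_b(b, panel_width=4):
    """Pad N up to a multiple of the panel width, split B into column
    panels, convert to bf16, and lay the panels out contiguously so the
    kernel can stream them with unit stride."""
    k, n = b.shape
    pad = (-n) % panel_width
    b = np.pad(b, ((0, 0), (0, pad)))
    panels = b.reshape(k, -1, panel_width).transpose(1, 0, 2)  # (npanels, K, pw)
    return to_bf16_bits(np.ascontiguousarray(panels)).ravel()

packed = prepack_b(np.ones((8, 10), dtype=np.float32))
print(packed.size)  # N is padded from 10 to 12, so 8 * 12 values
```

Doing this conversion and reordering once at prepack time keeps it off the hot matmul path, which is the usual motivation for GEMM prepacking.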
@snadampal
Contributor Author

Updated the PR for the session option name and the other points discussed so far, including clang-formatting. Tested:

  1. release, debug and minimal builds on aarch64 Neoverse V1 and N1 platforms
  2. Android build, and Linux cross-compilation for the aarch64 config on an x86 platform

@chenfucn
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline

@chenfucn
Contributor

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@snnn snnn merged commit 77da2ef into microsoft:main Jan 22, 2024
@snadampal
Contributor Author

Thanks to @chenfucn , @snnn , @skottmckay and @yufenglee for the great feedback and merging the PR!

YUNQIUGUO pushed a commit that referenced this pull request Jan 23, 2024
…oat16 (#17031)

@snnn
Contributor

snnn commented Jan 24, 2024

@snadampal, thanks for making ONNX Runtime better. You're welcome to bring more changes to us. You have my email; do not hesitate to contact me anytime you need help getting PRs reviewed.

@maajidkhann

Hello @snadampal. This is a great reference PR for any SIMD-based contributions to ARM.
Could you help me understand how the file onnxruntime/core/mlas/lib/aarch64/SbgemmKernelNeon.S is generated?
https://github.com/microsoft/onnxruntime/pull/17031/files#diff-6458cefb29cdb4ba0a976ca7ba93e0f3738f6b02e8d6063a51378c4fecfba7c4

My understanding is that we can add the SIMD intrinsics in the .cpp file (onnxruntime/core/mlas/lib/sbgemm_kernel_neon.cpp) https://github.com/microsoft/onnxruntime/pull/17031/files#diff-a6732e6798dee7a36040e9c388882279bbb70f1e53372a0a751b149429346118, like the NEON code you added there, and then use the gcc/clang compiler to generate the .S or .asm file?

@snadampal
Contributor Author

Hi @maajidkhann, an intrinsics-based approach would be the best one because it scales well to new architectures, but for this PR I hand-wrote the assembly, since the goal was to extract the best performance.

@snnn
Contributor

snnn commented Sep 5, 2025

This PR has been cherry-picked into the rel-1.17.0 branch in PR #19243. Removing the release:1.17.0 label.
