MlasTranspose multi-threading support #24261
Conversation

/azp run Big Models, Linux CPU Minimal Build E2E CI Pipeline, Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 7 pipeline(s).

@microsoft-github-policy-service agree company="Fujitsu Ltd."
##[error]D:\a_work\onnxruntime\onnxruntime\onnxruntime\core\mlas\lib\transpose.cpp(986,5): error C2664: 'void MlasExecuteThreaded(MLAS_THREADED_ROUTINE (__cdecl *),void *,ptrdiff_t,MLAS_THREADPOOL *)': cannot convert argument 1 from 'void (__stdcall *)(void *,ptrdiff_t)' to 'MLAS_THREADED_ROUTINE (__cdecl *)' [D:\a_work\onnxruntime\onnxruntime\build\RelWithDebInfo\onnxruntime_mlas.vcxproj]
Warning (CLANGFORMAT) format
OK, I'll apply the patch and push again. |

@amarin16 please test this out and review. Thanks!

The code changes look good to me. Waiting for the pipelines to pass.

@msy-kato Could you please provide some details about how you ran the performance tests? Did you use onnxruntime-genai?
Thanks for the review.
Sure! I converted the HF model with the following script (`convert.py`):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
input_texts = [' '.join(['Hello']) * 32] * 2
inputs = dict(tokenizer(input_texts, return_tensors="pt"))
torch.onnx.export(
    model,
    inputs,
    "model.onnx",
    input_names=list(inputs.keys()),
    output_names=['last_hidden_state', 'pooler_output'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'max_input_length'},
        'attention_mask': {0: 'batch_size', 1: 'max_input_length'},
    }
)
```

and measured performance with `run.py`:

```python
import onnxruntime
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
input_texts = [' '.join(['Hello']) * 510] * 4
options = onnxruntime.SessionOptions()
options.inter_op_num_threads = 1
options.intra_op_num_threads = 16
ort_session = onnxruntime.InferenceSession("model.onnx", sess_options=options)
batch_dict = dict(tokenizer(input_texts, max_length=512, return_tensors="pt"))
batch_dict = {name: tensor.numpy() for name, tensor in batch_dict.items()}

# warmup
_ = ort_session.run(['last_hidden_state'], batch_dict)

start_time = time.time()
for i in range(10):
    _ = ort_session.run(['last_hidden_state'], batch_dict)
end_time = time.time()
print('step duration(avg) = {:.7f} sec/step'.format((end_time - start_time) / 10))
```

Commands:

```shell
$ python3 convert.py
$ numactl -C 0-15 python3 run.py
```
@amarin16 Thank you for approving my PR. I noticed that the CI/CD pipeline hasn't completed yet. Could you advise if there's anything I can do?

You could try closing the PR and re-opening it.

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
Force-pushed from 1a22f09 to 09aade9

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
### Description
`MlasTranspose` previously ran single-threaded, which resulted in suboptimal performance on multi-threaded CPUs. To address this, it has been modified to utilize multi-threading.

### Motivation and Context
We encountered this issue while running [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large), which was converted to ONNX format and executed on a multi-core CPU (Xeon 6338). Below are the performance metrics before and after the modification:

|        | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE | Duration [sec] |
| ------ | ----------------- | ----------------- | ------------ | ---------- | -------------- |
| BEFORE | 1                 | 16                | 512          | 4          | 1.24           |
| AFTER  | 1                 | 16                | 512          | 4          | 1.09           |

Conditions:
- FP32
- CPUExecutionProvider

This change resulted in an end-to-end performance improvement of approximately 14%. Stand-alone `MlasTranspose` performance improvements are as follows:

|                      | INTRA_NUM_THREADS | BEFORE      | AFTER      |
| -------------------- | ----------------- | ----------- | ---------- |
| MlasTranspose [msec] | 16                | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is roughly 15-16x faster.