MlasTranspose multi-threading support #24261
Conversation

/azp run Big Models, Linux CPU Minimal Build E2E CI Pipeline, Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 7 pipeline(s).

@microsoft-github-policy-service agree company="Fujitsu Ltd."
##[error]D:\a_work\onnxruntime\onnxruntime\onnxruntime\core\mlas\lib\transpose.cpp(986,5): error C2664: 'void MlasExecuteThreaded(MLAS_THREADED_ROUTINE (__cdecl *),void *,ptrdiff_t,MLAS_THREADPOOL *)': cannot convert argument 1 from 'void (__stdcall *)(void *,ptrdiff_t)' to 'MLAS_THREADED_ROUTINE (__cdecl *)' [D:\a_work\onnxruntime\onnxruntime\build\RelWithDebInfo\onnxruntime_mlas.vcxproj]
Warning (CLANGFORMAT) format
OK, I'll apply the patch and push again. |

@amarin16 please test this out and review. Thanks!

The code changes look good to me. Waiting for the pipelines to pass.

@msy-kato Could you please provide some details about how you ran the performance tests? Did you use onnxruntime-genai?
Thanks for the review.
Sure! I converted the HF model with the following script (`convert.py`):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
input_texts = [' '.join(['Hello']) * 32] * 2
inputs = dict(tokenizer(input_texts, return_tensors="pt"))
torch.onnx.export(
    model,
    inputs,
    "model.onnx",
    input_names=list(inputs.keys()),
    output_names=['last_hidden_state', 'pooler_output'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'max_input_length'},
        'attention_mask': {0: 'batch_size', 1: 'max_input_length'},
    }
)
```

and measured performance with `run.py`:

```python
import onnxruntime
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
input_texts = [' '.join(['Hello']) * 510] * 4
options = onnxruntime.SessionOptions()
options.inter_op_num_threads = 1
options.intra_op_num_threads = 16
ort_session = onnxruntime.InferenceSession("model.onnx", sess_options=options)
batch_dict = dict(tokenizer(input_texts, max_length=512, return_tensors="pt"))
batch_dict = {name: tensor.numpy() for name, tensor in batch_dict.items()}

# warmup
_ = ort_session.run(['last_hidden_state'], batch_dict)

start_time = time.time()
for i in range(10):
    _ = ort_session.run(['last_hidden_state'], batch_dict)
end_time = time.time()
print('step duration(avg) = {:.7f} sec/step'.format((end_time - start_time) / 10))
```

Commands:

```shell
$ python3 convert.py
$ numactl -C 0-15 python3 run.py
```
@amarin16 Thank you for approving my PR. I noticed that the CI/CD pipeline hasn't completed yet. Could you advise if there's anything I can do?

You could try closing the PR and re-opening it.

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
Force-pushed from 1a22f09 to 09aade9

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
### Description
`MlasTranspose` previously ran single-threaded, which resulted in suboptimal performance on multi-threaded CPUs. To address this, it has been modified to utilize multi-threading.

### Motivation and Context
We encountered this issue while running [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large), which was converted to ONNX format and executed on a multi-core CPU (Xeon 6338). Below are the performance metrics before and after the modification:

|        | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE | Duration [sec] |
| ------ | ----------------- | ----------------- | ------------ | ---------- | -------------- |
| BEFORE | 1                 | 16                | 512          | 4          | 1.24           |
| AFTER  | 1                 | 16                | 512          | 4          | 1.09           |

Conditions:
- FP32
- CPUExecutionProvider

This change resulted in an end-to-end performance improvement of approximately 14%. Stand-alone `MlasTranspose` performance improvements are as follows:

|                      | INTRA_NUM_THREADS | BEFORE      | AFTER      |
| -------------------- | ----------------- | ----------- | ---------- |
| MlasTranspose [msec] | 16                | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is roughly 15-16x faster.