POWER : Implement MlasGemmQuantKernel using VSX builtins for M = 1 #25490
hariharans29 merged 3 commits into microsoft:main
Conversation
|
@yufenglee Could you review this PR. |
@microsoft-github-policy-service agree company="IBM" |
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 5 pipeline(s). |
|
/azp run Build Linux arm64 Debug / build_test_pipeline, Build Linux arm64 Release / build_test_pipeline, Build Linux CUDA x64 Release / build_test_pipeline, Build Linux CUDA x64 Release / build_test_pipeline, Build Linux TensorRT x64 Release / build_test_pipeline, Build Linux x64 Debug (ASan) / build_test_pipeline, Build and Test OpenVINO EP |
|
No pipelines are associated with this pull request. |
|
/azp run Build and Test OpenVINO EP, build_x64_debug |
|
No pipelines are associated with this pull request. |
Pull Request Overview
This PR adds a specialized VSX-based implementation of MlasGemmQuantKernel optimized for the M=1 case to improve token generation performance for models with batch size 1, achieving a 3-5% improvement.
Key changes:
- Added M=1 optimization in packing function for improved memory layout
- Implemented a specialized MlasGemmQuantKernel_M1 function using VSX vector multiplication
- Modified the main kernel function to route M=1 cases to the optimized implementation
|
Try close & reopen to trigger the checks |
|
@BODAPATIMAHESH - Thanks for the contribution. Can you please address Copilot's comments? |
|
Thanks. I will work on the comments |
Thanks, I will resolve and update |
fec0031 to
95e60a0
Compare
|
@hariharans29 I initially formatted the entire file using clang-format, but realized that most of the changes were unrelated to my patch. So, I updated the formatting to include only the changes relevant to my patch. I'm open to suggestions if there's a preferred approach. |
Formatting your changes is fine. Could you go over Copilot's comments one by one and ensure each is addressed? Also, does your change need new unit tests, or should the existing ones do? |
95e60a0 to
61a4879
Compare
Thanks. |
|
@hariharans29 I've addressed the Copilot review comments. Please take a look at the latest changes. |
Can you please "resolve" them (i.e., click the button) for the Copilot comments? I think that is a gating requirement for merging. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 5 pipeline(s). |
@hariharans29 I hope all are fixed now. |
Thanks. I am not familiar with Power ISA and I am trying to find another reviewer before merging. CC: @jywu-msft |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 5 pipeline(s). |
|
/azp run Linux QNN CI Pipeline |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Can you rebase with main ? I think you need this fix - #25877 |
Added a VSX-based implementation of MlasGemmQuantKernel optimized for the case when M = 1. Verified correctness using ONNX Runtime's built-in tests and onnxruntime_mlas_tests; no regressions observed. Evaluated performance using a Granite 8-bit quantized model and observed an approximately 3-5% improvement in token generation speed.
Fixed the indentation of the qgemm_kernel_power10.cpp file. Avoided const_cast and used a const-correct approach.
59f931b to
9e9e607
Compare
Thanks @hariharans29 . I have rebased with main. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 5 pipeline(s). |
POWER : Added a VSX-based implementation of MlasGemmQuantKernel optimized for the case when M = 1.
Verified correctness using ONNX Runtime's built-in tests and onnxruntime_mlas_tests; no regressions observed.
Evaluated performance using a Granite 8-bit quantized model and observed approximately 3-5% improvement in token generation speed.
Description
When M = 1, the kernel performs the multiplication using the VSX vector builtin vec_msum.
Motivation and Context
To improve token generation performance for models with a batch size of 1.