[x86] MatMulNBits x64 kernel for 8 bits (#24491)
Conversation
liqunfu left a comment
LGTM. It would be better if you add performance results, either the MLAS benchmark or, even better, an ort-genai benchmark. The ort-genai benchmark is preferred because I have found that the MLAS benchmark tends to show improvements that do not show up with genai.
"For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down." Does this issue exist with 4-bit non-VNNI?
4-bit does not have this issue. It is caused by (i8 * i8) * 2: the result is stored in an i16, and the sum of two i8×i8 products can overflow it, whereas (i4 * i4) * 2 always fits in an i16.
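The range arithmetic behind this answer can be checked directly. A minimal illustration (not ORT code) of why a pair of 8-bit products can overflow a 16-bit accumulator while a pair of 4-bit products cannot:

```python
# Illustration only: worst-case pairwise products vs the i16 range.
I16_MAX = 2**15 - 1  # 32767

# vpmaddubsw-style pairing: unsigned 8-bit (0..255) times signed 8-bit
# (-128..127), two adjacent products summed into an i16 lane.
worst_u8s8_pair = 2 * (255 * 127)        # 64770: exceeds I16_MAX
# Even a purely signed 8-bit pairing overflows by one in the worst case:
worst_s8s8_pair = 2 * ((-128) * (-128))  # 32768: exceeds I16_MAX

# 4-bit worst case: values in -8..7 (signed) or 0..15 (unsigned).
worst_4bit_pair = 2 * (15 * 8)           # 240: fits comfortably in i16

print(worst_u8s8_pair > I16_MAX)   # True -> 8-bit path needs extra widening
print(worst_s8s8_pair > I16_MAX)   # True
print(worst_4bit_pair <= I16_MAX)  # True -> 4-bit path can stay in i16
```

This is why the 8-bit non-VNNI kernel must spend extra instructions widening partial sums, while the 4-bit kernel does not.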
### Description

Add 8-bit support for MatMulNBits on x86.

__AVX512 VNNI__

| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** |
| 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** |
| 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** |
| 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** |
| 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** |
| 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** |
| 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** |
| 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** |

__AVX512__

| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** |
| 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** |
| 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** |
| 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** |
| 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** |
| 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** |
| 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** |
| 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** |

Token generation is bottlenecked by memory access, and the 8-bit model being 2× the size of the 4-bit model is the main reason token generation slows down. On non-VNNI platforms, an i16 accumulator cannot safely hold the sum of products of 4 i8 values, so extra instructions are needed to avoid overflow. This is the main reason for the non-VNNI slowdown.

### Motivation and Context

The MatMul4Bits model has a repetition issue; the 8-bit model resolves it.
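The VNNI/non-VNNI gap described above comes down to where widening happens. A hedged, scalar sketch (not the actual MLAS kernel, and the function names are illustrative): VNNI's `VPDPBUSD` multiplies four u8×s8 pairs and accumulates directly into an i32, so overflow is impossible; without VNNI, the usual pairing instruction produces i16 partial sums, which can overflow for 8-bit operands, so products must be widened to i32 earlier at the cost of extra instructions per element.

```python
# Scalar sketch of the two accumulation strategies. Assumes numpy;
# names are hypothetical, not MLAS symbols.
import numpy as np

def dot_vnni_style(a_u8, b_s8, acc=0):
    # VPDPBUSD semantics: u8*s8 products summed straight into an i32
    # accumulator, so no intermediate can overflow.
    return acc + int(np.sum(a_u8.astype(np.int32) * b_s8.astype(np.int32)))

def dot_non_vnni_safe(a_u8, b_s8, acc=0):
    # Emulated safe non-VNNI path: widen each product to i32 *before*
    # pairwise summation, mimicking the extra widening instructions the
    # 8-bit kernel must issue to avoid i16 overflow.
    prods = a_u8.astype(np.int32) * b_s8.astype(np.int32)  # widen first
    pairs = prods.reshape(-1, 2).sum(axis=1)               # now safe in i32
    return acc + int(pairs.sum())

a = np.array([255, 255, 1, 2], dtype=np.uint8)
b = np.array([127, 127, -3, 4], dtype=np.int8)
print(dot_vnni_style(a, b) == dot_non_vnni_safe(a, b))  # True: same result
# A naive i16 pairwise sum would have wrapped or saturated on
# 255*127*2 = 64770, which exceeds the i16 maximum of 32767.
```

Both paths produce the same dot product; the non-VNNI path simply pays for the early widening, which is consistent with the ~1.5× slowdown in the AVX512 (non-VNNI) table.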