
[CPU] Add 8bit support to matmulnbits quantizer #24384

Merged
jiafatom merged 14 commits into main from fajin/matmulnbit8bit_quantizer
Apr 14, 2025

Conversation

@fajin-corp
Contributor

@fajin-corp fajin-corp commented Apr 10, 2025

Description

Add 8-bit support to the MatMulNBits quantizer: matmul_4bits_quantizer can now quantize a constant B input of a MatMul node to an 8-bit initializer.

Motivation and Context

MatMul4Bits has an accuracy issue with the phi-4 model used for Foundry Local.
An early prototype showed that >= 6 bits fixes the issue.
To mitigate the issue as soon as possible, this PR adds 8-bit support to MatMulNBits.
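As a rough illustration of what such a quantizer does to a constant B matrix, the sketch below quantizes a weight block-wise to int8 with per-block scales. The function names are made up, and the exact scheme MatMulNBits uses (symmetric vs. asymmetric, block layout, padding) is an assumption here, not taken from this PR:

```python
# Hypothetical sketch of block-wise symmetric 8-bit weight quantization.
# The scheme (per-block symmetric scales, K-dimension blocking) is an
# assumption for illustration, not necessarily what MatMulNBits does.
import numpy as np

def quantize_blockwise_8bit(w: np.ndarray, block_size: int = 32):
    """Quantize a (K, N) weight matrix in blocks of `block_size` along K,
    returning int8 values and per-block float scales."""
    k, n = w.shape
    pad = (-k) % block_size
    wp = np.pad(w, ((0, pad), (0, 0)))           # pad K to a block multiple
    blocks = wp.reshape(-1, block_size, n)        # (num_blocks, block_size, N)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_blockwise_8bit(q, scales, k):
    wp = (q.astype(np.float32) * scales).reshape(-1, q.shape[2])
    return wp[:k]

np.random.seed(0)
w = np.random.randn(70, 16).astype(np.float32)
q, s = quantize_blockwise_8bit(w, block_size=32)
w_hat = dequantize_blockwise_8bit(q, s, w.shape[0])
print(np.abs(w - w_hat).max())  # per-element error is bounded by scale / 2
```

With 8 bits the per-element rounding error is at most half a quantization step (block max / 254), which is why it is much gentler on sensitive layers than 4 bits.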

@fajin-corp fajin-corp requested a review from a team as a code owner April 10, 2025 22:14
@liqunfu
Contributor

liqunfu commented Apr 10, 2025

More context on the accuracy issue would be helpful. For example, what is the computation accuracy (int8 vs. float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used.

I assume there will be other PRs to support the 8-bit MLAS kernel?

@jiafatom
Contributor

> More context on the accuracy issue would be helpful. For example, what is the computation accuracy (int8 vs. float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used.
>
> I assume there will be other PRs to support the 8-bit MLAS kernel?

@liqunfu, we found the accuracy issue when testing the phi-4-mini-instruct model. On CPU we compared int4 against float32 for /lm_head/MatMul: GenAI shows repetitive output with int4, while float32 is fine. This motivated us to implement an int8 kernel.
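For intuition on why the jump from 4 to 8 bits helps, here is a hedged numpy sketch (not the PR's actual test; the block-wise symmetric scheme and names are assumptions) comparing reconstruction error at the two bit widths on an lm_head-style weight:

```python
# Illustrative comparison of 4-bit vs. 8-bit block-wise symmetric
# quantization error. This is a toy sketch, not the PR's benchmark.
import numpy as np

def blockwise_abs_error(w, bits, block_size=32):
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for int4, 127 for int8
    k, n = w.shape
    pad = (-k) % block_size
    blocks = np.pad(w, ((0, pad), (0, 0))).reshape(-1, block_size, n)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -qmax, qmax)
    w_hat = (q * scales).reshape(-1, n)[:k]
    return np.abs(w - w_hat).mean()

np.random.seed(0)
w = np.random.randn(256, 64).astype(np.float32)
e4 = blockwise_abs_error(w, bits=4)
e8 = blockwise_abs_error(w, bits=8)
print(e4, e8)  # the 8-bit error is far smaller, since its step is ~18x finer
```

The quantization step shrinks from block max / 7 to block max / 127, so an 8-bit lm_head can avoid the output-repetition artifacts seen at 4 bits even with the same block size.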

@liqunfu
Contributor

liqunfu commented Apr 11, 2025

> > More context on the accuracy issue would be helpful. For example, what is the computation accuracy (int8 vs. float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used. I assume there will be other PRs to support the 8-bit MLAS kernel?
>
> @liqunfu, we found the accuracy issue when testing the phi-4-mini-instruct model. On CPU we compared int4 against float32 for /lm_head/MatMul: GenAI shows repetitive output with int4, while float32 is fine. This motivated us to implement an int8 kernel.

So embedding and output projection are not 4-bit, correct? What blklen is used?

liqunfu previously approved these changes Apr 11, 2025
Contributor

@liqunfu liqunfu left a comment


My comments on renaming files and extending the Python API to 2-bit can be implemented in future PRs, given the urgency of this work.

@jiafatom jiafatom merged commit 9a993c3 into main Apr 14, 2025
85 of 89 checks passed
@jiafatom jiafatom deleted the fajin/matmulnbit8bit_quantizer branch April 14, 2025 18:26
tianleiwu added a commit that referenced this pull request Apr 19, 2025
### Description

* Rename the file and class name since it supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8 bits support
for the default weight only quantizer.
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
### Description
Add 8-bit support to the MatMulNBits quantizer: matmul_4bits_quantizer can
now quantize a constant B input of a MatMul node to an 8-bit initializer.

### Motivation and Context
MatMul4Bits has an accuracy issue with the phi-4 model used for Foundry Local.
An early prototype showed that >= 6 bits fixes the issue.
To mitigate the issue as soon as possible, this PR adds 8-bit support to
MatMulNBits.
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
### Description

* Rename the file and class name since it supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8 bits support
for the default weight only quantizer.
intbf pushed a commit to intbf/onnxruntime that referenced this pull request Apr 25, 2025
…oft#24472)

### Description

* Rename the file and class name since it supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
microsoft#24384 added 8 bits support
for the default weight only quantizer.

Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
