
[CPU] Add 8bit support to matmulnbits quantizer #24384

Merged
jiafatom merged 14 commits into main from fajin/matmulnbit8bit_quantizer
Apr 14, 2025

Conversation

@fajin-corp
Contributor

@fajin-corp fajin-corp commented Apr 10, 2025

Description

Add 8-bit support to the MatMulNBits quantizer: matmul_4bits_quantizer can now quantize a constant B input of a MatMul node to an 8-bit initializer.

Motivation and Context

MatMul4Bits has an accuracy issue with the phi-4 model used for Foundry Local.
An early prototype showed that >= 6 bits fixes the issue.
To mitigate the issue as soon as possible, this PR adds 8-bit support to MatMulNBits.
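As a rough illustration of what such a quantizer does to a constant B matrix, the sketch below quantizes a weight block-wise to int8 with per-block scales. The function names are made up, and the exact scheme MatMulNBits uses (symmetric vs. asymmetric, block layout, padding) is an assumption here, not taken from this PR:

```python
# Hypothetical sketch of block-wise symmetric 8-bit weight quantization.
# The scheme (per-block symmetric scales, K-dimension blocking) is an
# assumption for illustration, not necessarily what MatMulNBits does.
import numpy as np

def quantize_blockwise_8bit(w: np.ndarray, block_size: int = 32):
    """Quantize a (K, N) weight matrix in blocks of `block_size` along K,
    returning int8 values and per-block float scales."""
    k, n = w.shape
    pad = (-k) % block_size
    wp = np.pad(w, ((0, pad), (0, 0)))           # pad K to a block multiple
    blocks = wp.reshape(-1, block_size, n)        # (num_blocks, block_size, N)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_blockwise_8bit(q, scales, k):
    wp = (q.astype(np.float32) * scales).reshape(-1, q.shape[2])
    return wp[:k]

np.random.seed(0)
w = np.random.randn(70, 16).astype(np.float32)
q, s = quantize_blockwise_8bit(w, block_size=32)
w_hat = dequantize_blockwise_8bit(q, s, w.shape[0])
print(np.abs(w - w_hat).max())  # per-element error is bounded by scale / 2
```

With 8 bits the per-element rounding error is at most half a quantization step (block max / 254), which is why it is much gentler on sensitive layers than 4 bits.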

@fajin-corp fajin-corp requested a review from a team as a code owner April 10, 2025 22:14
@liqunfu
Contributor

liqunfu commented Apr 10, 2025

More context on the accuracy issue would be helpful. For example, what is the computation accuracy (int8 vs. float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used.

I assume there will be other PRs to support the 8-bit MLAS kernel?

@jiafatom
Contributor

> More context on the accuracy issue would be helpful. For example, what is the computation accuracy (int8 vs. float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used.
>
> I assume there will be other PRs to support the 8-bit MLAS kernel?

@liqunfu, we found the accuracy issue when testing the phi-4-mini-instruct model. On CPU we compared int4 against float32 for /lm_head/MatMul: GenAI shows repetitive output with int4, while float32 is fine. This motivated us to implement an int8 kernel.
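For intuition on why the jump from 4 to 8 bits helps, here is a hedged numpy sketch (not the PR's actual test; the block-wise symmetric scheme and names are assumptions) comparing reconstruction error at the two bit widths on an lm_head-style weight:

```python
# Illustrative comparison of 4-bit vs. 8-bit block-wise symmetric
# quantization error. This is a toy sketch, not the PR's benchmark.
import numpy as np

def blockwise_abs_error(w, bits, block_size=32):
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for int4, 127 for int8
    k, n = w.shape
    pad = (-k) % block_size
    blocks = np.pad(w, ((0, pad), (0, 0))).reshape(-1, block_size, n)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -qmax, qmax)
    w_hat = (q * scales).reshape(-1, n)[:k]
    return np.abs(w - w_hat).mean()

np.random.seed(0)
w = np.random.randn(256, 64).astype(np.float32)
e4 = blockwise_abs_error(w, bits=4)
e8 = blockwise_abs_error(w, bits=8)
print(e4, e8)  # the 8-bit error is far smaller, since its step is ~18x finer
```

The quantization step shrinks from block max / 7 to block max / 127, so an 8-bit lm_head can avoid the output-repetition artifacts seen at 4 bits even with the same block size.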

@liqunfu
Contributor

liqunfu commented Apr 11, 2025

> > More context on the accuracy issue would be helpful. For example, what is the computation accuracy (int8 vs. float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used. I assume there will be other PRs to support the 8-bit MLAS kernel?
>
> @liqunfu, we found the accuracy issue when testing the phi-4-mini-instruct model. On CPU we compared int4 against float32 for /lm_head/MatMul: GenAI shows repetitive output with int4, while float32 is fine. This motivated us to implement an int8 kernel.

So embedding and output projection are not 4-bit, correct? What blklen is used?

liqunfu previously approved these changes Apr 11, 2025
Contributor

@liqunfu liqunfu left a comment


My comments on renaming files and extending the Python API to 2-bit can be implemented in future PRs, given the urgency of this work.

@jiafatom jiafatom merged commit 9a993c3 into main Apr 14, 2025
85 of 89 checks passed
@jiafatom jiafatom deleted the fajin/matmulnbit8bit_quantizer branch April 14, 2025 18:26
tianleiwu added a commit that referenced this pull request Apr 19, 2025
### Description

* Rename the file and class name since it supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8 bits support
for the default weight only quantizer.
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
### Description
Add 8-bit support to the MatMulNBits quantizer: matmul_4bits_quantizer can
now quantize a constant B input of a MatMul node to an 8-bit initializer.

### Motivation and Context
MatMul4Bits has an accuracy issue with the phi-4 model used for Foundry Local.
An early prototype showed that >= 6 bits fixes the issue.
To mitigate the issue as soon as possible, this PR adds 8-bit support to
MatMulNBits.
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
### Description

* Rename the file and class name since it supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8 bits support
for the default weight only quantizer.
intbf pushed a commit to intbf/onnxruntime that referenced this pull request Apr 25, 2025
…oft#24472)

### Description

* Rename the file and class name since it supports both 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
microsoft#24384 added 8 bits support
for the default weight only quantizer.

Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
