[CPU] Add 8bit support to matmulnbits quantizer #24384
Conversation
More context on the accuracy issue would be helpful. For example: what is the computation accuracy (int8 vs float16/float32)? What is the block size? It would also be good to know how the accuracy issue was detected, and which CPU was used. I assume there will be other PRs to add an 8-bit MLAS kernel?
@liqunfu, we found the accuracy issue while testing the phi-4 mini instruct model on CPU, comparing int4 vs float32 for /lm_head/MatMul. GenAI shows repetition with int4, while float32 output is good. This motivated us to implement an int8 kernel.
So the embedding and output projection are not 4-bit, correct? What blklen is used?
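As background on why moving from 4 to 8 bits helps, here is a minimal NumPy sketch of block-wise quantize-dequantize showing the reconstruction error shrink as bit width grows. This is illustration only, not the actual quantizer code; the symmetric rounding scheme and the `blklen` default are assumptions.

```python
import numpy as np

def quantize_dequantize(w, bits, blklen=32):
    """Symmetric block-wise quantize then dequantize a 1-D float array.
    Illustration only; the real MatMulNBits scheme may differ."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    out = np.empty_like(w)
    for start in range(0, w.size, blklen):
        block = w[start:start + blklen]
        amax = np.abs(block).max()
        scale = amax / qmax if amax > 0 else 1.0
        q = np.clip(np.round(block / scale), -qmax - 1, qmax)
        out[start:start + blklen] = q * scale
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
for bits in (4, 6, 8):
    err = float(np.abs(w - quantize_dequantize(w, bits)).max())
    print(f"{bits}-bit max abs reconstruction error: {err:.5f}")
```

Each extra bit roughly halves the per-block rounding error, which matches the observation later in this thread that >= 6 bits fixes the repetition issue.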
liqunfu left a comment:
My comments about renaming files and extending the 2-bit Python API can be addressed in future PRs, given the urgency of this work.
Force-pushed from 1be4c8e to 700c1d5.
### Description
* Rename filename and class name since it supports 4 and 8 bits.
* Update HQQWeightOnlyQuantizer to support 8 bits.
* Update some comments.

### Motivation and Context
#24384 added 8-bit support for the default weight-only quantizer.
Description
Add 8-bit support to the MatMulNBits quantizer. matmul_4bits_quantizer can now quantize a constant B input of a MatMul into an 8-bit initializer.
Motivation and Context
MatMul4Bits has an accuracy issue with the phi-4 model used for Foundry Local.
An early prototype showed that >= 6 bits fixes the issue.
To mitigate the issue as soon as possible, this change adds 8-bit support to MatMulNBits.
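To make the storage trade-off concrete, here is a hedged sketch contrasting how 4-bit and 8-bit quantized weights are stored. The helpers are hypothetical; the low-nibble-first order is an assumption and may not match the actual MLAS packing layout.

```python
import numpy as np

def pack_4bit(q):
    """Pack uint8 values in [0, 15], two nibbles per byte.
    Hypothetical: low nibble first; actual MLAS layout may differ."""
    q = q.reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

def pack_8bit(q):
    """8-bit weights need no packing: one byte per weight."""
    return q.astype(np.uint8)

weights = np.arange(16, dtype=np.uint8) % 16
print(pack_4bit(weights).nbytes)   # 8 bytes: two weights per byte
print(pack_8bit(weights).nbytes)   # 16 bytes: one weight per byte
```

8-bit storage doubles the weight bytes relative to 4-bit while the per-block scales stay the same size, which is the cost accepted here in exchange for fixing the accuracy regression quickly.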