add 4bits channel-wised quantization capability for MatMulNbits Op #631
Closed
bopeng1234 wants to merge 2 commits into intel:ovep-develop
Conversation
These file changes need to be sent directly to Microsoft. If you are from Intel, please contact @ankitm3k.
@bopeng1234 Please file a JIRA with all your findings and rebase this branch onto the source branch ASAP. Kindly confirm that the change enabling CW quantization is also valid for quant_format=QuantFormat.QDQ. Kindly share the recipe to create an int4 quantized model with me here or in the JIRA as a reproducer.
Author
@ankitm3k, already filed a JIRA and added the QDQ format. The command to create an NPU-friendly int4 CW quantized ONNX model is also attached in the JIRA.
Author
Created a new PR and closed this one: #669.
Description
Add 4-bit channel-wise quantization capability for the MatMulNBits op for the phi3 model; it improves TPS (tokens per second) on Intel NPU.
JIRA - https://jira.devtools.intel.com/browse/EISW-163602
Motivation and Context
As Intel's NPU support for LLMs shows (https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support), if we want to run an ONNX quantized model such as phi3 on the Intel NPU, the quantized model needs to meet two requirements: the weights must be quantized channel-wise (one scale per output channel rather than per block), and the quantization must be symmetric.
So this PR's change enables channel-wise, symmetric quantization.
Weights are quantized to int4 in the range [-8, 7]. We tested it with corresponding onnxruntime-genai changes (we also created a PR to onnxruntime-genai to support this extra argument: microsoft/onnxruntime-genai#1362).
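For illustration only, the NumPy sketch below shows what channel-wise symmetric int4 quantization means in this context; it is not the actual builder or MatMulNBits quantizer code, and the function name and shapes are placeholders. One scale is computed per output channel (instead of per block), and weights are mapped into [-8, 7]:

```python
import numpy as np

def quantize_int4_channelwise_symmetric(weights: np.ndarray):
    """Illustrative per-output-channel symmetric int4 quantization.

    weights: float32 matrix of shape (K, N); one scale per output column.
    """
    # One scale per output channel, chosen so the largest magnitude maps to 7.
    max_abs = np.max(np.abs(weights), axis=0)   # shape (N,)
    scales = max_abs / 7.0
    scales[scales == 0] = 1.0                   # avoid division by zero
    # Quantize and clip into the signed 4-bit range [-8, 7].
    q = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return q, scales

# Round-trip check: dequantized weights should approximate the originals.
w = np.random.randn(64, 16).astype(np.float32)
q, s = quantize_int4_channelwise_symmetric(w)
w_hat = q.astype(np.float32) * s
print(np.max(np.abs(w - w_hat)))
```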
Command:

```
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1
```

Normally, without the channel-wise quantized model, phi3 with NPUW runs at about 4000 ms per token with the KV-cache model.
With this PR applied, phi3 with NPUW runs at about 150 ms per token, a speedup of more than 20x.
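As a rough sanity check (not part of this PR), a model quantized this way can be loaded through the OpenVINO Execution Provider targeting the NPU. This is a minimal sketch assuming the model path is the builder output above and that your onnxruntime build includes the OpenVINO EP with NPU support:

```python
import onnxruntime as ort

# Minimal load check for the int4 channel-wise quantized model on Intel NPU.
# The model path is a placeholder; valid "device_type" values depend on your
# OpenVINO EP build (e.g. "NPU", "CPU", "GPU").
session = ort.InferenceSession(
    r"E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified\model.onnx",
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "NPU"}],
)
print(session.get_providers())  # confirm OpenVINOExecutionProvider is active
```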