Add support for uint8_t as data type for GatherBlockQuantized #24239
sushraja-msft merged 8 commits into main
Conversation
…d.cc Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
In theory we'd need to rev the opdef, but I believe there is no harm in this case:
Yes, there should be. Can you tell me where the existing tests are and how to run them locally so I can add to them? I am new to making ORT CPU changes.
Are the tests here? Those tests seem to indicate that I would fail because I added uint8_t support, yet I am passing the CI.
yihonglyu left a comment:
I believe some .md files under the docs should be updated as well.
Could you point me to it please? I see a documentation string in contrib_defs.cc that has been updated; is there some other documentation?
Done, updated.
`block_size` must be a power of 2 and not smaller than 16, e.g. 16, 32, 64, 128, ...
2. Input `data`'s scale and zero point are specified by inputs `scales` and `zero_points`. `scales` and `zero_points` are also constants.
If `zero_points` is not provided, 0 is the zero point, except when `data` is uint8 type, in which case the default zero point is 8.
Why is the default zero point 8 for uint8? That does not sound reasonable to me.
Normally, the default is the middle value 2^(bits - 1), so 128 for 8 bits and 8 for 4 bits.
Maybe add a description that this operator only supports 4 bits.
This uint8 stores two packed uint4s, because that is how MatMulNBits works. To resolve this issue, I was recently discussing adding a `bits` attribute; that would let the uint8_t be interpreted as either packed uint4s or a single uint8.
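For concreteness, here is a minimal numpy sketch of that packing and of dequantization with the default zero point of 8. It assumes the MatMulNBits convention of packing the low nibble first; the helper names are illustrative and not part of this PR.

```python
import numpy as np

def unpack_uint4(packed: np.ndarray) -> np.ndarray:
    """Split each uint8 into two uint4 values, low nibble first (assumed convention)."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def dequantize_block(packed: np.ndarray, scale: float, zero_point: int = 8) -> np.ndarray:
    """Dequantize packed uint4 data: (q - zero_point) * scale."""
    q = unpack_uint4(packed).astype(np.float32)
    return (q - zero_point) * scale

packed = np.array([[0x21, 0x43]], dtype=np.uint8)  # holds the uint4 values 1, 2, 3, 4
print(dequantize_block(packed, scale=0.5))         # [[-3.5 -3.  -2.5 -2. ]]
```

With zero_point = 8, the uint4 range [0, 15] maps to roughly symmetric values around zero, which is why 8 (the midpoint 2^(4-1)) is the default when no `zero_points` input is provided.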
Description
This change adds support for GatherBlockQuantized to use uint8_t as the data type, with the same semantics as MatMulNBits. Zero points and gather axes other than 0 are not yet supported, in order to keep the change scoped.
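For illustration, here is a hedged sketch of building a model that uses the op with uint8 (packed uint4) data. The input and attribute names (`data`, `indices`, `scales`, `gather_axis`, `quantize_axis`, `block_size`) follow my reading of the com.microsoft contrib-ops documentation, and the shapes are invented, not taken from this PR.

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper

vocab, hidden = 1024, 256  # illustrative sizes
packed = np.random.randint(0, 256, size=(vocab, hidden // 2), dtype=np.uint8)
scales = np.random.rand(vocab, hidden // 128).astype(np.float32)

node = helper.make_node(
    "GatherBlockQuantized",
    inputs=["data", "indices", "scales"],  # zero_points omitted -> default of 8
    outputs=["output"],
    domain="com.microsoft",
    gather_axis=0,       # only axis 0 is supported with uint8 data in this change
    quantize_axis=1,
    block_size=128,
)

graph = helper.make_graph(
    [node],
    "gbq_uint8",
    inputs=[helper.make_tensor_value_info("indices", TensorProto.INT64, [None])],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT, [None, hidden])],
    initializer=[
        numpy_helper.from_array(packed, "data"),
        numpy_helper.from_array(scales, "scales"),
    ],
)
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
)
```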
Motivation and Context
With newer llama-style models like Phi4 trained with shared embeddings, the weights of the lm_head matrix and the embedding table are exactly the same. These embeddings are huge: unquantized, they are 1.2GB in Phi4 mini instruct, and at int4 quantization the weights are still 300MB. We can go a step further and have the two ops, the lm_head MatMulNBits and GatherBlockQuantized, share the same weights, which would save 300MB of model size.
The two things that hinder that are the shape expectations for GatherBlockQuantized and the data types supported for its data input. The shape mismatch can be solved via a simple Reshape op, but the data type requires code changes, and that is what this change does. A sketch of the resulting graph follows.
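A hedged sketch of what that sharing could look like at the graph level: one quantized initializer feeds GatherBlockQuantized directly and reaches MatMulNBits through a Reshape, so the weights are stored once. The MatMulNBits weight layout assumed here ([N, K/block_size, block_size/2] for 4 bits) and all sizes are my assumptions, not taken from this PR.

```python
from onnx import helper

# Embedding lookup reads the shared 2-D [vocab, hidden/2] packed weights directly.
embed = helper.make_node(
    "GatherBlockQuantized",
    ["shared_weights", "token_ids", "scales"], ["embeddings"],
    domain="com.microsoft", gather_axis=0, quantize_axis=1, block_size=128)

# Reshape the same bytes into the 3-D layout MatMulNBits expects (assumed layout).
to_matmul_layout = helper.make_node(
    "Reshape", ["shared_weights", "lm_head_shape"], ["lm_head_weights"])

# lm_head projection reuses the reshaped weights; K/N values are illustrative.
logits = helper.make_node(
    "MatMulNBits",
    ["hidden_states", "lm_head_weights", "scales"], ["logits"],
    domain="com.microsoft", K=256, N=1024, bits=4, block_size=128)
```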
Here is Phi4 modified with shared weights between the lm_head MatMulNBits and GatherBlockQuantized; this model is just 2.1GB on disk.
