Add support for uint8_t as data type for GatherBlockQuantized #24239
sushraja-msft merged 8 commits into main
Conversation
…d.cc Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
In theory we'd need to rev the opdef, but I believe there is no harm in this case:
Yes, there should be. Can you tell me where the existing tests are and how to run them locally so I can add to them? I am new to making ORT CPU changes.
Are the tests here? Those tests seem to indicate that I would fail because I added uint8_t support, yet I am passing the CI.
yihonglyu left a comment:
I believe some .md files under the docs should be updated as well.
Could you point me to it please? I see a documentation string in contrib_defs.cc that has been updated; is there some other documentation?
Done, updated.
`block_size` must be a power of 2 and not smaller than 16, e.g. 16, 32, 64, 128, ...
2. Input `data`'s scale and zero point are specified by inputs `scales` and `zero_points`. `scales` and `zero_points` are also constants.
If `zero_points` is not provided, 0 is the zero point, except when `data` is uint8 type, in which case the default zero point is 8.
Why is the default zero point 8 for uint8? That does not sound reasonable to me.
Normally, the default is the middle value 2^(bits - 1), so 128 for 8 bits and 8 for 4 bits.
Maybe add a description that this operator only supports 4 bits.
This uint8 stores two packed uint4s, because that is how MatMulNBits works. To resolve this issue, I was recently discussing adding a `bits` attribute; that would let the uint8_t be interpreted as either packed uint4s or a single uint8.
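For concreteness, here is a minimal numpy sketch of that packing and of dequantization with the default zero point of 8. It assumes the MatMulNBits convention of packing the low nibble first; the helper names are illustrative and not part of this PR.

```python
import numpy as np

def unpack_uint4(packed: np.ndarray) -> np.ndarray:
    """Split each uint8 into two uint4 values, low nibble first (assumed convention)."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def dequantize_block(packed: np.ndarray, scale: float, zero_point: int = 8) -> np.ndarray:
    """Dequantize packed uint4 data: (q - zero_point) * scale."""
    q = unpack_uint4(packed).astype(np.float32)
    return (q - zero_point) * scale

packed = np.array([[0x21, 0x43]], dtype=np.uint8)  # holds the uint4 values 1, 2, 3, 4
print(dequantize_block(packed, scale=0.5))         # [[-3.5 -3.  -2.5 -2. ]]
```

With zero_point = 8, the uint4 range [0, 15] maps to roughly symmetric values around zero, which is why 8 (the midpoint 2^(4-1)) is the default when no `zero_points` input is provided.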
Description
This change adds support for GatherBlockQuantized to use uint8_t as the data type, with the same semantics as MatMulNBits. Zero points and gather axes other than 0 are not yet supported, in order to keep the change scoped.
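For illustration, here is a hedged sketch of building a model that uses the op with uint8 (packed uint4) data. The input and attribute names (`data`, `indices`, `scales`, `gather_axis`, `quantize_axis`, `block_size`) follow my reading of the com.microsoft contrib-ops documentation, and the shapes are invented, not taken from this PR.

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper

vocab, hidden = 1024, 256  # illustrative sizes
packed = np.random.randint(0, 256, size=(vocab, hidden // 2), dtype=np.uint8)
scales = np.random.rand(vocab, hidden // 128).astype(np.float32)

node = helper.make_node(
    "GatherBlockQuantized",
    inputs=["data", "indices", "scales"],  # zero_points omitted -> default of 8
    outputs=["output"],
    domain="com.microsoft",
    gather_axis=0,       # only axis 0 is supported with uint8 data in this change
    quantize_axis=1,
    block_size=128,
)

graph = helper.make_graph(
    [node],
    "gbq_uint8",
    inputs=[helper.make_tensor_value_info("indices", TensorProto.INT64, [None])],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT, [None, hidden])],
    initializer=[
        numpy_helper.from_array(packed, "data"),
        numpy_helper.from_array(scales, "scales"),
    ],
)
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
)
```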
Motivation and Context
With newer llama-style models like Phi4 trained with shared embeddings, the weights of the lm_head matrix and the embedding table are exactly the same. These embeddings are huge: unquantized, they are 1.2GB in Phi4 mini instruct, and at int4 quantization the weights are still 300MB. We can go a step further and have the two ops, the lm_head MatMulNBits and GatherBlockQuantized, share the same weights, which would save 300MB of model size.
The two things that hinder that are the shape expectations for GatherBlockQuantized and the data types supported for its data input. The shape mismatch can be solved via a simple Reshape op, but the data type requires code changes, and that is what this change does. A sketch of the resulting graph follows.
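A hedged sketch of what that sharing could look like at the graph level: one quantized initializer feeds GatherBlockQuantized directly and reaches MatMulNBits through a Reshape, so the weights are stored once. The MatMulNBits weight layout assumed here ([N, K/block_size, block_size/2] for 4 bits) and all sizes are my assumptions, not taken from this PR.

```python
from onnx import helper

# Embedding lookup reads the shared 2-D [vocab, hidden/2] packed weights directly.
embed = helper.make_node(
    "GatherBlockQuantized",
    ["shared_weights", "token_ids", "scales"], ["embeddings"],
    domain="com.microsoft", gather_axis=0, quantize_axis=1, block_size=128)

# Reshape the same bytes into the 3-D layout MatMulNBits expects (assumed layout).
to_matmul_layout = helper.make_node(
    "Reshape", ["shared_weights", "lm_head_shape"], ["lm_head_weights"])

# lm_head projection reuses the reshaped weights; K/N values are illustrative.
logits = helper.make_node(
    "MatMulNBits",
    ["hidden_states", "lm_head_weights", "scales"], ["logits"],
    domain="com.microsoft", K=256, N=1024, bits=4, block_size=128)
```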
Here is Phi4 modified with shared weights between the lm_head MatMulNBits and GatherBlockQuantized; this model is just 2.1GB on disk.
