cache_utils: fix QuantizedLayer to correctly propagate reorder_cache, crop, and batch ops to quantized buffers #45510
Conversation
| """Returns the sequence length of the cached states.""" | ||
| return self.cumulative_length | ||
|
|
||
| def reorder_cache(self, beam_idx: torch.LongTensor) -> None: |
tbh haven't seen any requests to add beam search support with quantized cache previously, do you have a need for that or is this an improvement PR?
If you need this feature, I don't mind adding it. But we need a test in `tests/generation/test_utils.py`. Otherwise, I'd prefer not to increase maintenance cost for things that are "nice to have but not used".
The branch was force-pushed from 851a938 to 3a04d20.
Hi @zucchini-nlp, thanks for the feedback! You're right, this was primarily an improvement PR and isn't a strict requirement for my current workflow. Since it falls into the "nice to have" category and I definitely understand the goal of keeping maintenance costs down, I'll go ahead and close this one out. Thanks again for your time and the review!
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45510&sha=3a04d2
What does this PR do?
`QuantizedLayer` maintains two separate storage regions: a full-precision residual buffer (`self.keys`/`self.values`) and a quantized buffer (`self._quantized_keys`/`self._quantized_values`). However, the four mutation methods inherited from `DynamicLayer` (`reorder_cache`, `crop`, `batch_repeat_interleave`, and `batch_select_indices`) only operated on the residual buffer, silently leaving the quantized buffer untouched.
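For orientation, here is a minimal sketch of that two-buffer layout; the attribute names come from the PR description, while the class body itself is a simplified stand-in rather than the real implementation:

```python
class QuantizedLayerSketch:
    """Simplified stand-in for QuantizedLayer, not the actual class."""

    def __init__(self):
        # Full-precision residual buffer: the most recent key/value states,
        # kept as plain tensors until they are moved into the quantized store.
        self.keys = None
        self.values = None
        # Quantized buffer: opaque backend objects (quanto or HQQ) holding
        # the older, compressed part of the cache.
        self._quantized_keys = None
        self._quantized_values = None
        # Total number of cached tokens across both buffers.
        self.cumulative_length = 0
```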
Concrete failure modes:
- Beam search (`reorder_cache`): the quantized buffer stays in the original beam order while the residual buffer is reordered, causing crossed attention across beams with no error raised (a hypothetical repro follows this list).
- Cropping (`crop`): `cumulative_length` diverges from the actual stored state, corrupting subsequent `get_seq_length` calls.
- Batch ops (`batch_select_indices`, `batch_repeat_interleave`): the batch dimension of the quantized buffer is never updated, producing mismatched batch sizes between the two storage regions.
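A hypothetical repro for the first failure mode; the checkpoint and config values are placeholders, and it assumes a quantization backend such as quanto is installed and that generation runs long enough for states to reach the quantized buffer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The capital of France is", return_tensors="pt")

# Beam search calls reorder_cache on every decoding step. Before this fix,
# only the residual buffer followed beam_idx; the quantized buffer kept its
# original beam order, so beams silently attended to each other's past states.
out = model.generate(
    **inputs,
    num_beams=4,
    max_new_tokens=32,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tok.decode(out[0], skip_special_tokens=True))
```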
This PR overrides all four methods in `QuantizedLayer`. Since `_quantized_keys`/`_quantized_values` are opaque backend objects for both `QuantoQuantizedLayer` and `HQQQuantizedLayer`, the fix uses a dequantize → operate → re-quantize pattern, which is backend-agnostic and does not compound quantization error meaningfully beyond what is already introduced at the original quantization step.
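As a rough illustration of that pattern, here is a sketch of one of the four overrides; the `_dequantize`/`_quantize` helper names are assumptions, and the real backend hooks and signatures may differ:

```python
import torch

def reorder_cache(self, beam_idx: torch.LongTensor) -> None:
    """Reorder both storage regions along the beam dimension (sketch only)."""
    # The residual buffer holds plain tensors, so it can be indexed directly.
    if self.keys is not None and self.keys.numel() > 0:
        self.keys = self.keys.index_select(0, beam_idx.to(self.keys.device))
        self.values = self.values.index_select(0, beam_idx.to(self.values.device))
    # The quantized buffer is an opaque backend object: round-trip it through
    # full precision so the same indexing works for both quanto and HQQ.
    if self._quantized_keys is not None:
        keys = self._dequantize(self._quantized_keys)      # assumed helper
        values = self._dequantize(self._quantized_values)  # assumed helper
        keys = keys.index_select(0, beam_idx.to(keys.device))
        values = values.index_select(0, beam_idx.to(values.device))
        self._quantized_keys = self._quantize(keys)        # assumed helper
        self._quantized_values = self._quantize(values)
```

`crop` and the two batch ops would follow the same shape: dequantize, apply the corresponding `DynamicLayer` operation, then re-quantize.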
Fixes # (issue)
Before submitting

- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@gante @SunMarc