
cache_utils: fix QuantizedLayer to correctly propagate reorder_cache, crop, and batch ops to quantized buffers#45510

Closed
GitGlimpse895 wants to merge 1 commit into huggingface:main from GitGlimpse895:fix/quantized-layer-cache-ops

Conversation


@GitGlimpse895 GitGlimpse895 commented Apr 19, 2026

What does this PR do?

QuantizedLayer maintains two separate storage regions: a full-precision
residual buffer (self.keys / self.values) and a quantized buffer
(self._quantized_keys / self._quantized_values). However, the four
mutation methods inherited from DynamicLayer (reorder_cache,
crop, batch_repeat_interleave, and batch_select_indices) only
operated on the residual buffer, silently leaving the quantized buffer
untouched.

Concrete failure modes:

  • Beam search (reorder_cache): the quantized buffer stays in
    original beam order while the residual reorders, causing crossed
    attention across beams with no error raised.
  • Constrained generation rollback (crop): cumulative_length
    diverges from the actual stored state, corrupting subsequent
    get_seq_length calls.
  • Group beam search / contrastive decoding (batch_select_indices,
    batch_repeat_interleave): batch dimension of the quantized buffer
    is never updated, producing mismatched batch sizes between the two
    storage regions.
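The beam-search failure mode can be reproduced with a toy two-buffer cache. This is a hypothetical sketch (ToyTwoBufferCache is illustrative, not the actual QuantizedLayer), showing the pre-fix behavior where reorder_cache only permutes the residual buffer:

```python
import torch

class ToyTwoBufferCache:
    """Toy stand-in for QuantizedLayer's two storage regions."""

    def __init__(self, keys: torch.Tensor, quantized_keys: torch.Tensor):
        self.keys = keys                        # full-precision residual buffer
        self._quantized_keys = quantized_keys   # quantized buffer (here: plain int8)

    def reorder_cache(self, beam_idx: torch.LongTensor) -> None:
        # The bug this PR fixes: only the residual buffer is permuted;
        # the quantized buffer silently keeps the original beam order.
        self.keys = self.keys.index_select(0, beam_idx)

keys = torch.arange(4, dtype=torch.float32).view(4, 1)
cache = ToyTwoBufferCache(keys.clone(), keys.to(torch.int8))
cache.reorder_cache(torch.tensor([2, 2, 0, 1]))

# The two buffers now disagree on beam order, with no error raised.
print(cache.keys.squeeze(1).tolist())             # [2.0, 2.0, 0.0, 1.0]
print(cache._quantized_keys.squeeze(1).tolist())  # [0, 1, 2, 3]
```

Attention computed from these two regions then mixes tokens from different beams.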

This PR overrides all four methods in QuantizedLayer. Since
_quantized_keys/_quantized_values are opaque backend objects for
both QuantoQuantizedLayer and HQQQuantizedLayer, the fix uses a
dequantize → operate → re-quantize pattern, which is backend-agnostic
and does not compound quantization error meaningfully beyond what is
already introduced at the original quantization step.
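A minimal sketch of that pattern for reorder_cache, using a trivial int8 scale quantizer as a stand-in for the backend (the quantize/dequantize hooks are illustrative; the real backends are Quanto and HQQ, whose APIs are not shown here):

```python
import torch

def reorder_quantized_buffer(quantized, beam_idx, dequantize, quantize):
    """Backend-agnostic reorder: dequantize -> index_select -> re-quantize.

    `dequantize`/`quantize` stand in for the backend-specific hooks;
    the names are illustrative, not the actual Quanto/HQQ API.
    """
    full = dequantize(quantized)           # opaque backend object -> tensor
    full = full.index_select(0, beam_idx)  # same op DynamicLayer applies
    return quantize(full)                  # back to the backend format

# Demo with trivial symmetric int8 scale quantization as the "backend":
scale = 0.5
quantize = lambda t: (t / scale).round().to(torch.int8)
dequantize = lambda q: q.to(torch.float32) * scale

q = quantize(torch.tensor([[1.0], [2.0], [3.0]]))
q2 = reorder_quantized_buffer(q, torch.tensor([2, 0, 1]), dequantize, quantize)
print(dequantize(q2).squeeze(1).tolist())  # [3.0, 1.0, 2.0]
```

crop, batch_repeat_interleave, and batch_select_indices follow the same shape: dequantize, apply the tensor op DynamicLayer already uses, re-quantize.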

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gante @SunMarc

  • I confirm that this is not a pure code agent PR.

@SunMarc SunMarc requested a review from zucchini-nlp April 20, 2026 15:51
"""Returns the sequence length of the cached states."""
return self.cumulative_length

def reorder_cache(self, beam_idx: torch.LongTensor) -> None:
Copy link
Copy Markdown
Member


tbh haven't seen any requests to add beam search support with quantized cache previously, do you have a need for that or is this an improvement PR?

Member


If you need this feature, I don't mind adding it. But we need a test in tests/generation/test_utils.py. Otherwise, I'd prefer to not increase maintenance cost for things that are "nice to have but not used"
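A self-contained sketch of the kind of consistency check such a test could make for the crop path (toy scale quantization, not the real Quanto/HQQ backends, and not the actual helpers in tests/generation/test_utils.py):

```python
import torch

scale = 0.25
quantize = lambda t: (t / scale).round().to(torch.int8)
dequantize = lambda q: q.to(torch.float32) * scale

class ToyQuantizedLayer:
    """Toy layer with the fixed behaviour: crop touches both buffers."""

    def __init__(self, keys: torch.Tensor):
        self.keys = keys
        self._quantized_keys = quantize(keys)
        self.cumulative_length = keys.shape[-2]

    def crop(self, max_length: int) -> None:
        # Fixed behaviour: crop residual AND quantized buffer, and keep
        # cumulative_length in sync with the actual stored state.
        self.keys = self.keys[..., :max_length, :]
        full = dequantize(self._quantized_keys)[..., :max_length, :]
        self._quantized_keys = quantize(full)
        self.cumulative_length = max_length

layer = ToyQuantizedLayer(torch.zeros(1, 8, 4))
layer.crop(5)
assert layer.cumulative_length == layer.keys.shape[-2] == 5
assert layer._quantized_keys.shape[-2] == 5
print("buffers and cumulative_length agree after crop")
```

A real test would instead drive model.generate with beam search and a quantized cache and compare outputs against the dynamic cache.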

@GitGlimpse895 GitGlimpse895 force-pushed the fix/quantized-layer-cache-ops branch from 851a938 to 3a04d20 Compare April 21, 2026 16:06
@GitGlimpse895
Author

Hi @zucchini-nlp, thanks for the feedback!

You’re right—this was primarily an improvement PR and isn't a strict requirement for my current workflow. Since it falls into the "nice to have" category and I definitely understand the goal of keeping maintenance costs down, I’ll go ahead and close this one out.

Thanks again for your time and the review!

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45510&sha=3a04d2

