cache_utils: fix QuantizedLayer to correctly propagate reorder_cache, crop, and batch ops to quantized buffers #45510
Conversation
| """Returns the sequence length of the cached states.""" | ||
| return self.cumulative_length | ||
|
|
||
| def reorder_cache(self, beam_idx: torch.LongTensor) -> None: |
tbh haven't seen any requests to add beam search support with quantized cache previously, do you have a need for that or is this an improvement PR?
If you need this feature, I don't mind adding it. But we need a test in `tests/generation/test_utils.py`. Otherwise, I'd prefer not to increase maintenance cost for things that are "nice to have but not used".
The branch was force-pushed from 851a938 to 3a04d20.
Hi @zucchini-nlp, thanks for the feedback! You're right, this was primarily an improvement PR and isn't a strict requirement for my current workflow. Since it falls into the "nice to have" category and I definitely understand the goal of keeping maintenance costs down, I'll go ahead and close this one out. Thanks again for your time and the review!
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45510&sha=3a04d2
What does this PR do?
`QuantizedLayer` maintains two separate storage regions: a full-precision residual buffer (`self.keys`/`self.values`) and a quantized buffer (`self._quantized_keys`/`self._quantized_values`). However, the four mutation methods inherited from `DynamicLayer` (`reorder_cache`, `crop`, `batch_repeat_interleave`, and `batch_select_indices`) only operated on the residual buffer, silently leaving the quantized buffer untouched.
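For orientation, here is a minimal sketch of that two-buffer layout; the attribute names come from the PR description, while the class body itself is a simplified stand-in rather than the real implementation:

```python
class QuantizedLayerSketch:
    """Simplified stand-in for QuantizedLayer, not the actual class."""

    def __init__(self):
        # Full-precision residual buffer: the most recent key/value states,
        # kept as plain tensors until they are moved into the quantized store.
        self.keys = None
        self.values = None
        # Quantized buffer: opaque backend objects (quanto or HQQ) holding
        # the older, compressed part of the cache.
        self._quantized_keys = None
        self._quantized_values = None
        # Total number of cached tokens across both buffers.
        self.cumulative_length = 0
```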
Concrete failure modes:
- Beam search (`reorder_cache`): the quantized buffer stays in the original beam order while the residual buffer is reordered, causing crossed attention across beams with no error raised (a hypothetical repro follows this list).
- Cropping (`crop`): `cumulative_length` diverges from the actual stored state, corrupting subsequent `get_seq_length` calls.
- Batch ops (`batch_select_indices`, `batch_repeat_interleave`): the batch dimension of the quantized buffer is never updated, producing mismatched batch sizes between the two storage regions.
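A hypothetical repro for the first failure mode; the checkpoint and config values are placeholders, and it assumes a quantization backend such as quanto is installed and that generation runs long enough for states to reach the quantized buffer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The capital of France is", return_tensors="pt")

# Beam search calls reorder_cache on every decoding step. Before this fix,
# only the residual buffer followed beam_idx; the quantized buffer kept its
# original beam order, so beams silently attended to each other's past states.
out = model.generate(
    **inputs,
    num_beams=4,
    max_new_tokens=32,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tok.decode(out[0], skip_special_tokens=True))
```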
This PR overrides all four methods in `QuantizedLayer`. Since `_quantized_keys`/`_quantized_values` are opaque backend objects for both `QuantoQuantizedLayer` and `HQQQuantizedLayer`, the fix uses a dequantize → operate → re-quantize pattern, which is backend-agnostic and does not compound quantization error meaningfully beyond what is already introduced at the original quantization step.
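As a rough illustration of that pattern, here is a sketch of one of the four overrides; the `_dequantize`/`_quantize` helper names are assumptions, and the real backend hooks and signatures may differ:

```python
import torch

def reorder_cache(self, beam_idx: torch.LongTensor) -> None:
    """Reorder both storage regions along the beam dimension (sketch only)."""
    # The residual buffer holds plain tensors, so it can be indexed directly.
    if self.keys is not None and self.keys.numel() > 0:
        self.keys = self.keys.index_select(0, beam_idx.to(self.keys.device))
        self.values = self.values.index_select(0, beam_idx.to(self.values.device))
    # The quantized buffer is an opaque backend object: round-trip it through
    # full precision so the same indexing works for both quanto and HQQ.
    if self._quantized_keys is not None:
        keys = self._dequantize(self._quantized_keys)      # assumed helper
        values = self._dequantize(self._quantized_values)  # assumed helper
        keys = keys.index_select(0, beam_idx.to(keys.device))
        values = values.index_select(0, beam_idx.to(values.device))
        self._quantized_keys = self._quantize(keys)        # assumed helper
        self._quantized_values = self._quantize(values)
```

`crop` and the two batch ops would follow the same shape: dequantize, apply the corresponding `DynamicLayer` operation, then re-quantize.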
Fixes # (issue)
Before submitting

- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@gante @SunMarc