[cache] make all classes cache compatible finally #38635

Merged

zucchini-nlp merged 65 commits into huggingface:main from zucchini-nlp:cache-class-finalize on Jul 16, 2025

Conversation

@zucchini-nlp
Member

zucchini-nlp commented Jun 6, 2025

What does this PR do?

As per the title: let's get rid of the `_supports_cache_class`/`_supports_quantized_cache` flags. From now on we assume all models support a cache and initialize a `DynamicCache` (a model-specific cache in the case of mamba) by default.
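For illustration, a minimal sketch of what this means in practice (the checkpoint and prompt here are arbitrary; passing an explicit `DynamicCache` is equivalent to the new default):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The cache refactor", return_tensors="pt")

# No _supports_cache_class check anymore: generate() builds a DynamicCache
# by default, and passing one explicitly remains equivalent.
out = model.generate(**inputs, past_key_values=DynamicCache(), max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))
```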

For the static cache, we can't yet assume all models support it: even if a model can technically use `StaticCache`, it can't always be compiled with `fullgraph=True`. We have auto-compilation enabled for the static cache, so maybe the compilation should check for something like `_can_compile_fullgraph` instead of `_supports_static_cache`?
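A hedged sketch of the proposed gating (the attribute name is the suggestion above, not a finalized API):

```python
def should_auto_compile(model) -> bool:
    # Proposed: gate auto-compilation on fullgraph-compilability rather than
    # on declared StaticCache support, since supporting StaticCache does not
    # guarantee the model compiles with fullgraph=True.
    return getattr(model, "_can_compile_fullgraph", False)
```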

I checked that all models are updated and the generation tests are passing. Note that this PR depends on #38751, which cleans up `past_key_values` from non-generative models.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```python
beam_idx = self._flatten_beam_dim(running_beam_indices[..., cur_len - decoder_prompt_len])
# Models that still define their own `_reorder_cache` (e.g. rag, reformer)
# keep using it; everything else relies on the Cache class's reordering.
if hasattr(self, "_reorder_cache"):
    model_kwargs["past_key_values"] = self._reorder_cache(model_kwargs["past_key_values"], beam_idx)
```
Member Author


Models like rag and reformer have their own special cache-reordering logic, which I didn't remove. I don't think it's worth aligning these models with `past_key_values.reorder_cache` because they see pretty low usage.
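For context, a simplified sketch of what the standard cache reordering does during beam search (modeled loosely on `DynamicCache`, not on the rag/reformer special cases):

```python
import torch

def reorder_cache(key_cache, value_cache, beam_idx):
    # Keep only the cache rows of the surviving beams, per layer,
    # by reindexing along the batch dimension.
    for layer in range(len(key_cache)):
        key_cache[layer] = key_cache[layer].index_select(0, beam_idx)
        value_cache[layer] = value_cache[layer].index_select(0, beam_idx)
```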

@zucchini-nlp
Member Author

zucchini-nlp commented Jul 16, 2025

Merging, hopefully not so many tests fail after 🤞🏻

@zucchini-nlp zucchini-nlp merged commit c8524ae into huggingface:main Jul 16, 2025
25 checks passed
qubvel pushed a commit to qubvel/transformers that referenced this pull request Jul 16, 2025
* dump

* push other models

* fix simple greedy generation

* xmod

* add fmst and clean up some mentions of old cache format

* gpt-bigcode now follows standards

* delete tuple cache reference in generation

* fix some models

* fix some models

* fix mambas and support cache in tapas

* fix some more tests

* fix copies

* delete `_reorder_cache`

* another fix copies

* fix typos and delete unnecessary test

* fix rag generate, needs special cache reordering

* fix tapas and superglue

* reformer create special cache

* recurrent gemma `reorder_cache` was a no-op, delete

* fix-copies

* fix blio and musicgen pipeline tests

* fix reformer

* fix reformer, again...

* delete `_supports_cache_class`

* delete `supports_quantized_cache`

* fix failing tests

* fix copies

* some minor clean up

* style

* style

* fix copies

* fix tests

* fix copies

* create causal mask now needs positions?

* fixc copies

* style

* Update tests/test_modeling_common.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* clean-up of non-generative model after merging main

* check `is_decoder` for cache

* delete transpose for scores

* remove tuple cache from docs everywhere

* fix tests

* fix copies

* fix copies once more

* properly deprecate `encoder_attention_mask` in Bert-like models

* import `deprecate_kwarg` where needed

* fix copies again

* fix copies

* delete `nex_decoder_cache`

* fix copies asks to update for PLM

* fix copies

* rebasing had a few new models, fix them and merge asap!

* fix copies once more

* fix slow tests

* fix tests and updare PLM checkpoint

* add read token and revert accidentally removed line

* oh com -on, style

* just skip it, read token has no access to PLM yet

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request Jul 17, 2025
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request Jul 22, 2025
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jul 22, 2025
Due to huggingface/transformers#38635, several tests involving prefix tuning broke:

https://github.com/huggingface/peft/actions/runs/16417140904/job/46385751329

This PR fixes this by resolving two issues:

1. The `_supports_cache_class` attribute was removed; we can now assume that it is True if the attribute does not exist.

2. We had special handling of `past_key_values` for GPTBigCodeForCausalLM which is no longer required (nor valid) after that PR, so it is removed depending on the transformers version.
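The version-agnostic check described in point 1 can be sketched as follows (a hypothetical helper, not PEFT's actual code):

```python
def supports_cache_class(model) -> bool:
    # Post-#38635 the attribute no longer exists and all models are assumed
    # to support Cache classes, so a missing attribute defaults to True.
    return getattr(model, "_supports_cache_class", True)
```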
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Jul 22, 2025
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jul 28, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
githubnemo pushed a commit to githubnemo/transformers that referenced this pull request Oct 23, 2025
vasqu pushed a commit that referenced this pull request Nov 11, 2025
…stness (#41818)

* Implement gradient checkpointing in GPTBigCode

Support for gradient checkpointing was lost in the major refactoring in PR #38635
and this is an attempt to re-add it.

I extended the tests to
- test `use_reentrant=True` and `False`
- make sure `model.train` is called so that gradient checkpointing works;
  this is a limitation of the tests currently used by GPTBigCode
- make sure that one (the first) gradient checkpointing layer is called
- make sure that the same non-zero grads are there for normal and checkpointing
  runs - this is something we tripped over before in PEFT due to the possibly
  incompletely stored runtime environment in the checkpointed forward step,
  see also peft#2826

Note that the invocation of `GPTBigCodeBlock.forward` has changed:

- `layer_past` is now passed as a keyword argument so that
  `GradientCheckpointingLayer.__call__` can see and filter this parameter
  (`use_reentrant=False` fails otherwise)
- `{encoder_}hidden_states` are still passed as positional arguments
  so that `torch.utils.checkpoint.checkpoint` receives them as pos. args
  and computes gradients for these (kwargs would be filtered by
  `GradientCheckpointingLayer`).
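A minimal, self-contained illustration of the positional-vs-keyword distinction above (plain PyTorch, not the transformers `GradientCheckpointingLayer` itself):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)
hidden_states = torch.randn(2, 8, requires_grad=True)

# Positional tensor arguments flow through checkpoint() and receive
# gradients; non-tensor state such as layer_past must be filtered out of
# the checkpointed call, which is what the kwarg convention enables.
out = checkpoint(layer, hidden_states, use_reentrant=False)
out.sum().backward()
assert hidden_states.grad is not None
```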

* Improve gradient checkpointing tests

- Compare that the non-zero gradients in a reference run are present in the checkpointing run
- Make sure that the forward of at least one gradient checkpointing layer is actually called
  more than once (as expected during gradient checkpointing backward)

Currently there are some problems with Bert-derived MultipleChoice models: when dropout is
enabled, there are scenarios during gradient checkpointing where `classifier.bias.grad` is None.
I don't yet have a good explanation for this; disabling dropout resolves it. I would have
understood if it were dropout on the classification layer, but enabling attention dropout
also leads to this behavior.

MoE models have selective sparsity depending on the selected experts; for this reason we
only compare gradients on parameters collected on the reference backward run.
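A hedged sketch of that comparison strategy (assumed structure; dropout must be disabled so the two runs are deterministic, as noted above):

```python
import copy
import torch

def check_checkpointing_grads(model, inputs):
    ref = copy.deepcopy(model)
    ref.train()
    ref(**inputs).loss.backward()
    # Collect only non-zero reference grads (handles MoE selective sparsity).
    ref_grads = {n: p.grad for n, p in ref.named_parameters()
                 if p.grad is not None and p.grad.abs().sum() > 0}

    ckpt = copy.deepcopy(model)
    ckpt.train()
    ckpt.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": False})
    ckpt(**inputs).loss.backward()

    ckpt_params = dict(ckpt.named_parameters())
    for name, grad in ref_grads.items():
        torch.testing.assert_close(ckpt_params[name].grad, grad)
```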

* Remove duplicated gradient checkpointing code

* Address review comments

* Make test output consistent

* GradientCheckpointingLayer for xlstm, zamba, zamba2

* GradientCheckpointingLayer for swiftformer

also drop janus from the ignore list - only the VQVAE case is without
gradient checkpointing, and it is doubtful that it is useful in that
case. Training with gradient checkpointing is not tested anyway.

* Make an exception for CLVP

The implementation of GradientCheckpointingLayers is not trivial and may break behavior
that was previously expected. Therefore we keep it as-is for now.

* Remove unneeded exceptions

---------

Co-authored-by: nemo <git@ningu.net>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
Guy-Bilitski pushed a commit to Guy-Bilitski/UIOrthoLoRA that referenced this pull request Feb 5, 2026