[cache] make all classes cache compatible finally #38635

Merged

zucchini-nlp merged 65 commits into huggingface:main from zucchini-nlp:cache-class-finalize on Jul 16, 2025

Conversation

@zucchini-nlp
Member

zucchini-nlp commented Jun 6, 2025

What does this PR do?

As per the title: let's get rid of the `_supports_cache_class`/`_supports_quantized_cache` flags. From now on we assume all models support a cache and initialize a `DynamicCache` (a model-specific cache in the case of mamba) by default.
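For illustration, a minimal sketch of what this means in practice (the checkpoint and prompt here are arbitrary; passing an explicit `DynamicCache` is equivalent to the new default):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The cache refactor", return_tensors="pt")

# No _supports_cache_class check anymore: generate() builds a DynamicCache
# by default, and passing one explicitly remains equivalent.
out = model.generate(**inputs, past_key_values=DynamicCache(), max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))
```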

For the static cache, we can't yet assume all models support it: even if a model can technically use `StaticCache`, it can't always be compiled with `fullgraph=True`. We have auto-compilation enabled for the static cache, so maybe the compilation should check for something like `_can_compile_fullgraph` instead of `_supports_static_cache`?
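A hedged sketch of the proposed gating (the attribute name is the suggestion above, not a finalized API):

```python
def should_auto_compile(model) -> bool:
    # Proposed: gate auto-compilation on fullgraph-compilability rather than
    # on declared StaticCache support, since supporting StaticCache does not
    # guarantee the model compiles with fullgraph=True.
    return getattr(model, "_can_compile_fullgraph", False)
```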

I checked that all models are updated and the generation tests are passing. Note that this PR depends on #38751, which cleans up `past_key_values` from non-generative models.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```python
beam_idx = self._flatten_beam_dim(running_beam_indices[..., cur_len - decoder_prompt_len])
# Models that still define their own `_reorder_cache` (e.g. rag, reformer)
# keep using it; everything else relies on the Cache class's reordering.
if hasattr(self, "_reorder_cache"):
    model_kwargs["past_key_values"] = self._reorder_cache(model_kwargs["past_key_values"], beam_idx)
```
Member Author


Models like rag and reformer have their own special cache-reordering logic, which I didn't remove. I don't think it's worth aligning these models with `past_key_values.reorder_cache` because they see pretty low usage.
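For context, a simplified sketch of what the standard cache reordering does during beam search (modeled loosely on `DynamicCache`, not on the rag/reformer special cases):

```python
import torch

def reorder_cache(key_cache, value_cache, beam_idx):
    # Keep only the cache rows of the surviving beams, per layer,
    # by reindexing along the batch dimension.
    for layer in range(len(key_cache)):
        key_cache[layer] = key_cache[layer].index_select(0, beam_idx)
        value_cache[layer] = value_cache[layer].index_select(0, beam_idx)
```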

@zucchini-nlp
Member Author

zucchini-nlp commented Jul 16, 2025

Merging, hopefully not so many tests fail after 🤞🏻

@zucchini-nlp zucchini-nlp merged commit c8524ae into huggingface:main Jul 16, 2025
25 checks passed
qubvel pushed a commit to qubvel/transformers that referenced this pull request Jul 16, 2025
* dump

* push other models

* fix simple greedy generation

* xmod

* add fmst and clean up some mentions of old cache format

* gpt-bigcode now follows standards

* delete tuple cache reference in generation

* fix some models

* fix some models

* fix mambas and support cache in tapas

* fix some more tests

* fix copies

* delete `_reorder_cache`

* another fix copies

* fix typos and delete unnecessary test

* fix rag generate, needs special cache reordering

* fix tapas and superglue

* reformer create special cache

* recurrent gemma `reorder_cache` was a no-op, delete

* fix-copies

* fix blio and musicgen pipeline tests

* fix reformer

* fix reformer, again...

* delete `_supports_cache_class`

* delete `supports_quantized_cache`

* fix failing tests

* fix copies

* some minor clean up

* style

* style

* fix copies

* fix tests

* fix copies

* create causal mask now needs positions?

* fixc copies

* style

* Update tests/test_modeling_common.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* clean-up of non-generative model after merging main

* check `is_decoder` for cache

* delete transpose for scores

* remove tuple cache from docs everywhere

* fix tests

* fix copies

* fix copies once more

* properly deprecate `encoder_attention_mask` in Bert-like models

* import `deprecate_kwarg` where needed

* fix copies again

* fix copies

* delete `nex_decoder_cache`

* fix copies asks to update for PLM

* fix copies

* rebasing had a few new models, fix them and merge asap!

* fix copies once more

* fix slow tests

* fix tests and updare PLM checkpoint

* add read token and revert accidentally removed line

* oh com -on, style

* just skip it, read token has no access to PLM yet

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request Jul 17, 2025
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request Jul 22, 2025
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jul 22, 2025
Due to huggingface/transformers#38635, several tests involving prefix tuning broke:

https://github.com/huggingface/peft/actions/runs/16417140904/job/46385751329

This PR fixes this by resolving two issues:

1. The `_supports_cache_class` attribute was removed; we can now assume that it is True if the attribute does not exist.

2. We had special handling of `past_key_values` for GPTBigCodeForCausalLM which is no longer required (nor valid) after that PR, so it is removed depending on the transformers version.
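The version-agnostic check described in point 1 can be sketched as follows (a hypothetical helper, not PEFT's actual code):

```python
def supports_cache_class(model) -> bool:
    # Post-#38635 the attribute no longer exists and all models are assumed
    # to support Cache classes, so a missing attribute defaults to True.
    return getattr(model, "_supports_cache_class", True)
```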
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Jul 22, 2025
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jul 28, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
githubnemo pushed a commit to githubnemo/transformers that referenced this pull request Oct 23, 2025
vasqu pushed a commit that referenced this pull request Nov 11, 2025
…stness (#41818)

* Implement gradient checkpointing in GPTBigCode

Support for gradient checkpointing was lost in the major refactoring in PR #38635
and this is an attempt to re-add it.

I extended the tests to
- test `use_reentrant=True` and `False`
- make sure `model.train` is called so that gradient checkpointing works;
  this is a limitation of the tests currently used by GPTBigCode
- make sure that one (the first) gradient checkpointing layer is called
- make sure that the same non-zero grads are there for normal and checkpointing
  runs - this is something we tripped over before in PEFT due to the possibly
  incompletely stored runtime environment in the checkpointed forward step,
  see also peft#2826

Note that the invocation of `GPTBigCodeBlock.forward` has changed:

- `layer_past` is now passed as a keyword argument so that
  `GradientCheckpointingLayer.__call__` can see and filter this parameter
  (`use_reentrant=False` fails otherwise)
- `{encoder_}hidden_states` are still passed as positional arguments
  so that `torch.utils.checkpoint.checkpoint` receives them as pos. args
  and computes gradients for these (kwargs would be filtered by
  `GradientCheckpointingLayer`).
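A minimal, self-contained illustration of the positional-vs-keyword distinction above (plain PyTorch, not the transformers `GradientCheckpointingLayer` itself):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)
hidden_states = torch.randn(2, 8, requires_grad=True)

# Positional tensor arguments flow through checkpoint() and receive
# gradients; non-tensor state such as layer_past must be filtered out of
# the checkpointed call, which is what the kwarg convention enables.
out = checkpoint(layer, hidden_states, use_reentrant=False)
out.sum().backward()
assert hidden_states.grad is not None
```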

* Improve gradient checkpointing tests

- Compare that the non-zero gradients in a reference run are present in the checkpointing run
- Make sure that the forward of at least one gradient checkpointing layer is actually called
  more than once (as expected during gradient checkpointing backward)

Currently there are some problems with Bert-derived MultipleChoice models: when dropout is
enabled, there are scenarios during gradient checkpointing where `classifier.bias.grad` is None.
I don't yet have a good explanation for this; disabling dropout resolves it. I would have
understood if it were dropout on the classification layer, but enabling attention dropout
also leads to this behavior.

MoE models have selective sparsity depending on the selected experts; for this reason we
only compare gradients on parameters collected on the reference backward run.
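A hedged sketch of that comparison strategy (assumed structure; dropout must be disabled so the two runs are deterministic, as noted above):

```python
import copy
import torch

def check_checkpointing_grads(model, inputs):
    ref = copy.deepcopy(model)
    ref.train()
    ref(**inputs).loss.backward()
    # Collect only non-zero reference grads (handles MoE selective sparsity).
    ref_grads = {n: p.grad for n, p in ref.named_parameters()
                 if p.grad is not None and p.grad.abs().sum() > 0}

    ckpt = copy.deepcopy(model)
    ckpt.train()
    ckpt.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": False})
    ckpt(**inputs).loss.backward()

    ckpt_params = dict(ckpt.named_parameters())
    for name, grad in ref_grads.items():
        torch.testing.assert_close(ckpt_params[name].grad, grad)
```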

* Remove duplicated gradient checkpointing code

* Address review comments

* Make test output consistent

* GradientCheckpointingLayer for xlstm, zamba, zamba2

* GradientCheckpointingLayer for swiftformer

also drop janus from the ignore list - only the VQVAE case is without
gradient checkpointing, and it is doubtful that it is useful in that
case. Training with gradient checkpointing is not tested anyway.

* Make an exception for CLVP

The implementation of GradientCheckpointingLayers is not trivial and may break behavior
that was previously expected. Therefore we keep it as-is for now.

* Remove unneeded exceptions

---------

Co-authored-by: nemo <git@ningu.net>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
Guy-Bilitski pushed a commit to Guy-Bilitski/UIOrthoLoRA that referenced this pull request Feb 5, 2026