
Apply GradientCheckpointingLayer to the whole repo #38913

Merged
Cyrilvallez merged 148 commits into huggingface:main from qubvel:gradient-checkpointing-layer-propagation
Jun 23, 2025

Conversation

@qubvel
Contributor

@qubvel qubvel commented Jun 19, 2025

What does this PR do?

Apply GradientCheckpointingLayer to the remaining models in the repository.

Most of the PR applies the same set of changes to every model (see the sketch after this list):

  1. Add an import for GradientCheckpointingLayer
  2. Inherit the *Layer modules from GradientCheckpointingLayer
  3. Remove the if/else path for gradient checkpointing, keeping only the else branch
    3a) Some changes were required to make sure all tensors with gradients are passed as positional arguments.
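
For reference, a minimal sketch of what the per-model edit looks like after the change. This is a toy layer, not any real model, and the import path is an assumption based on where GradientCheckpointingLayer currently lives:

```python
import torch
from torch import nn
from transformers.modeling_layers import GradientCheckpointingLayer  # step 1: assumed import path


# Step 2: the decoder layer inherits from GradientCheckpointingLayer instead of nn.Module.
class ToyDecoderLayer(GradientCheckpointingLayer):
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask=None, output_attentions=False):
        return (self.proj(hidden_states),)


# Step 3: the caller keeps only the plain call; the old
# `if self.gradient_checkpointing and self.training:` branch is gone, because the layer
# checkpoints itself when the flag is set (normally via model.gradient_checkpointing_enable()).
layer = ToyDecoderLayer()
hidden_states = torch.randn(2, 4, 32, requires_grad=True)
(out,) = layer(hidden_states)  # step 3a: grad-carrying tensors are passed positionally
out.sum().backward()
```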

Additionally, GradientCheckpointingLayer itself was modified slightly: I added handling for use_cache and past_key_values inside the layer so they are disabled when gradient checkpointing is enabled.

We still have to keep some redundant code, though:

Case 1.

```python
        if self.gradient_checkpointing and self.training and use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
            )
            use_cache = False
```

because, later, most models rely on the use_cache flag as follows:

```python
            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]
```

If use_cache is handled only inside GradientCheckpointingLayer and not also reset in the outer module, the indexing above raises an IndexError.

Case 2.

In some cases the layer's parameter order does not allow past_key_values to be handled as a kwarg, e.g. GPT2:

```python
            outputs = block(
                hidden_states,
                past_key_values if not (self.gradient_checkpointing and self.training) else None,
                cache_position,
                causal_mask,
                head_mask[i],
                encoder_hidden_states,  # as a positional argument for gradient checkpointing
                encoder_attention_mask=encoder_attention_mask,
                use_cache=use_cache,
                output_attentions=output_attentions,
                **kwargs,
            )
```

We have to pass all parameters up to encoder_hidden_states as positional args (tensors that require grads have to be passed that way), so past_key_values is also passed as a positional argument and resolved manually.
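
To make the positional-argument constraint concrete, here is a minimal sketch with a toy class that approximates the idea (it is not the actual GradientCheckpointingLayer implementation): keyword arguments are frozen into a functools.partial, so only the positional arguments become inputs of torch.utils.checkpoint, which is why grad-carrying tensors such as encoder_hidden_states have to be passed positionally.

```python
import torch
from functools import partial
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedLayer(nn.Module):
    # Toy approximation: kwargs are bound into a partial, so only *args become
    # inputs of the checkpoint function. Tensors hidden in kwargs are not
    # checkpoint inputs, hence the positional-argument requirement in this PR.
    gradient_checkpointing = False

    def __call__(self, *args, **kwargs):
        if self.gradient_checkpointing and self.training:
            return checkpoint(partial(super().__call__, **kwargs), *args, use_reentrant=True)
        return super().__call__(*args, **kwargs)


class ToyBlock(CheckpointedLayer):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    def forward(self, hidden_states, encoder_hidden_states=None, use_cache=False):
        out = self.proj(hidden_states)
        if encoder_hidden_states is not None:
            out = out + encoder_hidden_states
        return out


block = ToyBlock()
block.gradient_checkpointing = True
block.train()

hidden_states = torch.randn(2, 8, requires_grad=True)
encoder_hidden_states = torch.randn(2, 8, requires_grad=True)

# encoder_hidden_states is passed positionally, so it is an explicit checkpoint
# input and reliably receives a gradient during the recomputation.
out = block(hidden_states, encoder_hidden_states, use_cache=False)
out.sum().backward()
print(hidden_states.grad is not None, encoder_hidden_states.grad is not None)  # True True
```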

Alternatively, we could refactor the layers' parameter order, but that would be a breaking change. It affects only a few models, mostly older ones.

Unsupported models

There are also a couple of exceptions where GradientCheckpointingLayer does not work. I tried to fix them, but I didn't go too far and just kept the original code:

  • zamba / zamba2
  • mllama

cc @ArthurZucker @Cyrilvallez

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qubvel qubvel marked this pull request as ready for review June 23, 2025 10:26
Comment on lines +52 to +81
```python
do_warn = False
layer_name = self.__class__.__name__
message = f"Caching is incompatible with gradient checkpointing in {layer_name}. Setting"

if "use_cache" in kwargs and kwargs["use_cache"]:
    kwargs["use_cache"] = False
    message += " `use_cache=False`,"
    do_warn = True

# different names for the same thing in different layers
if "past_key_value" in kwargs and kwargs["past_key_value"] is not None:
    kwargs["past_key_value"] = None
    message += " `past_key_value=None`,"
    do_warn = True

if "past_key_values" in kwargs and kwargs["past_key_values"] is not None:
    kwargs["past_key_values"] = None
    message += " `past_key_values=None`,"
    do_warn = True

if "layer_past" in kwargs and kwargs["layer_past"] is not None:
    kwargs["layer_past"] = None
    message += " `layer_past=None`,"
    do_warn = True

# warn if anything was changed
if do_warn:
    message = message.rstrip(",") + "."
    logger.warning(message)
```

Contributor Author


update for GradientCheckpointingLayer

@qubvel qubvel requested a review from Cyrilvallez June 23, 2025 10:37
Member

@Cyrilvallez Cyrilvallez left a comment


Big big PR, and super welcome! 🚀🤗 Can we add a common test for gradient checkpointing though? I see we don't have one yet (only in trainer) - just instantiating a small model and running a single forward pass with gradient checkpointing to make sure it runs correctly would be super nice
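
A minimal version of such a common test might look like the sketch below (Llama is picked only as an example and the tiny sizes are arbitrary; a real common test would be parametrized over all model classes):

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM


def test_forward_backward_with_gradient_checkpointing():
    # Tiny config so the test stays fast; the sizes are arbitrary.
    config = LlamaConfig(
        vocab_size=128,
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        num_key_value_heads=4,
    )
    model = LlamaForCausalLM(config)
    model.gradient_checkpointing_enable()
    model.train()

    input_ids = torch.randint(0, config.vocab_size, (2, 8))
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()  # should recompute activations without raising

    assert torch.isfinite(loss)
```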

Member

@Cyrilvallez Cyrilvallez left a comment


Alright, I did not see the test_training_gradient_checkpointing... before, my bad! All good then! Let's merge! 🤗

@Cyrilvallez Cyrilvallez merged commit 84d19be into huggingface:main Jun 23, 2025
18 of 20 checks passed
Collaborator

@ArthurZucker ArthurZucker left a comment


Super nice! Thanks

High5Apps added a commit to High5Apps/mrt5 that referenced this pull request Dec 4, 2025
- delete_gate_mask is a tensor with `requires_grad=True`, so it must be passed as a positional arg to work with gradient checkpointing, according to this PR
  - huggingface/transformers#38913
- Without this change, running with `batch_size_multiplier=8,` or `gradient_checkpointing=True` would cause the following error:
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```