[core] 🚨 Completely remove cache positions #44181
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
81f8086 to c2fde0a
vasqu
left a comment
Since I'm on the train it's a bit hard to review, but IIUC we're gradually removing cache positions and focusing on a subset of important models for now.
Can you run slow for bert/bart? Just me being a bit too anxious about these.
```python
# Since StaticSlidingWindow have dynamic control flow that cannot be avoided, we have to replace them here by
# simple StaticLayer... It means that any generation beyond the window is unfortunately unsupported
for i, layer in enumerate(self.static_cache.layers):
    if isinstance(layer, StaticSlidingWindowLayer):
        self.static_cache.layers[i] = StaticLayer(layer.max_cache_len)
```
Do we add something in the docs to clarify this? It seems like something that would be hard for outsiders to catch, and I don't think we will remove this limitation any time soon 😓
We could, but note that this is ALREADY the case on main, and it looks like nobody noticed/raised an issue... It would generate garbage beyond the sliding window; it's just explicit in the code now!
To solve it, we could either have a minimal version of the sliding cache for export, which would work ONLY for decoding (i.e. 1 new token at a time, without being able to feed more than 1 token after prefill), or we could just use the full cache with proper sliding masking, but we lose some compute.
Yea, still think it would be nice to clarify somewhere. True, it's also broken in main, maybe out of scope for this PR
```diff
 @abstractmethod
 def update(
-    self, key_states: torch.Tensor, value_states: torch.Tensor, cache_kwargs: dict[str, Any] | None = None
+    self, key_states: torch.Tensor, value_states: torch.Tensor, *args, **kwargs
```
Do we really need the args/kwargs signature here? Does kwargs suffice to be BC?
Just not a fan of the *args tbh
Yes, we need it - a lot of models unfortunately pass `cache_kwargs` as a positional arg, and others as a keyword like `cache_kwargs=cache_kwargs` 🥲
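To make the BC constraint concrete, here is a minimal sketch (hypothetical `DummyLayer`, not the actual cache layer implementation) of why `*args, **kwargs` is needed to absorb both call styles:

```python
import torch

# Hypothetical layer: the signature must tolerate cache_kwargs arriving either positionally or by keyword.
class DummyLayer:
    def update(self, key_states: torch.Tensor, value_states: torch.Tensor, *args, **kwargs):
        # Normalize: take the first positional extra if present, otherwise look for the keyword.
        cache_kwargs = args[0] if args else kwargs.get("cache_kwargs")
        return key_states, value_states, cache_kwargs

k = v = torch.zeros(1, 2, 3, 4)
layer = DummyLayer()
layer.update(k, v, {"some_key": 1})                # positional style used by some models
layer.update(k, v, cache_kwargs={"some_key": 1})   # keyword style used by others
```

With `**kwargs` alone in the signature, the first (positional) call would raise a `TypeError`, which is exactly the BC break being avoided.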
```python
# It can either be an int for dynamic layers, or a tensor for static layers
if isinstance(self.cumulative_length, int):
    self.cumulative_length = 0
else:
    self.cumulative_length.zero_()
```
Could we default to a tensor instead, or is that too bothersome?
Or python ints always, not sure about the impacts.
More of a nit tbh
It's easier with ints for dynamic caches, and static ones REALLY need tensors for compile with cudagraphs!
what about dynamic + compile 🙃
Dynamic layers will never be able to use cudagraphs anyway as the kv tensors will change shape dynamically! However, I have strong hopes that we will be able to make them compile-compatible with dynamic=True/None and mode="default"!
```python
def reset(self):
    super().reset()
    self.cumulative_length_int = 0
```
The _int suffix is a bit weird, I know you mean to make it explicit but then it won't match with a lot of the other layer types.
It's because this LayerCache has both: the base `cumulative_length` is a Tensor (to be able to use cudagraphs in the regime where the cache is not yet full), and `cumulative_length_int` is the equivalent as a python int, to avoid data-dependent branching (see the sketch below).
Ah gotcha, it didn't really come through in the diff for me, but makes sense
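For illustration, a hedged sketch of the dual-counter idea (assumed names and simplified logic, not the actual transformers code):

```python
import torch

class SlidingLayerSketch:
    """Keeps a tensor counter for cudagraph-safe arithmetic and an int twin for python control flow."""

    def __init__(self, max_cache_len: int):
        self.max_cache_len = max_cache_len
        self.cumulative_length = torch.tensor(0, dtype=torch.long)  # used inside compiled/captured regions
        self.cumulative_length_int = 0                              # used for branching, never read from the tensor

    def update_lengths(self, num_new_tokens: int) -> None:
        # Both counters advance together; only the int is ever used in `if` statements.
        self.cumulative_length += num_new_tokens
        self.cumulative_length_int += num_new_tokens

    def is_full(self) -> bool:
        # Branching on the python int avoids data-dependent control flow on tensor data.
        return self.cumulative_length_int >= self.max_cache_len

    def reset(self) -> None:
        self.cumulative_length.zero_()
        self.cumulative_length_int = 0
```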
```diff
     attention_mask: torch.Tensor | None,
-    cache_position: torch.Tensor,
+    cache_position: torch.Tensor | None = None,  # not used anymore but kept for BC
+    *,
```
Don't see a reason to add the `*,`?
This happens elsewhere too, keeping the comment here only.
It's because now we want `cache_position` to be optional, but the next argument `past_key_values` was not optional, and we don't really want to make it optional (args with a default value cannot come before args without one 🥲) - the `*,` makes it explicit that `past_key_values` now has to be passed as a keyword argument.
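A minimal illustration of the Python rule at play (hypothetical `create_mask_sketch`, simplified from the real mask helpers):

```python
# Invalid: a parameter with a default cannot precede one without a default.
# def create_mask_sketch(attention_mask, cache_position=None, past_key_values):  # SyntaxError
#     ...

# Valid: make everything after the `*` keyword-only instead of giving past_key_values a default.
def create_mask_sketch(attention_mask, cache_position=None, *, past_key_values):
    return attention_mask, cache_position, past_key_values

# Callers now have to name it explicitly:
create_mask_sketch(None, past_key_values={})
```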
```python
embeds = encoder_hidden_states if encoder_hidden_states is not None else inputs_embeds
batch_size, dtype, device = embeds.shape[0], embeds.dtype, embeds.device
```
I don't think we need this ternary at all anymore? Both embeds should have the same factory data, e.g. device, dtype, batch
Probably not; I wasn't entirely sure, as sometimes with device_map they can end up on different devices - let's keep it for now and see after the PR is merged if we can remove it!
I love it, feels good to see
run-slow: bert, bart
This comment contains models: ["models/bart", "models/bert"]
Can we also update/delete docs where cache position is mentioned, such as https://huggingface.co/docs/transformers/v5.2.0/en/cache_explanation#cache-position?
Updated audioflamingo, thanks for the heads-up @zucchini-nlp! For the doc, I think it's best to wait for more models to remove them before deleting!
ArthurZucker
left a comment
I like this, but let's update the PR description so the community understands why we are doing this breaking change (it is kinda breaking for the mask API). Let's add our strong motivations please!
Also, we could deprecate without breaking for some of the changes.
```python
else:
    # Note: very important to use the tensor version of the cumulative length here, as otherwise cudagraphs
    # (triggered by mode="reduced_overhead") will lead to random crashes, as the int would be overwritten
    cache_position = torch.arange(kv_length, device=self.device) + self.cumulative_length
```
It's annoying to me that we have to allocate new memory all the time here... 😿
It's basically free - it's a tensor of 1 element 99% of the time, and even when it's not it's always super small!
| """ | ||
| # The masks for eager attention are simply boolean mask from sdpa, casted to 0 and -inf | ||
| _ = kwargs.pop("allow_is_causal_skip", None) | ||
| _ = kwargs.pop("allow_torch_fix", None) |
Should we just deprecate it (cache_position as an arg) for 2 releases at least?
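For reference, a possible soft-deprecation pattern (a sketch only - this is not what the PR currently does, and the function name is just an example):

```python
import warnings

def eager_mask_sketch(*args, cache_position=None, **kwargs):
    # Accept the old argument but warn that it is ignored and will disappear.
    if cache_position is not None:
        warnings.warn(
            "`cache_position` is ignored and will be removed in a future release.",
            FutureWarning,
        )
    ...
```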
```python
def update_conv_state(
    self, layer_idx: int, new_conv_state: torch.Tensor, cache_position: torch.LongTensor
) -> torch.Tensor:
    conv_state = self.conv_states[layer_idx]
    cache_position = cache_position.clamp(0, self.conv_kernel_size - 1)

    conv_state = conv_state.roll(shifts=-1, dims=-1)
    conv_state[:, :, cache_position] = new_conv_state.to(conv_state.device)
    self.conv_states[layer_idx].zero_()
    self.conv_states[layer_idx] += conv_state
    return self.conv_states[layer_idx]
```
this is a good catch but unrelated no?
Yes, fully unrelated, but I stumbled upon it and it's never used anywhere...
[For maintainers] Suggested jobs to run (before merge) run-slow: afmoe, apertus, arcee, aria, audioflamingo3, bamba, bitnet, cohere, cohere2, csm, cwm, deepseek_v2, deepseek_v3, dia, diffllama
What does this PR do?

As per the title! Follow-up of #44130 and #44226.

Finally remove `cache_position` everywhere (not ALL models, but all the recent/most important models that use modular). This means we still create them and pass them around in `generate`, but they are ignored by the models. I will gradually remove them from all models, and remove them from `generate` once this is done.

This PR basically tweaks the logic of `cache_utils.py` and `masking_utils.py` to not rely on `cache_position`, and then removes the argument from the models' `forward`s (not a regression, as they are absorbed by the `**kwargs` for all those models).

Motivation

`cache_position` is basically not needed, as our Cache classes and masking primitives already contain all the necessary information. We can simply recreate it quickly in the StaticCache when needed. Removing it will make all modeling files much easier to read and understand. Indeed, when reading modeling files people are often confused by `cache_position`, which looks like a mere 1D version of `position_ids` (and thus seems fully redundant) - even though that's not exactly the case, since `cache_position` doesn't take padding into account. It will also make input preparation in `generate` much easier once it is removed from all models.
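For intuition, here is a hedged sketch of how positions can be recreated from the cache state when needed (names taken from the hunk discussed above; the real logic lives in `cache_utils.py` and may differ):

```python
import torch

def recreate_cache_position(cumulative_length: torch.Tensor, kv_length: int, device: torch.device) -> torch.Tensor:
    # Positions of the incoming tokens are just an offset arange over the tokens already cached.
    return torch.arange(kv_length, device=device) + cumulative_length

# e.g. after 10 cached tokens, a single decoding step gets tensor([10])
print(recreate_cache_position(torch.tensor(10), 1, torch.device("cpu")))
```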
Compile compatibility

I made EXTRA SURE that we do not have any regressions in terms of compile compatibility: we have exactly the same scope of compatibility as before, i.e. full compatibility (fullgraph=True, dynamic=False, cudagraphs) for StaticLayer without any recompiles, and (fullgraph=True, cudagraphs) for StaticSlidingWindowLayer (it can work with fullgraph=False, but will recompile every iteration - otherwise it recompiles only once to make the internal int a dynamic SymInt, and once again if it changes cache regime, i.e. the cache becomes full, etc.). There is no way around that, as StaticSlidingWindowLayer has data-dependent control flow that is unavoidable. It's exactly the same currently on main.
Breaking changes

This PR is not really breaking. The 🚨 marker is only for the following detail:

The only breaking change in this PR is in the `masking_utils` API, as `cache_position` becomes optional everywhere in the `create_xxx_mask` functions (it is not used anymore), and thus `past_key_values` needs to be passed as a kwarg now due to the position of the args in the signature, since we cannot have an arg without a default value following an arg with a default value (this was already the case everywhere in Transformers).

The internal functions `sdpa_mask`, `eager_mask`, etc. also move from having `cache_position` in their signature to having `q_length` and `q_offset` instead. Those should be private anyway, but you never know.
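Hedged illustration of the call-site change for the public `create_xxx_mask` functions (argument list abbreviated and assumed, check `masking_utils.py` for the exact signatures):

```python
# Before this PR: cache_position was a required positional argument.
#   mask = create_causal_mask(config, input_embeds, attention_mask, cache_position, past_key_values)

# After this PR: cache_position is optional (and ignored), so past_key_values must be passed by name.
#   mask = create_causal_mask(config, input_embeds, attention_mask, past_key_values=past_key_values)
```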
Review pointers

The easiest way to review this PR is to only look at the very few files that are not modeling files, especially `cache_utils.py` and `masking_utils.py`. Then, all modeling files are basically the same: just remove the arg `cache_position` everywhere (they are absorbed in the `**kwargs` so there's no regression, they can still be externally passed, and they are still technically passed by `generate`, just not used).