
🔴 [Attention] Attention refactor for Whisper-based models #38235

Merged
vasqu merged 36 commits into main from vas-whisper-attn-refactor
May 28, 2025

Conversation

@vasqu (Contributor) commented May 20, 2025

Whisper attention refactor according to the same strategies applied in #38108
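For context, a minimal sketch (not the PR diff itself) of the attention-interface pattern that #38108 introduced and this PR applies to Whisper: the model keeps a plain eager path, and every other backend (sdpa, flash_attention_2, flex_attention) is resolved through the shared registry at call time.

import torch
from torch import nn
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # query/key/value: (batch, heads, seq, head_dim)
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        # the mask is built upstream; slice it to this call's key length
        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights

# inside the attention module's forward:
# attention_interface = eager_attention_forward
# if self.config._attn_implementation != "eager":
#     attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]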

Also, several fixes on Whisper along the way, reducing the number of failing tests to 3 (disregarding skipped tests):

  • test_small_longform_timestamps_generation
  • test_tiny_token_timestamp_batch_generation
  • test_whisper_longform_multi_batch_hard_prev_cond

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu marked this pull request as ready for review May 21, 2025 12:53
github-actions bot requested review from ArthurZucker and eustlb May 21, 2025 12:54
@vasqu requested a review from gante May 21, 2025 12:55
Comment thread src/transformers/cache_utils.py
Comment thread src/transformers/generation/logits_process.py
Comment thread src/transformers/generation/utils.py Outdated
@vasqu requested a review from gante May 22, 2025 16:03
@vasqu (Contributor, Author) commented May 22, 2025

Ready for another round of reviews imo

@ArthurZucker (Collaborator) left a comment

Very nice!
What are the compilation issues with flex? Pretty sure updating to the new causal mask would solve it!

Comment on lines +601 to +603
# Copied from transformers.models.bart.modeling_bart.BartPreTrainedModel._update_causal_mask
def _update_causal_mask(
self,
Collaborator

I will be noisy but let's use the new cache creation no? 👀

@vasqu (Contributor, Author) commented May 23, 2025

It's introducing even more issues (torchscript, plus another integration test then starts failing - test_tiny_static_generation_long_form). Would leave it for now if that's ok. Don't want to make whisper even more broken tbh, wdyt?

cc @Cyrilvallez for the new mask creation, tried replacing it with

# previously
#causal_mask = self._update_causal_mask(
#    attention_mask,
#    inputs_embeds,
#    cache_position,
#    past_key_values.self_attention_cache if past_key_values is not None else None,
#)

causal_mask = create_causal_mask(
    config=self.config,
    input_embeds=inputs_embeds,
    attention_mask=attention_mask,
    cache_position=cache_position,
    past_key_values=past_key_values.self_attention_cache if past_key_values is not None else None,
)
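(Context: create_causal_mask comes from transformers.masking_utils; it builds the mask in whatever format config._attn_implementation requires, which is why the single call above can serve eager, sdpa, and flex alike.)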

@Cyrilvallez (Member) commented May 23, 2025

torchscript with a mask is a known issue, but it's not super important anyway - for the other one, if it's compile-related it will probably be fixed in #38319 - TLDR I had introduced a workaround for Python<3.11 on torch.export, but it's also an issue with compile and fullgraph=True for those same python versions (which was obvious as they use the same dynamo tracing, but I missed it 🙃), so made the workaround the default

@vasqu (Contributor, Author) commented May 23, 2025

I'll try to apply the fix locally and check if it works but sounds good :)

For torchscript, I would skip those tests and add a todo for you (if we try to make it work again). Agree that it's not the most important feature. Wdyt?

@vasqu (Contributor, Author) commented May 23, 2025

Ok, it seems the failures are unrelated to the compile issues. Encoder-decoder compilation seems to rely on the mask functions living under PretrainedXXX, which makes the integration test fail. This needs a closer look tbh, will leave addressing the new masking to a future PR.

cc @gante @zucchini-nlp if you know anything about the encoder-decoder caches relying on the functions under PretrainedXXX

@Cyrilvallez (Member) commented May 26, 2025

A bit late to the party sorry, but if you use the new mask API you should get rid of _prepare_4d_causal_attention_mask_with_cache_position entirely everywhere, otherwise generate will use it instead of the new create_masks_for_generate! Just check a bit further here 🤗

Member

(Otherwise generate will not create the correct mask with flex, or custom attention, and you're mixing new and old API which is not good)

Member

You may need to overwrite the general one though, to account for the EncoderDecoderCache, but it can be done super easily as in Gemma3 for example, where we need to overwrite to account for the additional mask for the image tokens in training

Member

LMK on slack if you cannot make it work, but TLDR we should not mix the old/new mask APIs, and the new one will be more general as flex will work correctly!

Member

def get_mask_sizes(self, cache_position: torch.Tensor, layer_idx: int) -> tuple[int, int]:
    """
    Return a tuple (kv_length, kv_offset) corresponding to the length and offset that will be returned for
    the given layer at `layer_idx`.
    The masks are then prepared according to the given lengths (kv_length, kv_offset) and patterns
    (i.e. sliding_window, chunk_size) for each layer.
    """
    return self.self_attention_cache.get_mask_sizes(cache_position, layer_idx)

Adding this in EncoderDecoderCache is probably even better/cleaner, as you don't need to overwrite create_masks_for_generate, and it will always work independently of the type of Cache being used
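A minimal usage sketch of that delegation, assuming the method above is added to EncoderDecoderCache:

import torch
from transformers.cache_utils import DynamicCache, EncoderDecoderCache

cache = EncoderDecoderCache(DynamicCache(), DynamicCache())
cache_position = torch.arange(4)  # positions of the current query tokens
# mask utilities can now query the composite cache exactly like its
# self-attention sub-cache, whatever concrete Cache type is inside
kv_length, kv_offset = cache.get_mask_sizes(cache_position, layer_idx=0)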

@vasqu (Contributor, Author) commented May 23, 2025

Investigating what's happening with flex attention: Something weird is going on 👀

Edit: Found the issue, gonna open another PR since it affects more models

@vasqu (Contributor, Author) commented May 23, 2025

Found the root cause behind flex attention failing, will open another PR for this which should be merged before this PR. See #38321

@Cyrilvallez (Member) commented

> Found the root cause behind flex attention failing, will open another PR for this which should be merged before this PR.

If you're talking about compilation with flex, it's a known issue as well, as flex auto-compiles itself it seems to interfere when the forward is compiled as well 🥲 Super nice if you found a workaround!

@vasqu (Contributor, Author) commented May 23, 2025

> If you're talking about compilation with flex, it's a known issue as well, as flex auto-compiles itself it seems to interfere when the forward is compiled as well 🥲 Super nice if you found a workaround!

Ah, no I think I'm talking about something different. We have one flex attention test and it was failing for like 99% of the models. Fixing this in another PR. Iiuc, it's basically an issue of torch compile compiling twice, as we compile forward and flex tries to compile again - that's indeed not nice 👀

@gante (Contributor) left a comment

One more nit

(whisper is fun :D)

input_embeds=inputs_embeds,
attention_mask=attention_mask,
cache_position=cache_position,
past_key_values=past_key_values.self_attention_cache if past_key_values is not None else None,
@gante (Contributor) commented May 23, 2025

do we need to type check this statement and add more logic?

if we follow the nesting from WhisperForCausalLM (-> WhisperDecoderWrapper -> WhisperDecoder [this class]), then past_key_values can also be a decoder-only cache and past_key_values.self_attention_cache is fail-prone

this also means:

  • cache initialization above is incomplete
  • the docstring for past_key_values is imprecise

☠️
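A minimal sketch of the guard this points at (the helper name _self_attn_cache is hypothetical, not from the PR):

from transformers.cache_utils import EncoderDecoderCache

def _self_attn_cache(past_key_values):
    # only unwrap when we actually hold the composite cache; a decoder-only
    # Cache (or None) passes through untouched
    if isinstance(past_key_values, EncoderDecoderCache):
        return past_key_values.self_attention_cache
    return past_key_values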

@vasqu (Contributor, Author) commented

Yea, fair point lemme check what's even happening here 👀

@vasqu (Contributor, Author) commented May 23, 2025

Ok, so it seems to me that we create an EncoderDecoderCache even if we use the decoder-only model. And it's not only the case for whisper, but for any encoder-decoder model that has a decoder-only flavor (with cache class support).

In short, this handles our cases:

if use_cache or past_key_values is not None:
    if isinstance(past_key_values, Cache) and not isinstance(past_key_values, EncoderDecoderCache):
        return_self_attention_cache = True
        past_key_values = EncoderDecoderCache(past_key_values, DynamicCache())
    elif not isinstance(past_key_values, EncoderDecoderCache):
        return_legacy_cache = True
        logger.warning_once(
            "Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. "
            "You should pass an instance of `EncoderDecoderCache` instead, e.g. "
            "`past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`."
        )
        past_key_values = EncoderDecoderCache.from_legacy_cache(past_key_values)
  • Decoder-only cache is passed --> internally we use an encoder-decoder cache --> return decoder-only again after everything
  • Defaulting to encoder-decoder in any other case

We could imo add to the docs that decoder-only is possible? Not necessarily a fan of this tbh, but I can see it.
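To illustrate the round trip described above (a sketch, not the modeling code itself):

from transformers.cache_utils import DynamicCache, EncoderDecoderCache

decoder_only = DynamicCache()                                # what the caller passed in
wrapped = EncoderDecoderCache(decoder_only, DynamicCache())  # what the trunk uses internally
returned = wrapped.self_attention_cache                      # what is handed back at the end
assert returned is decoder_only                              # same object, so the caller is unaffected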

Contributor

Ah, good point!

I think a cleaner approach would move the conversion logic to the decoder-only classes. In other words, the shared classes always assume EncoderDecoderCache, while the decoder-only AutoModelFor... classes hold the conversion logic. This would keep the reference classes and their docs as clean as possible, with the expansions (decoder-only) being responsible for the adaptation.

WDYT?
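A minimal sketch of that split (the helper name _as_encoder_decoder_cache is hypothetical):

from transformers.cache_utils import Cache, DynamicCache, EncoderDecoderCache

def _as_encoder_decoder_cache(past_key_values):
    # called by the decoder-only entry point before delegating, so the shared
    # trunk can assume an EncoderDecoderCache unconditionally
    if isinstance(past_key_values, Cache) and not isinstance(past_key_values, EncoderDecoderCache):
        return EncoderDecoderCache(past_key_values, DynamicCache())
    return past_key_values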

Collaborator

The thing is, this can be completely done inside the EncoderDecoderCache.
Something like:

class EncoderDecoderCache:
    def __new__(cls, past_key_values):
        # legacy tuple input: build the instance via the existing classmethod
        if not isinstance(past_key_values, Cache):
            return cls.from_legacy_cache(past_key_values)
        return super().__new__(cls)

    def __init__(self, past_key_values):
        if isinstance(past_key_values, Cache) and not isinstance(past_key_values, EncoderDecoderCache):
            ...  # do the init: wrap the decoder-only cache

Collaborator

then the only check left to do in the model is:

if use_cache or past_key_values is not None:

Contributor

@ArthurZucker that handles cache initialization, but doesn't solve the part where the whisper class may selectively return a Cache in certain circumstances (as opposed to an EncoderDecoderCache).

IMO, the root issue comes from the PR that introduced this Cache<>EncoderDecoderCache logic: to minimize cross-class dependencies, the more recent decoder-only model should wrap its Cache into an EncoderDecoderCache before calling the main trunk of Whisper. This is a solution without if/elses where all main classes can be kept without cross-references :)

@vasqu (Contributor, Author) commented

I feel like this change to the cache is out of scope for this PR. It's affecting multiple models not related to whisper (e.g. bart). It would make more sense to tackle this in a separate PR that covers all of these models.

There are two options imo:

  1. Update cache logic to move the decoder-only cache logic out (Joao's suggestion)
  2. Update docstrings to reflect the logic that happens here

Contributor

Yeah, makes sense to do it in another PR :)

Collaborator

Yeah, in general we want to move away from sub modules returning cache classes!

@gante (Contributor) left a comment

(Happy with the PR, with the exception of the related but out-of-scope issue discussed above)

@Cyrilvallez (Member) commented

Please see my comments #38235 (comment) here before merging! I believe we can make it much cleaner to set a proper example for all similar models!

@Cyrilvallez (Member) left a comment

LGTM, just left 2 last comments to try to simplify the code as much as possible (especially the part on general FA2 attention as it would unnecessarily impact every model here I think)! Thanks for making the changes 🤗

Comment on lines +35 to +36
if attention_mask is not None and attention_mask.ndim == 2:
attention_mask = attention_mask[:, : key.shape[-2]]
@Cyrilvallez (Member) commented May 28, 2025

Is this still needed based on latest mask refactors in the modeling? It should already have been taken care of upstream in the mask creation function

@vasqu (Contributor, Author) commented

Seems to be an old relic from the previous mask creation logic which caused this :D removed it

Comment on lines +936 to +937
cache_position=cache_position,
past_key_values=past_key_values.self_attention_cache if past_key_values is not None else None,
Member

Here we should not need the if/else with the change to the Cache itself 🤗

@vasqu (Contributor, Author) commented

Yup good point! Also changed to just return past_kvs

@ArthurZucker (Collaborator) left a comment

Very happy with the changes! 🤗

-    cache_position,
-    past_key_values.self_attention_cache if past_key_values is not None else None,
-    output_attentions,
+    causal_mask = create_causal_mask(
Collaborator

haha damn the change is soooo much cleaner!

@vasqu (Contributor, Author) commented May 28, 2025

Running slow tests locally and then merging (if nothing's broken)

Edit: slow tests pass as expected

@vasqu merged commit badc71b into main May 28, 2025
21 checks passed
@vasqu deleted the vas-whisper-attn-refactor branch May 28, 2025 11:32