🚨 [Attn] New attn mask interface everywhere #42848
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: gpt2, mllama, opt, biogpt, blt, decision_transformer
This comment contains models: ["models/biogpt", "models/blt", "models/decision_transformer", "models/gpt2", "models/mllama", "models/opt"]
CI Results / Model CI Report: ❌ Failed tests
run-slow: bert, bert_generation, blt, data2vec, decision_transformer, electra, ernie, glm46v, glm4v, glm4v_moe, gpt2, mllama, opt, paddleocr_vl, roberta, roberta_prelayernorm
This comment contains models: ["models/bert", "models/bert_generation", "models/blt", "models/data2vec", "models/decision_transformer", "models/electra", "models/ernie", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/gpt2", "models/mllama", "models/opt", "models/paddleocr_vl", "models/roberta", "models/roberta_prelayernorm"]
CI Results / Model CI Report: ❌ Failed tests
[FA] Fix paddingfree tests to properly consider position ids and default create a mask
[Attn] More new interface switches and proper paddingfree test
run-slow: bamba, falcon_h1, mllama, moshi, zamba2, zamba, pop2piano
This comment contains models: ["models/bamba", "models/falcon_h1", "models/mllama", "models/moshi", "models/pop2piano", "models/zamba", "models/zamba2"]
[For maintainers] Suggested jobs to run (before merge) run-slow: altclip, autoformer, bamba, bark, bigbird_pegasus, bloom, blt, clipseg, clvp, codegen, conditional_detr, dab_detr, decision_transformer, detr, falcon, falcon_h1
vasqu left a comment:
Self-review on points that may seem weird, to clarify a bit.

Just reapplied modular
```python
    attention_mask: torch.Tensor | None = None,
    causal_attention_mask: torch.Tensor | None = None,
    output_attentions: bool | None = False,
    **kwargs,
) -> tuple[torch.Tensor, torch.Tensor | None]:
```
You will see this pattern a few times; it comes from an old CLIP implementation which built the padding mask and the causal (naive triu) mask separately and then added them up - we now create both at the same time.
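For context, a minimal sketch of that legacy two-mask pattern (illustrative helper names and shapes, not the actual CLIP code):

```python
import torch

# Legacy pattern (sketch): padding and causal masks were built separately as
# additive float masks and summed before the attention softmax.
def make_padding_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # [bsz, src_len] -> [bsz, 1, 1, src_len]: 0.0 where attended, -inf where padded
    inverted = 1.0 - attention_mask[:, None, None, :].to(dtype)
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

def make_causal_mask(seq_len: int, dtype: torch.dtype) -> torch.Tensor:
    # naive triu: -inf above the diagonal, 0.0 on and below it
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min)
    return torch.triu(mask, diagonal=1)[None, None, :, :].to(dtype)

attention_mask = torch.tensor([[1, 1, 1, 0]])
combined = make_padding_mask(attention_mask, torch.float32) + make_causal_mask(4, torch.float32)
```

The new interface produces the equivalent combined mask in a single pass instead of summing the two inside the layer.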
```python
kwargs.pop("is_causal", None)
encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    attention_mask=attention_mask,
    causal_attention_mask=causal_attention_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
    is_causal=True,
    **kwargs,
```
Technically not needed in a few models, but it's the same as in CLIP, so I'd rather preemptively add the correct kwargs in case someone refactors these models.
```python
if not module.is_cross_attention:
    # if only "normal" attention layer implements causal mask
    query_length, key_length = query.size(-2), key.size(-2)
    causal_mask = module.bias[:, :, key_length - query_length : key_length, :key_length]
    mask_value = torch.finfo(attn_weights.dtype).min
    # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
    # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
    attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
```
Mentioning it here, but it comes from gpt2 - what essentially happened back in the day was to create the padding mask and then add a triu buffer on top, i.e. padding + causal (might remind you of what happened with the CLIP-likes).
I have checked that these buffers are ignored on load now and results are the same.
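For reference, that buffer follows the usual old-style registration (a sketch of the legacy pattern; `max_positions` would come from the config):

```python
import torch
from torch import nn

class LegacyCausalAttention(nn.Module):
    # Sketch: a lower-triangular bool buffer is stored on the module and
    # sliced per forward call, while the padding mask is applied separately
    # on top (padding + causal).
    def __init__(self, max_positions: int):
        super().__init__()
        self.register_buffer(
            "bias",
            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view(
                1, 1, max_positions, max_positions
            ),
        )

    def causal_slice(self, query_length: int, key_length: int) -> torch.Tensor:
        # sliced exactly like in the hunk above
        return self.bias[:, :, key_length - query_length : key_length, :key_length]
```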
```python
causal_mask = create_causal_mask(
    config=self.config,
    input_embeds=inputs_embeds,
    attention_mask=attention_mask,
    cache_position=cache_position,
    past_key_values=past_key_values,
    # Force mask creation for alibi
    and_mask_function=lambda *args: torch.tensor(True, dtype=torch.bool),
)
if alibi is not None and causal_mask is not None and causal_mask.ndim == 4:
    min_dtype = torch.finfo(inputs_embeds.dtype).min

    # Only using non-bool mask for alibi
    if causal_mask.dtype == torch.bool:
        causal_mask = torch.where(
            causal_mask, torch.tensor(0.0, device=causal_mask.device, dtype=inputs_embeds.dtype), min_dtype
        )

    # We take care to integrate alibi bias in the causal_mask here
    alibi = alibi.reshape(batch_size, -1, *alibi.shape[1:])
    causal_mask = torch.masked_fill(
        alibi / math.sqrt(self.config.hidden_size // self.num_heads),
        causal_mask < -1,
        min_dtype,
    )
```
This is a bit special, but I went as far as I could with what is available - alibi is applied on top of the mask, so it needs a float mask.
yep, I don't really think our API supports this better than what you did. Though this only works if the mask is not a BlockMask from flex, right?
Yup, but the flags don't support flex, so we are fine.
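To make the constraint concrete, a standalone sketch (made-up shapes and slopes) of why alibi forces a float mask:

```python
import torch

# Alibi is an additive per-head bias on the attention scores, so it must be
# folded into an additive float mask - a boolean keep/drop mask has nowhere
# to store the bias values.
batch, heads, q_len, k_len = 1, 2, 4, 4
bool_mask = torch.tril(torch.ones(q_len, k_len, dtype=torch.bool))[None, None]

# boolean mask -> additive float mask (0.0 = keep, large negative = drop)
min_val = torch.finfo(torch.float32).min
float_mask = torch.zeros(batch, heads, q_len, k_len).masked_fill(~bool_mask, min_val)

# hypothetical alibi bias: slope * key_position per head
slopes = torch.tensor([0.5, 0.25]).view(1, heads, 1, 1)
alibi = (slopes * torch.arange(k_len).view(1, 1, 1, k_len)).expand(batch, heads, q_len, k_len)

# same masked_fill trick as in the snippet above: keep the bias where the mask
# allows attention, overwrite with -inf where it doesn't
combined = alibi.masked_fill(float_mask < -1, min_val)
```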
```python
# We need to prepare position ids according to the attention mask as we use it to extract embeddings that
# rely on the correct position - naively increasing sequences do not suffice anymore atp. The solution here
# calculates an increasing sequence for all 1s and puts 0s elsewhere.
inputs_dict["position_ids"] = ((inputs_dict["attention_mask"] == 1).long().cumsum(dim=1) - 1) * (
    inputs_dict["attention_mask"] == 1
).long()
```
This is super important and allows native support for absolute position embeddings like bert without overwriting their tests.
It makes the tests work, but I'm unsure if we want this.
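A quick demo of what that expression computes, on a made-up mask: positions count up only over real tokens, and padded slots are forced to 0 instead of increasing naively.

```python
import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],   # left-padded
                               [1, 1, 1, 1, 0]])  # right-padded
position_ids = ((attention_mask == 1).long().cumsum(dim=1) - 1) * (attention_mask == 1).long()
print(position_ids)
# tensor([[0, 0, 0, 1, 2],
#         [0, 1, 2, 3, 0]])
```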
```python
    device=torch_device,
)
inputs_dict["input_ids"] = inputs_dict["labels"]
inputs_dict["attention_mask"] = torch.tril(torch.ones_like(inputs_dict["input_ids"]).to(torch_device))
```
This allows us to test gpt2 on the padding-free tests - it was skipped before.
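Sketched on a tiny batch, the tril trick gives every row a different number of attended tokens, i.e. the varied right-padding the padding-free comparison needs:

```python
import torch

input_ids = torch.ones(3, 4, dtype=torch.long)
attention_mask = torch.tril(torch.ones_like(input_ids))
print(attention_mask)
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0]])
```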
Low usage, not worth fixing imo with a lot of custom stuff happening - it's the model most closely related to the old API.
```python
# the following models should have been PreTrainedModels
"Owlv2TextTransformer",
"Owlv2VisionTransformer",
"OwlViTTextTransformer",
"OwlViTVisionTransformer",
"XCLIPTextTransformer",
"CLIPSegTextTransformer",
"DetrDecoder",
"GroupViTTextTransformer",
"CLIPTextTransformer",
"CLIPVisionTransformer",
"MetaClip2TextTransformer",
"MetaClip2VisionTransformer",
"MLCDVisionTransformer",
# end of should have beens
```
Same as in Cyril's PR, but now for attention-related things :)
ArthurZucker left a comment:
Huge cleanup, much welcome.
Got the idea with passing some models to pretrained ones; not super sure we should, vs placing the mask creation code in the parent class that uses it.
that's a really nice cleanup!
```python
output_attentions: bool | None = None,
output_hidden_states: bool | None = None,
return_dict: bool | None = None,
**kwargs,
```
I think we should TypedDict-enforce the kwargs for all the ones you added.
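A sketch of the TypedDict pattern being suggested (illustrative class and kwarg names, not the exact ones in the library):

```python
from typing import TypedDict

import torch
from torch import nn
from typing_extensions import Unpack


class AttentionKwargs(TypedDict, total=False):
    # illustrative keys; the real set would mirror what the layers accept
    is_causal: bool
    output_attentions: bool


class Encoder(nn.Module):
    def forward(self, hidden_states: torch.Tensor, **kwargs: Unpack[AttentionKwargs]) -> torch.Tensor:
        # static checkers now flag unknown or mistyped kwargs at call sites
        return hidden_states
```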
```diff
-class GroupViTTextTransformer(nn.Module):
+class GroupViTTextTransformer(GroupViTPreTrainedModel):
```
that's weird because the config object should just be passed to the encoder, no? (meaning changing it globally, not for GroupViTTextTransformer but for the one that contains GroupViTTextTransformer).
But yeah, not a big deal - a lot of these vision models have shitty design with wrappers around wrappers
```python
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)

kwargs.pop("is_causal", None)
```
arf not super super clean but fine
Yea, it's not ideal tbh :(
```python
class Wav2Vec2BertAdapterLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
```
bit weird not to have this one go to PreTrainedModel but no worries
I have rechecked those; they don't need it as of 28f8a74 - I initially did pretrained models there too, but they have proper top modules which handle setting the attention etc.
```python
self.assertEqual(position_ids.shape, expected_positions.shape)
self.assertTrue(torch.all(torch.eq(position_ids, expected_positions)))

def attention_mask_padding_matches_padding_free_with_position_ids(
```
So attention_mask_padding_matches_padding_free_with_position_ids from GenerationTesterMixin now works, that's actually cool, ty!
Yup, the position ids are way more sensitive for absolute position embeddings than for RoPE, so this was silently producing wrong positions at times.
```python
@unittest.skip(reason="doesn't support padding yet")
def test_eager_matches_sdpa_inference_1_bfloat16(self):

# TODO: vasqu
@unittest.skip(reason="why the heck does this have bigger tols")
```
[For maintainers] Suggested jobs to run (before merge) run-slow: align, altclip, autoformer, bamba, bark, bigbird_pegasus, bloom, blt, chinese_clip, clap, clipseg, clvp, codegen, conditional_detr, dab_detr, decision_transformer
* fix
* fix order
* style
* vision 3d rope get extra test for now
* fix gpt2
* more gpt2 fixes
* let's see...
* fix
* test
* fix opt+biogpt
* fix
* fix
* fix
* fix opt
* mask exchange test
* style
* several small fixes
* shouldnt be needed
* fix zamba models
* retrigger ci
* force skip for now
* this wont work, will fix step by step
* to git
* another batch
* fix a few models, clip related models are gonna be hard...
* another batch
* style
* fix gpt2 attempt
* another batch + some models do not set their attn implementation? TODO
* fix
* last models
* style
* repo fix
* check
* some quick fixes, error to catch wrong inits in some models
* small fixes
* fixes for wrong mask pretrained model relation
* fix
* remove mask defaulting --> that's part of the prep + fixup some other tests
* small fixes
* fix last few models --> last to check recurrent gemma + repo consistency
* fixup test cleanup
* revert these tests
* these were not necessary, they have a proper top module
* fixup kwargs
* remove old API
* more kwargs
* let's revert this - im in a fork :D
* fix
* dang
* revert removal and add deprecation msg
* kwargs typing
* style
As per title ~