SDPA for T5 Attention #31167
Open
huseinzol05 wants to merge 5 commits into huggingface:main
Conversation
ArthurZucker reviewed Jun 6, 2024
ArthurZucker (Collaborator) left a comment:
Hey! Could you make sure the CIs go green!? 🤗
huseinzol05 (Contributor, Author): Hi! I'm sorry, what is CIs?
ArthurZucker (Collaborator): Hey! It is the integration tests right below this message that are all red!
huseinzol05 (Contributor, Author): Passed, except for the quality check.
ArthurZucker reviewed Oct 3, 2024
ArthurZucker (Collaborator) left a comment:
Thanks for working on this! Let's try to re-use what we have in other modeling code to have consistent standards 🤗
Comment on lines +625 to +631:
```python
def shape(states):
    """projection"""
    return states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).transpose(1, 2)

def unshape(states):
    """reshape"""
    return states.transpose(1, 2).contiguous().view(batch_size, -1, self.inner_dim)
```
ArthurZucker (Collaborator): No, let's remove these one-liners.
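For context, removing the one-liners means inlining the `view`/`transpose` calls at their use sites, as newer modeling files in the library do. Here is a self-contained sketch of that inlining pattern (hypothetical helper and shapes, not the PR's final code):

```python
import torch

def inlined_reshape_example(hidden_states: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Standalone demo of inlining the former `shape`/`unshape` one-liners (sketch, not PR code)."""
    batch_size, seq_length, inner_dim = hidden_states.shape
    key_value_proj_dim = inner_dim // n_heads
    # Inline the former `shape` one-liner at its use site:
    states = hidden_states.view(batch_size, seq_length, n_heads, key_value_proj_dim).transpose(1, 2)
    # ... attention would happen here, on (batch_size, n_heads, seq_length, dim_per_head) ...
    # Inline the former `unshape` one-liner on the way out:
    return states.transpose(1, 2).contiguous().view(batch_size, seq_length, inner_dim)

out = inlined_reshape_example(torch.randn(2, 7, 512), n_heads=8)
assert out.shape == (2, 7, 512)
```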
Comment on lines +633 to +658:
```python
def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        elif past_key_value.shape[2] != key_value_states.shape[1]:
            # checking that the `sequence_length` of the `past_key_value` is the same as
            # the provided `key_value_states` to support prefix tuning
            # cross-attn
            # (batch_size, n_heads, seq_length, dim_per_head)
            hidden_states = shape(proj_layer(key_value_states))
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states
```
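To make the caching branch above concrete, here is a toy, self-contained illustration of the self-attention path (hypothetical sizes, not from the PR): the freshly projected key is appended to the cached keys along the sequence axis (`dim=2`).

```python
import torch

# Toy demo of the self-attention cache branch: new key states are concatenated
# onto the cached ones along the sequence axis (dim=2).
batch_size, n_heads, dim_per_head = 2, 8, 64
past_key = torch.randn(batch_size, n_heads, 10, dim_per_head)  # 10 cached positions
new_key = torch.randn(batch_size, n_heads, 1, dim_per_head)    # 1 freshly projected position
key_states = torch.cat([past_key, new_key], dim=2)
assert key_states.shape == (batch_size, n_heads, 11, dim_per_head)
```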
Comment on lines +660 to +688:
```python
# get query states
query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)

# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length),
            device=query_states.device,
            dtype=query_states.dtype,
        )
        if self.gradient_checkpointing and self.training:
            position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length, device=query_states.device)

    # if key and values are already calculated
    # we want only the last query position bias
    if past_key_value is not None:
        position_bias = position_bias[:, :, -hidden_states.size(1) :, :]

    if mask is not None:
        position_bias = position_bias + mask  # (batch_size, n_heads, seq_length, key_length)
```
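The hunk above stops just short of the actual attention call. For orientation, here is a self-contained sketch (not the PR's literal diff) of how these tensors would feed into PyTorch's fused kernel. Two T5-specific details matter: T5 does not scale attention scores by 1/sqrt(d), so the kernel's default scaling must be disabled via `scale=1.0` (available in torch >= 2.1), and the relative-position bias (plus any padding/causal mask) goes in as the additive `attn_mask`.

```python
import torch
import torch.nn.functional as F

# Stand-in tensors with the shapes used in the hunk above (hypothetical sizes):
batch_size, n_heads, seq_length, dim_per_head = 2, 8, 7, 64
query_states = torch.randn(batch_size, n_heads, seq_length, dim_per_head)
key_states = torch.randn(batch_size, n_heads, seq_length, dim_per_head)
value_states = torch.randn(batch_size, n_heads, seq_length, dim_per_head)
position_bias = torch.randn(1, n_heads, seq_length, seq_length)

attn_output = F.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=position_bias,  # additive bias, broadcast over the batch dimension
    dropout_p=0.0,
    scale=1.0,  # T5 attention is unscaled, so disable the default 1/sqrt(d) factor
)
assert attn_output.shape == (batch_size, n_heads, seq_length, dim_per_head)
```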
ArthurZucker (Collaborator): The closer we are to UMT5 or WhisperSDPAAttention, the better! 🤗
huseinzol05 (Contributor, Author): Noted, I will try to patch it.
@huseinzol05 @ArthurZucker how are we feeling about these changes now?
What does this PR do?
Adds SDPA (PyTorch's torch.nn.functional.scaled_dot_product_attention) support for T5 Attention.
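Once merged, users would opt in the same way other SDPA-enabled architectures in transformers already do, via the `attn_implementation` switch. A hedged usage sketch (the checkpoint name is just an example):

```python
from transformers import T5ForConditionalGeneration

# Select the SDPA attention implementation at load time, as with other
# SDPA-enabled models in the library:
model = T5ForConditionalGeneration.from_pretrained(
    "google-t5/t5-small",
    attn_implementation="sdpa",
)
```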