Proper performant flex attention implementation #36103
bursteratom wants to merge 22 commits into huggingface:main
Conversation
vasqu
left a comment
A fan of flex attn :) Hope you don't mind the comments, but overall I'm pro this. Torchtune is optimized for training iirc; is creating the block mask OK for inference, speed-wise? I have no idea whether there are downsides/advantages to one or the other.
Not sure if this is relevant to the PR tbh, but benchmarks might be a good thing to look out for in the future.
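(Not a real benchmark, just a minimal sketch of how one could time the upstream `create_block_mask` call in isolation; the sequence length and device are illustrative.)

```python
import time
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    # standard causal constraint: a query attends only to earlier (or equal) positions
    return q_idx >= kv_idx

# crude timing of the BlockMask construction alone, to get a feel for whether
# paying this cost on every forward pass would hurt inference latency
torch.cuda.synchronize()
start = time.perf_counter()
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=4096, KV_LEN=4096, device="cuda")
torch.cuda.synchronize()
print(f"BlockMask creation took {(time.perf_counter() - start) * 1e3:.2f} ms")
```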
| """ | ||
| Inspired by torchtune's flex attention implementation | ||
| """ |
Nit: would move this to top of the file
Yep! And we forgot to add a licence!
@ArthurZucker can you point me to an example of how a proper licence string should be added?
I think something along these lines is meant: transformers/src/transformers/models/siglip2/modular_siglip2.py, lines 1 to 14 in d18d9c3
(you can add the torchtune or pytorch team imo, not sure how fine-grained it should be)
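For reference, the repo-wide header is the standard Apache 2.0 one, roughly along these lines (copyright year and holders to be adapted, e.g. crediting the torchtune/PyTorch authors):

```python
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```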
Example for inference with flex attn: meta-pytorch/gpt-fast#196. At first glance I can spot a few things:
I think avoiding recreating the block mask is especially important here to avoid the memory/speed overhead, but I'm not sure since I haven't measured speed/memory myself. It might be more appropriate for a different PR, no idea; I just think inference in particular should be handled with more care.
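To make the "avoid recreating the block mask" point concrete, a purely hypothetical caching sketch against the upstream API (not this PR's code): the mask is built once for the maximum length and then reused at every decoding step.

```python
from torch.nn.attention.flex_attention import create_block_mask

class CachedCausalBlockMask:
    """Hypothetical helper: build the causal BlockMask once and reuse it,
    instead of calling create_block_mask on every decoding step."""

    def __init__(self, max_seq_len: int, device: str = "cuda"):
        def causal(b, h, q_idx, kv_idx):
            return q_idx >= kv_idx

        # one-off construction cost, paid once at prefill time
        self.block_mask = create_block_mask(
            causal, B=None, H=None, Q_LEN=max_seq_len, KV_LEN=max_seq_len, device=device
        )

    def get(self):
        # decoding steps reuse the cached mask rather than rebuilding it
        return self.block_mask
```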
ArthurZucker
left a comment
Very much needed! Thanks a lot 🤗
cc @molbap as you had issues with this recently!
molbap
left a comment
Big fan of this work! Thanks a lot for tackling it. I'd be interested in benchmarks especially in a couple models like PaliGemma and models with bidirectional attention 👀
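(For context, with flex attention the bidirectional case is just a different `mask_mod`; a hypothetical sketch, where `seq_lengths` is an assumed per-sample valid-length tensor for a padded batch.)

```python
import torch

seq_lengths = torch.tensor([5, 8])  # hypothetical valid lengths of two padded samples

def causal_mask_mod(b, h, q_idx, kv_idx):
    # decoder-style: a query attends only to earlier (or equal) positions
    return q_idx >= kv_idx

def bidirectional_mask_mod(b, h, q_idx, kv_idx):
    # encoder-style: every valid token attends to every other valid token;
    # only padding positions are masked out
    return (q_idx < seq_lengths[b]) & (kv_idx < seq_lengths[b])
```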
Force-pushed from 516de45 to ae9a2b0
@vasqu @molbap @ArthurZucker I made some changes according to your input; wondering if you can give it another pass? Thank you!
Force-pushed from 7519b3c to 8c28c9d
vasqu
left a comment
Honestly, I think the core is fine, just a few nits and smaller things. Would leave inference for another PR :)
return create_block_causal_mask_flex(
    causal_mask_mod,
    batch_size,
    None,
    Q_LEN=total_seq_len,
    KV_LEN=total_seq_len,
    device=device,
)
I think my marking last time made it a bit confusing: passing all arguments as keyword arguments would be beneficial imo, especially the None arg (attention heads).
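i.e. something along these lines (the keyword names here mirror PyTorch's `create_block_mask` and are only assumed to match the helper's actual signature):

```python
return create_block_causal_mask_flex(
    mask_mod=causal_mask_mod,
    B=batch_size,
    H=None,  # None broadcasts the mask over attention heads
    Q_LEN=total_seq_len,
    KV_LEN=total_seq_len,
    device=device,
)
```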
Force-pushed from 432bafa to 7483314
ArthurZucker
left a comment
Very nice!
Just missing some docs / small perf comparisons!
Force-pushed from fb9c4c6 to 3d9377f
Force-pushed from 99e62c0 to c50468c
Force-pushed from 8aaeda8 to 864efb2
ArthurZucker
left a comment
LGTM! Could you just add some documentation with a perf comparison! 🤗
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@ArthurZucker thank you! I will add the doc and perf comparison shortly! I'm wondering where in the
Let's merge for now IMO and you can open a new PR for the doc!
See #36643, which was needed to fix the conflicts.
We can close, the PR is merged! 🤗
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the `Ready for review` button (at the bottom of the PR page).
What does this PR do?
The current flex attention implementation does not take advantage of the performance and memory efficiency promised in this official blog post from PyTorch.
This PR, inspired by https://github.com/pytorch/torchtune/blob/main/torchtune/modules/attention_utils.py, rectifies that by always compiling flex attention and by using the sparsity-optimised BlockMask data type for attention masking in lieu of regular torch tensors. Performance and memory utilization are now comparable to flash attention.
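As a rough illustration of the underlying pattern (a minimal sketch against PyTorch's public flex attention API, not the exact code added in this PR; shapes, dtype and device are arbitrary):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# compiling flex_attention is what unlocks the fused kernel and the memory savings
flex_attention_compiled = torch.compile(flex_attention, dynamic=False)

def causal_mask_mod(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

# the sparsity-aware BlockMask replaces a dense boolean attention-mask tensor
block_mask = create_block_mask(causal_mask_mod, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention_compiled(q, k, v, block_mask=block_mask)
```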
BlockMask creation has been implemented for the following models:
Let's add support for other models in a separate PR
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.