Use new attention API for LayoutLMv3 (SDPA, Flash Attn v2 support) #35469
stancld wants to merge 2 commits into huggingface:main
Conversation
Force-pushed c5de661 to 923cdea
Just pinging @NielsRogge or @ArthurZucker :)
Thanks for adding a test, but we have a common test `test_eager_matches_sdpa_inference`, so there's no need for another one (it's enabled once `_supports_sdpa = True` is set). But it's great to see this one pass.
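(For readers following along, a minimal sketch of what enabling the common test looks like; the class name mirrors the one in modeling_layoutlmv3.py, and the body is illustrative, not the PR's code.)

```py
from transformers import PreTrainedModel

# Sketch: the shared test_eager_matches_sdpa_inference is picked up once the
# model's PreTrainedModel subclass declares SDPA support via this class flag.
class LayoutLMv3PreTrainedModel(PreTrainedModel):
    _supports_sdpa = True
```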
Force-pushed 923cdea to 7853314
@qubvel Thanks for the notes :] Will run some speed benchmarks with various seq lens & batch sizes tonight and add them to the docs :]
Reviewed hunk:

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.

Suggested change:

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of [torch.nn.functional](https://pytorch.org/docs/stable/nn.functional.html). This function
encompasses several memory-efficient attention implementations that can be applied depending on the inputs and hardware. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
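(To make the doc text concrete, a minimal, standalone call to the operator both versions link to; shapes are illustrative and not part of the PR.)

```py
import torch
import torch.nn.functional as F

# Minimal call to the operator described above; tensors have shape
# (batch, heads, seq_len, head_dim) and the values are arbitrary.
query = torch.randn(2, 12, 16, 64)
key = torch.randn(2, 12, 16, 64)
value = torch.randn(2, 12, 16, 64)
out = F.scaled_dot_product_attention(query, key, value)  # dispatches to a fused kernel when available
print(out.shape)  # torch.Size([2, 12, 16, 64])
```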
Reviewed hunk:

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

```
from transformers import LayoutLMv3Model

model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base", torch_dtype=torch.float16, attn_implementation="sdpa")
...
```

For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).

Suggested change:

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA.
For the best speedups, we recommend loading the model in half-precision (`torch.float16` or `torch.bfloat16`).

```py
from transformers import LayoutLMv3Model
model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base", torch_dtype=torch.float16, attn_implementation="sdpa")
```
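(The PR title also advertises Flash Attention 2; presumably the analogous explicit request looks like the sketch below. It assumes the `flash-attn` package is installed and a supported CUDA GPU is available; the checkpoint id simply mirrors the SDPA snippet.)

```py
import torch
from transformers import LayoutLMv3Model

# Sketch: requesting the Flash Attention 2 backend added by this PR.
# FA2 requires half-precision and a CUDA device.
model = LayoutLMv3Model.from_pretrained(
    "microsoft/layoutlmv3-base",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```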
Force-pushed 7853314 to f324038
ArthurZucker left a comment:
Very welcome! Sorry for delaying the review; I saw that it was not clean (unrelated example files).
If you want to push this through, let's maybe use the latest API for the attention interface (see modeling Llama's attention layer)!
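(For reference, a condensed sketch of the pattern being referred to, as it appears in recent transformers versions (roughly 4.48+); the registry lookup is the library's API, while the in-class excerpt is paraphrased from modeling_llama.py and may differ across versions.)

```py
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

# Registered attention backends; the exact set depends on the installed version.
print(list(ALL_ATTENTION_FUNCTIONS.keys()))  # e.g. ['flash_attention_2', 'flex_attention', 'sdpa']

# Condensed from LlamaAttention.forward: the layer dispatches through the
# registry instead of hard-coding eager/SDPA/flash branches.
#
#     attention_interface = eager_attention_forward
#     if self.config._attn_implementation != "eager":
#         attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
#     attn_output, attn_weights = attention_interface(
#         self, query_states, key_states, value_states, attention_mask,
#         dropout=0.0 if not self.training else self.attention_dropout,
#         scaling=self.scaling,
#     )
```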
Force-pushed e469f2d to a5828b2
@ArthurZucker It required a rebase; not sure why it looked like those unrelated files were touched. Will check the new API.
Force-pushed a5828b2 to d9c6c88
Force-pushed d9c6c88 to 37a9581
Force-pushed 060b80d to b01d58f
Force-pushed b01d58f to 7b51a17
@ArthurZucker The Flash Attn implementation is still broken here. Will have a look as time allows :]
Force-pushed 7b51a17 to 172f327
Reviewed hunk:

```py
self.self = nn.ModuleDict(
    {
        "query": nn.Linear(config.hidden_size, config.num_attention_heads * self.attention_head_size),
        "key": nn.Linear(config.hidden_size, config.num_attention_heads * self.attention_head_size),
        "value": nn.Linear(config.hidden_size, config.num_attention_heads * self.attention_head_size),
    }
```
guessing we cannot remove this for BC! OK 🤗
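(A quick illustration of the BC point: the `ModuleDict` keys reproduce the original `attention.self.query`/`key`/`value` parameter paths, so existing checkpoints keep loading. The hidden size 768 below is just LayoutLMv3-base's default.)

```py
import torch.nn as nn

# The submodule names become state_dict key prefixes, matching what existing
# LayoutLMv3 checkpoints were serialized with (...attention.self.query.weight, etc.).
attn_self = nn.ModuleDict(
    {
        "query": nn.Linear(768, 768),
        "key": nn.Linear(768, 768),
        "value": nn.Linear(768, 768),
    }
)
print(sorted(attn_self.state_dict().keys()))
# ['key.bias', 'key.weight', 'query.bias', 'query.weight', 'value.bias', 'value.weight']
```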
Reviewed hunk:

```py
)

def _cogview_attention(attention_scores: torch.Tensor, alpha: Union[int, float] = 32) -> torch.Tensor:
```
We don't really need a separate function, but if we keep it, place it at the top please.
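(For context, a sketch of what the helper presumably computes: the CogView trick of scaling scores down by alpha and subtracting the row max before softmax, which keeps fp16 attention numerically stable. The body below is illustrative, modeled on the existing cogview_attention in modeling_layoutlmv3.py.)

```py
from typing import Union

import torch

# Illustrative reimplementation of the CogView attention-stabilization trick,
# mirroring the helper's signature in the hunk above.
def _cogview_attention(attention_scores: torch.Tensor, alpha: Union[int, float] = 32) -> torch.Tensor:
    scaled_attention_scores = attention_scores / alpha          # shrink scores to avoid fp16 overflow
    max_value = scaled_attention_scores.amax(dim=-1).unsqueeze(-1)
    new_attention_scores = (scaled_attention_scores - max_value) * alpha
    return new_attention_scores.softmax(dim=-1)

probs = _cogview_attention(torch.randn(2, 12, 16, 16))  # rows sum to 1
```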
One nit: solve conflicts and good to go!
What does this PR do?
Closes #35467.
Performance benchmark
Speed & memory consumption on token classification training of a LayoutLMv3-like model with multilingual support, various auxiliary tasks, and masked language modelling.
GPU: 1x A100 80 GB
Batch size: 16, Accumulated gradient batches: 8
Overall, a ~50% speed-up and memory-requirement reduction is observed.
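(Not from the PR: a hypothetical micro-benchmark showing how such numbers could be reproduced for inference; the model id, input shapes, and step count are illustrative, and the training setup above used different tooling.)

```py
import torch
from transformers import LayoutLMv3Model

# Time the text-only forward pass for a given attention implementation.
def time_forward(attn_impl: str, batch_size: int = 16, seq_len: int = 512, steps: int = 20) -> float:
    model = LayoutLMv3Model.from_pretrained(
        "microsoft/layoutlmv3-base",
        torch_dtype=torch.float16,
        attn_implementation=attn_impl,
    ).to("cuda").eval()
    input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device="cuda")
    bbox = torch.zeros(batch_size, seq_len, 4, dtype=torch.long, device="cuda")  # dummy layout boxes
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    with torch.no_grad():
        for _ in range(steps):
            model(input_ids=input_ids, bbox=bbox)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / steps  # milliseconds per forward pass

for impl in ("eager", "sdpa"):
    print(impl, f"{time_forward(impl):.1f} ms")
```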
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
cc: @ArthurZucker