[Docs / BetterTransformer] Added more details about flash attention + SDPA #25265
younesbelkada merged 14 commits into huggingface:main
Conversation
The documentation is not available anymore as the PR was closed or merged.
stevhliu left a comment
Thanks for adding these additional details! 😄
> As of PyTorch 2.0, the attention fastpath is supported for both encoders and decoders. The list of supported architectures can be found [here](https://huggingface.co/docs/optimum/bettertransformer/overview#supported-models).
>
> For decoder-based models (e.g. GPT, T5, Llama, etc.), the `BetterTransformer` API will convert all attention operations to use the [`torch.nn.functional.scaled_dot_product_attention` method](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA), which is available only from PyTorch 2.0 onwards.
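For context, a minimal sketch of the conversion described above (the checkpoint name is illustrative, and `optimum` must be installed):

```python
import torch
from transformers import AutoModelForCausalLM

# Load a supported decoder model (checkpoint name is just an example)
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")

# Convert the attention layers to use torch.nn.functional.scaled_dot_product_attention
model = model.to_bettertransformer()
```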
Same comments for the rest of this section as in perf_infer_gpu_many.md (you can probably copy the changes over) :)
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Thanks a lot for the extensive review @stevhliu! 🎉
Thanks a lot, that is much better.
I'll release on the Optimum side to include huggingface/optimum#1225, which allows training with encoder models + SDPA as well.
It could be worth noting that a few models (Falcon, M4) are starting to have native SDPA support in transformers (though they may not dispatch to flash); see these discussions:
> For encoder models, the [`~PreTrainedModel.reverse_bettertransformer`] method reverts to the original model, which should be used before saving the model to use the canonical transformers modeling:
>
> ```python
> model = model.reverse_bettertransformer()
> model.save_pretrained("saved_model")
> ```
I think we should not make the distinction between encoder / decoder models when it comes to using `reverse_bettertransformer`.
For example, for encoder-decoder models (e.g. t5), both SDPA (in the decoder) and nested tensors (in the encoder) are used. So if one wants to save the model, they'll need to use `reverse_bettertransformer`.
To me, the distinction is rather that you can get speedups for inference with encoder models (since nested tensors are used), while for decoder models the speedup / dispatch to flash will only come (in PyTorch 2.0) for training, and for inference only at batch size = 1.
Thanks for the suggestion! I refactored that section a bit and removed the `reverse_bettertransformer` part, as it is relevant only for training (that section is for inference only).
> ```python
> # Use it for training or inference
> ```
>
> SDPA can also call [Flash-Attention](https://arxiv.org/abs/2205.14135) kernels under the hood. If you want to force the usage of Flash Attention, use [`torch.backends.cuda.sdp_kernel(enable_flash=True)`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel):
`torch.backends.cuda.sdp_kernel(enable_flash=True)` is not enough. You need `torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)` as below.
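Concretely, forcing the Flash Attention backend looks like this (a minimal sketch; `model` and `inputs` are assumed to be defined, e.g. as in the conversion example above):

```python
import torch

# Restrict SDPA to the Flash Attention kernel only, by disabling the
# math and memory-efficient fallback backends (PyTorch 2.0 API)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)
```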
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
stevhliu left a comment
Looks awesome! I added some minor comments to make it a bit easier to read, and if you could also copy the changes from perf_infer_gpu_many to their corresponding sections in perf_infer_gpu_one, that'd be great 🤗
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
ArthurZucker left a comment
Thanks for working on this! 🚀
What does this PR do?
As discussed offline with @LysandreJik.
This PR clarifies to users how it is possible to use Flash Attention as a backend for the most-used models in transformers. We have seen some questions from users asking whether it is possible to integrate Flash Attention into HF models, whereas you can already benefit from it when using `model.to_bettertransformer()`, leveraging the `BetterTransformer` API from 🤗 optimum. The information is based on the official documentation of `torch.nn.functional.scaled_dot_product_attention`. In the near future, we could also have a small blogpost explaining this as well.
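Putting the two pieces together, the pattern this PR documents is roughly the following (a hedged sketch; the checkpoint, tokenizer, and prompt are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any BetterTransformer-supported decoder model works
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")

# Convert attention layers to SDPA via the BetterTransformer API from optimum
model = model.to_bettertransformer()

inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to("cuda")

# Force dispatch to the Flash Attention kernel
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```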
To do list / To clarify list:
Let me know if I missed anything else
cc @fxmarty @MKhalusova @stevhliu