[FA-2] Add Flash Attention to Phi #27661
Conversation
@susnato can you try to run that test multiple times? Sometimes it is flaky. Apart from that, the changes look great on my end!
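For reference, one way to rerun a single test repeatedly from Python to check for flakiness; this is only a sketch, and the test file path and `-k` filter are assumptions about where the Phi flash-attention tests live in the transformers repo:

```python
# Sketch: rerun the suspected flaky test several times and stop on the first failure.
# Assumes pytest is installed and this is run from the transformers repository root.
import pytest

for run in range(30):
    exit_code = pytest.main(
        ["tests/models/phi/test_modeling_phi.py", "-k", "flash_attn", "-x", "-q"]
    )
    if exit_code != 0:
        print(f"Test failed on run {run + 1}")
        break
```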
Hi @younesbelkada, I ran that test 30 times and every time it failed! Shouldn't the inference test fail too, if the generation test fails? 😅
Hmm yes, correct. What I did for llama was to overwrite the test, as can be seen here: https://github.com/huggingface/transformers/blob/main/tests/models/llama/test_modeling_llama.py#L392 using a real checkpoint. It would be great if you could do the same and test that the next 10 tokens are the same (make sure to use
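A minimal sketch of what such an overridden test could look like, comparing the first tokens generated with and without Flash Attention 2. The `susnato/phi-1_5_dev` checkpoint name is taken from later in this thread, the test name is hypothetical, and the `attn_implementation` argument assumes a recent transformers release:

```python
import unittest

import torch
from transformers import AutoTokenizer, PhiForCausalLM
from transformers.testing_utils import require_flash_attn, require_torch_gpu, slow


class PhiFlashAttention2IntegrationTest(unittest.TestCase):
    @require_flash_attn
    @require_torch_gpu
    @slow
    def test_flash_attn_2_matches_eager_generation(self):
        # Checkpoint name taken from later in this thread.
        checkpoint = "susnato/phi-1_5_dev"
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")

        # Reference run with the default (eager) attention implementation.
        model = PhiForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")
        expected_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)

        # Same checkpoint loaded with Flash Attention 2.
        model_fa = PhiForCausalLM.from_pretrained(
            checkpoint, torch_dtype=torch.float16, attn_implementation="flash_attention_2"
        ).to("cuda")
        fa_ids = model_fa.generate(**inputs, max_new_tokens=10, do_sample=False)

        # The next 10 greedy tokens should be identical across implementations.
        self.assertTrue(torch.equal(expected_ids, fa_ids))
```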
Hi @younesbelkada, thanks a lot for the advice! All the flash attention tests are passing now. 🤗
younesbelkada
left a comment
Truly amazing work @susnato! Thanks a lot for this great contribution.
`# in fp32. (LlamaRMSNorm handles it correctly)` →
`# in fp32. (PhiRMSNorm handles it correctly)`
Hey, we don't have a `PhiRMSNorm`, only `nn.LayerNorm` in the attention layer, so I am removing this part of the line: "(PhiRMSNorm handles it correctly)".
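For context, the comment being edited is about casting the attention inputs back to half precision before calling the Flash Attention kernel. A rough standalone illustration of that logic, paraphrased rather than the exact transformers code:

```python
import torch
from torch import Tensor


def cast_qkv_for_flash_attention(
    query: Tensor, key: Tensor, value: Tensor, target_dtype: torch.dtype = torch.float16
) -> tuple[Tensor, Tensor, Tensor]:
    """Cast query/key/value back to a Flash-Attention-compatible dtype.

    Flash Attention kernels only support fp16/bf16. If the hidden states were
    silently upcast to float32 (for example by modules kept in fp32, such as the
    plain nn.LayerNorm in Phi's attention block), they must be cast back before
    calling the kernel; that is what the edited comment refers to.
    """
    if query.dtype == torch.float32:
        query = query.to(target_dtype)
        key = key.to(target_dtype)
        value = value.to(target_dtype)
    return query, key, value
```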
ArthurZucker
left a comment
LGTM other than the small comments
That's not something we want to remove; we should also have flash attention support in Persimmon, and it should be the same as this.
Actually, there is another PR which adds FA support for Persimmon.
Should I add `self.causal=True` in `PersimmonAttention` so that we can keep this `# Copied from` statement?
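For context, the `# Copied from` check requires the Phi attention code to stay textually in sync with the Persimmon code it was copied from, so the flag has to exist in both classes. A heavily abbreviated sketch of the idea, with the attribute name written as in the comment above and the class bodies reduced to the relevant line:

```python
import torch.nn as nn


class PersimmonAttention(nn.Module):
    """Source class: defining the flag here keeps the copied Phi class in sync."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.causal = True  # attribute name as used in the discussion above


# Copied from transformers.models.persimmon.modeling_persimmon.PersimmonAttention with Persimmon->Phi
class PhiAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.causal = True
```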
younesbelkada
left a comment
Thanks! We should be able to merge after accepting the suggestion below.
### Expected speedups

Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using the `susnato/phi-1_dev` checkpoint and the Flash Attention 2 version of the model, using a sequence length of 2048.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/phi_1_speedup_plot.jpg">
</div>
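To put the speedup comparison in context, loading the model with Flash Attention 2 roughly looks like the following. This is only a sketch: the checkpoint name comes from the suggested docs above, and the `attn_implementation` argument assumes a recent transformers release.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "susnato/phi-1_dev"  # dev checkpoint named in the suggested docs above

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("def print_prime(n):", return_tensors="pt").to("cuda")

# Flash Attention 2 needs half-precision weights and a CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")

output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```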
Hi @younesbelkada, I have pushed the suggestion you asked for.
BTW, when is the next release date for
Force-pushed from 8b9edbc to 9e22498
Just force-pushed the branch along with the changes. @younesbelkada
younesbelkada
left a comment
Thanks! We'll probably be able to make it for the next release, wdyt @ArthurZucker?
@pytest.mark.flash_attn_test
@slow
# Copied from tests.models.llama.test_modeling_llama.LlamaModelTest.test_flash_attn_2_generate_padding_right with LlamaForCausalLM->PhiForCausalLM,LlamaTokenizer->AutoTokenizer,meta-llama/Llama-2-7b-hf->susnato/phi-1_5_dev
def test_flash_attn_2_generate_padding_right(self):
ArthurZucker
left a comment
LGTM. cc @fxmarty for your SDPA PR, which will need rebasing I think.
yep


What does this PR do?
This PR adds Flash Attention to Phi.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
cc: @younesbelkada, @ArthurZucker