
[Mixtral & Mistral] Add support for sdpa#28133

Merged
ArthurZucker merged 17 commits into main from add-mixtral-sdpa
Dec 21, 2023

Conversation

Collaborator

@ArthurZucker ArthurZucker commented Dec 19, 2023

What does this PR do?

Adds SDPA attention for both model classes. cc @younesbelkada for visibility 😉 This will help with fast LLaVA.
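For context, "SDPA" here means routing the attention computation through PyTorch's torch.nn.functional.scaled_dot_product_attention instead of the eager matmul + softmax path. A minimal sketch of what such a call looks like (an illustration assuming PyTorch >= 2.0, not the actual MistralSdpaAttention/MixtralSdpaAttention implementation):

```python
import torch
import torch.nn.functional as F

def sdpa_attention(query, key, value, attention_mask=None, dropout_p=0.0):
    # query/key/value: (batch, num_heads, seq_len, head_dim)
    # attention_mask: optional additive float mask (0 / -inf) broadcastable to
    # (batch, num_heads, q_len, kv_len). When no mask is given, SDPA can build
    # the causal mask itself via is_causal=True.
    return F.scaled_dot_product_attention(
        query,
        key,
        value,
        attn_mask=attention_mask,
        dropout_p=dropout_p,
        is_causal=attention_mask is None,
    )
```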

@ArthurZucker ArthurZucker marked this pull request as ready for review December 20, 2023 18:05
Contributor

@younesbelkada younesbelkada left a comment


Thanks!
I don't see why sliding window attention shouldn't be supported with SDPA, since the only difference vs. the eager attention implementation is the attention mask. Passing arbitrary attention masks to SDPA should be supported without any problem, IMO.
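For illustration, a small sketch of the equivalence being described, with hypothetical helper names and assuming PyTorch >= 2.0: the eager path and SDPA consume the same additive mask, so any mask the eager implementation accepts can also be handed to SDPA.

```python
import torch
import torch.nn.functional as F

def eager_attention(q, k, v, mask):
    # Eager path: explicit matmul + softmax with an additive mask (0 / -inf).
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    scores = scores + mask
    return torch.matmul(torch.softmax(scores, dim=-1), v)

def sdpa_attention_with_mask(q, k, v, mask):
    # SDPA path: the same additive mask is passed straight through as attn_mask.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```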

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

ehuaa commented Feb 12, 2024

> Thanks! I don't see why sliding window attention shouldn't be supported with SDPA, since the only difference vs. the eager attention implementation is the attention mask. Passing arbitrary attention masks to SDPA should be supported without any problem, IMO.

I have the same question here: why doesn't SDPA support sliding window attention? Are there any unresolved problems? @ArthurZucker

@younesbelkada
Contributor

@ehuaa The way sliding window attention is implemented in the original Mistral code base is by changing the attention mask to a more custom mask that does not attend to tokens older than sliding_window. Check out the details of this method:


The point I tried to convey is that passing that attention mask is, I think, supported in SDPA, so you can implicitly get SDPA + sliding window attention just by passing the correct attention mask. Let me know if this makes sense to you!
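To make that concrete, here is a hypothetical sketch of building such a sliding-window causal mask and passing it to SDPA. This is an illustration only, not the mask-preparation code in transformers:

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len, window, dtype=torch.float32):
    # Each query position may attend to itself and at most `window - 1` previous
    # positions; future positions and positions older than the window get a very
    # large negative value (effectively -inf).
    q_idx = torch.arange(seq_len).unsqueeze(-1)  # (seq_len, 1)
    k_idx = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    allowed = (k_idx <= q_idx) & (k_idx > q_idx - window)
    mask = torch.zeros(seq_len, seq_len, dtype=dtype)
    return mask.masked_fill(~allowed, torch.finfo(dtype).min)

# Example: q, k, v of shape (batch, num_heads, seq_len, head_dim)
q = k = v = torch.randn(1, 8, 16, 64)
mask = sliding_window_causal_mask(seq_len=16, window=4)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```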

Contributor

ehuaa commented Feb 13, 2024

> @ehuaa The way sliding window attention is implemented in the original Mistral code base is by changing the attention mask to a more custom mask that does not attend to tokens older than sliding_window. Check out the details of this method:
>
> The point I tried to convey is that passing that attention mask is, I think, supported in SDPA, so you can implicitly get SDPA + sliding window attention just by passing the correct attention mask. Let me know if this makes sense to you!

@younesbelkada Thank you for your quick reply! Your solution above passes a custom mask to SDPA, and I think this is equivalent to passing the sliding_window param to this function:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L1006-L1023
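A hypothetical usage sketch of that helper, the _prepare_4d_causal_attention_mask utility from transformers.modeling_attn_mask_utils as it existed around the time of this PR; the exact signature may have changed since, so treat the call below as an assumption:

```python
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

batch_size, seq_len, hidden_size = 1, 8, 32
inputs_embeds = torch.zeros(batch_size, seq_len, hidden_size)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)  # no padding

# sliding_window folds the window restriction into the 4D causal mask that the
# model then hands to its attention layers (eager or SDPA alike).
mask_4d = _prepare_4d_causal_attention_mask(
    attention_mask,
    (batch_size, seq_len),
    inputs_embeds,
    0,                    # past_key_values_length
    sliding_window=4,
)
print(mask_4d.shape)  # torch.Size([1, 1, 8, 8])
```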
