[Mixtral & Mistral] Add support for SDPA #28133
Conversation
Thanks!
I don't see why sliding window attention shouldn't be supported with SDPA, because the only difference from the eager attention implementation is in the attention mask. Passing arbitrary attention masks to SDPA should be supported without any problem, IMO.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I have the same problem here: why doesn't SDPA support window attention? Are there any problems that haven't been solved yet? @ArthurZucker
@ehuaa the way the window attention is implemented in the Mistral original code base is by changing the attention mask to a "more custom" attention mask that does not attend to tokens outside the sliding window. The point I tried to convey is that passing such an attention mask is, I think, supported in SDPA, so you can implicitly get SDPA + sliding window attention just by passing that correct attention mask. Let me know if this makes sense to you!
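For illustration, a minimal sketch (not part of this PR; all names are illustrative) of building such a sliding-window causal mask and passing it explicitly to `torch.nn.functional.scaled_dot_product_attention`:

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal AND within `window` previous tokens.
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]
    in_window = idx[:, None] - idx[None, :] < window
    return causal & in_window

seq_len, window, head_dim = 8, 4, 16
q = torch.randn(1, 1, seq_len, head_dim)
k = torch.randn(1, 1, seq_len, head_dim)
v = torch.randn(1, 1, seq_len, head_dim)

# Boolean mask of shape (seq_len, seq_len); True entries are attended to.
mask = sliding_window_causal_mask(seq_len, window)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```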
@younesbelkada Thank you for your quick reply! Your solution above passes a custom mask to SDPA, and I think that is equivalent to passing the sliding_window param to this function.

What does this PR do?
Adds SDPA attention for both classes. cc @younesbelkada for visibility 😉 Will help for fast LLaVA.
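A minimal usage sketch, assuming a transformers version that includes this change; the checkpoint name is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",  # select the SDPA attention path
    torch_dtype="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```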