Remove mask slicing in all eager attentions #42186
Was this able to move forward? Thanks! |
|
Hey @justinchuby! This has been on standby a bit as a lot of errors popped up, because some old models do not create their masks correctly... |
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42186&sha=e2efac |
|
Finally ready now that #42848 was merged |
Wdyt about using 2 sources (because some do fp32 softmax and some don't) for eager to unify this? Talking about those that are not using modular atm --> use copied from to ensure we only modify these
Just a thought which might make maintenance easier + I don't have a good overview on which are independent atm
if attention_mask is not None:
    causal_mask = attention_mask[:, :, :, : key_layer.shape[-1]]
Damn nice catch! This one thought it could keep on living... 😈
|
[For maintainers] Suggested jobs to run (before merge) run-slow: afmoe, albert, align, apertus, arcee, aria, audio_spectrogram_transformer, audioflamingo3, bamba, bart, bert, bert_generation, bigbird_pegasus, biogpt, bitnet, blenderbot |
|
I agree that in general it would be very nice to unify a bit more the eager implementations which are all the same but use different variable names etc for the same things. |
vasqu
left a comment
Yup, it's fine that way - wanna run a few slow tests (run-slow) on important models for safety? Otherwise lgtm
|
run-slow: bert, gemma2, llama, mistral, mixtral |
|
This comment contains models: ["models/bert", "models/gemma2", "models/llama", "models/mistral", "models/mixtral"] |
* remove slice
* finalize
* remove shape check
* modular
* a few more
* a few tried to escape
What does this PR do?
As per the title. Following #41900.
The mask is (and should be!!) correctly prepared, with the correct shape. If not, then it's better to crash immediately, as otherwise this leads to very silent bugs!!
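To illustrate the point, here is a toy, pure-Python sketch of a single attention row. It is not the code from this PR: the function names `masked_weights_sliced` and `masked_weights_strict` are hypothetical, and the explicit length check only stands in for the shape error that the tensor addition would raise on a mismatched mask. It shows how slicing a wrongly sized mask quietly produces wrong attention weights, while the strict version fails right away.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_weights_sliced(scores, mask):
    # Toy version of the old pattern: truncate the mask to the key length.
    # A mask with the wrong length is accepted without any complaint.
    mask = mask[: len(scores)]
    return softmax([s + m for s, m in zip(scores, mask)])


def masked_weights_strict(scores, mask):
    # Toy version of the new pattern: no slicing. The explicit check below is
    # an assumption of this sketch; it mimics the shape error that adding
    # mismatched tensors would raise.
    if len(mask) != len(scores):
        raise ValueError(
            f"mask has {len(mask)} key positions, scores have {len(scores)}"
        )
    return softmax([s + m for s, m in zip(scores, mask)])


scores = [1.0, 1.0, 1.0]  # 3 real key positions, none should be masked
bad_mask = [NEG_INF, NEG_INF, 0.0, 0.0, 0.0]  # wrongly built for 5 keys

# Slicing keeps the first 3 columns, so two real tokens get masked out.
print(masked_weights_sliced(scores, bad_mask))  # [0.0, 0.0, 1.0]

# The strict version surfaces the bug immediately.
try:
    masked_weights_strict(scores, bad_mask)
except ValueError as err:
    print("crashed as intended:", err)
```

In the sliced case nothing fails, but all of the attention weight lands on the last key, which is exactly the kind of silent bug described above.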