Minor test addition for sdpa producing NaNs for pad tokens#40971
DuyguA wants to merge 3 commits into huggingface:main
[For maintainers] Suggested jobs to run (before merge) run-slow: bert
cc @Cyrilvallez for attention
Hey! Thanks for opening the PR. However, the issue you are referring to is very old and well known, and we've been working around it for quite some time already. Moreover, our tests run on the latest PyTorch, where the issue is already fixed, so the new test does not actually exercise anything.
Sure, no worries @Cyrilvallez 🙂 One tiny question: for some reason flash attention is not supported for BERT and T5. Would per-model PRs adding it be of interest?
Hey! BERT got refactored a few days ago in #38301 to support it. Still not the case for T5 though, if you want to try it out!
The issue was fixed on the PyTorch side. Still, I added a quick test to confirm that it is indeed resolved.
Fixes #31035
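For readers unfamiliar with this class of bug, here is a minimal sketch of one common way attention can yield NaNs for pad tokens. It assumes the cause is a row whose positions are all masked out, so the softmax sees only `-inf`. It is a pure-Python illustration, not the test added in this PR and not PyTorch's sdpa implementation; `masked_softmax` is a hypothetical helper name.

```python
import math


def masked_softmax(scores, mask):
    """Numerically stable softmax where positions with mask=False are set to -inf."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    row_max = max(masked)
    # If every position is masked, row_max is -inf and (-inf) - (-inf) is NaN,
    # so every exponent, the sum, and every output become NaN.
    exps = [math.exp(m - row_max) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


# A query that can attend to at least one key gets a valid distribution.
partial = masked_softmax([1.0, 2.0, 3.0], [True, True, False])

# A query whose keys are all masked (e.g. a pad token row) gets only NaNs.
fully_masked = masked_softmax([1.0, 2.0], [False, False])
```

A regression test for this behaviour would typically run a padded batch through the model and assert that the outputs contain no NaNs.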