Add FA2 & SDPA support for RoBERTa & XLM-RoBERTa #30450
tomaarsen wants to merge 4 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for this PR! Using SDPA saved a ton of memory for the GPU poor like me. Super grateful!
@tomaarsen do you need a review on this one?
@ArthurZucker Would be nice, though there are some conflicts now. I'll be off next week, so I'll be able to take care of the conflicts & any comments starting again on the 17th.
younesbelkada left a comment:
Looks pretty clean already, thanks a lot @tomaarsen! Can you make sure to propagate the changes into the encoders that copy from RoBERTa by running make fix-copies? You would also need to update this file: https://github.com/huggingface/transformers/blob/main/docs/source/en/perf_infer_gpu_one.md to mention RoBERTa and all other models that now support FA2 & SDPA.
You also need to fix the merge conflicts; they should be easy to resolve! 🙏
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
There is another similar PR, by the way: #30510
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello!
Pull Request overview
Details
The world of embedding models still very much relies on bert, roberta, xlm_roberta, mpnet, etc., but these model architectures have not yet received the benefits of FA2/SDPA. I'd like to make a start on that today.
I recognize that these models are tricky to change, as BERT especially is tangled in a big web of "Copied from" connections. However, I suspect that I've implemented FA2/SDPA in a way that can be extended to many other architectures. That said, I'd like to get reviews on the current implementation before I potentially expand to new architectures.
Most of the code is based on the Llama 2 FA2/SDPA implementation, so it should be fairly familiar. I want to note some limitations:
- output_attentions does not work for FA2/SDPA; this is fairly standard.
- head_mask does not work for FA2/SDPA.
- position_embedding_type with anything other than "absolute" (i.e., the default) does not work for FA2/SDPA.

Additionally, I have yet to write tests & I haven't tested all ways to use these models. Instead, I've only experimented with Sentence Transformers.
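One plausible way to honor these limitations is to fall back to the eager attention path at dispatch time. The sketch below uses hypothetical names (`select_attention_impl` and its parameters are illustrative, not the actual transformers internals), but it mirrors the pattern of dropping back to eager whenever an unsupported feature is requested:

```python
# Sketch of the fallback dispatch described above. Hypothetical helper,
# not the actual transformers internals.

def select_attention_impl(
    requested: str,                          # "flash_attention_2", "sdpa", or "eager"
    output_attentions: bool = False,         # caller wants attention weights back
    head_mask_given: bool = False,           # caller passed a per-head mask
    position_embedding_type: str = "absolute",
) -> str:
    """Return the attention implementation to actually use, falling back
    to "eager" whenever a feature FA2/SDPA cannot honor is requested."""
    if requested == "eager":
        return "eager"
    # Fused kernels never materialize the full attention matrix, so they
    # cannot return attention weights or apply a per-head mask.
    if output_attentions or head_mask_given:
        return "eager"
    # Relative position embeddings modify attention scores directly,
    # which the fused kernels do not support.
    if position_embedding_type != "absolute":
        return "eager"
    return requested
```

For example, `select_attention_impl("sdpa", output_attentions=True)` returns `"eager"`, while a plain `select_attention_impl("sdpa")` keeps the fused path.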
For a small RoBERTa-based model (https://huggingface.co/sentence-transformers/all-distilroberta-v1, 82M params), I get about a 10% speedup at one sample and a ~25% speedup at a large batch size with FA2 or SDPA. For a large XLM-RoBERTa-based model (https://huggingface.co/BAAI/bge-m3, 8192 sequence length), the speedup is up to 3x with FA2. Because newer embedding models use larger sequence lengths, FA2/SDPA will become increasingly important for them.
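The reason SDPA can be a drop-in replacement is that it computes the same result as the eager softmax-attention path, just without materializing the full attention matrix. A minimal sanity check of that equivalence, assuming PyTorch >= 2.0 (this uses only `torch.nn.functional.scaled_dot_product_attention`, not the PR's model code):

```python
# Sanity check: torch's fused SDPA matches a manual "eager" attention
# computation. Assumes PyTorch >= 2.0; shapes are arbitrary small values.
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq, dim = 2, 4, 16, 8
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

# Eager path: materialize the full (seq x seq) attention matrix.
scores = q @ k.transpose(-2, -1) / math.sqrt(dim)
eager_out = torch.softmax(scores, dim=-1) @ v

# Fused path: same math, attention matrix never materialized.
sdpa_out = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(eager_out, sdpa_out, atol=1e-5)
```

In user code, the implementation is then selected when loading a model, e.g. `AutoModel.from_pretrained(model_id, attn_implementation="sdpa")`.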
Who can review?
@ArthurZucker @younesbelkada
If I get a go-ahead, I can move forward with other architectures. Let me know if you'd like me to work on tests first, though. I'm also aware that the "copies" tests will currently fail due to these changes.