Use custom SDPA for decoder-only HF Transformers #46
guangy10 merged 3 commits into huggingface:main
Conversation
Will gate access to custom_sdpa in the 0.4.0 release.
@larryliu0820 Updated: The issue is that the eager attention implementation was using
Force-pushed from 8e9e3c2 to 1b2cb42
Transformers version bump has been merged in #47.
Force-pushed from 1b2cb42 to cadd829
cc: @larryliu0820 @kimishpatel for review
Can you change this: https://github.com/huggingface/optimum-executorch/blob/main/optimum/executorch/modeling.py#L181 to use the new Python API: https://pytorch.org/executorch/stable/index.html
@larryliu0820 Yeah, I'm going to do it in a separate PR. |
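For reference, a minimal sketch of what moving to the new ExecuTorch Python runtime API could look like; the `.pte` path, method name, and example input below are illustrative, not taken from this PR:

```python
import torch
from executorch.runtime import Runtime

# Get the runtime singleton and load the exported ExecuTorch program.
runtime = Runtime.get()
program = runtime.load_program("model.pte")  # illustrative path

# Load the entry point and execute it on an example input.
method = program.load_method("forward")
outputs = method.execute([torch.randint(0, 100, (1, 8), dtype=torch.long)])
print(outputs)
```

This Runtime/Program/Method flow would presumably replace the lower-level pybindings call currently at modeling.py#L181.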
Force-pushed from 754dd57 to e8f5263
Force-pushed from e8f5263 to eb2c840
Rebased and fixed conflicts |
@larryliu0820 @kimishpatel good to merge? |
Force-pushed from 14a6bbd to aab448f

Support export with custom_sdpa using the AttentionInterface. This requires transformers >= 4.51.0 in order to use the AttentionInterface, which has been addressed in Bump Transformers version #47. optimum-cli export executorch supports custom SDPA. 3x speedup using custom SDPA for HF smollm2 (XNNPACK fp32).
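A rough sketch of how a custom SDPA kernel plugs in through the Transformers AttentionInterface; the function name and model id below are illustrative, and plain torch SDPA stands in for the ExecuTorch custom_sdpa op:

```python
import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def my_sdpa_attention(module, query, key, value, attention_mask,
                      scaling=None, dropout=0.0, **kwargs):
    # query/key/value arrive as (batch, num_heads, seq_len, head_dim).
    # Handle grouped-query attention by repeating KV heads if needed.
    num_kv_groups = query.shape[1] // key.shape[1]
    if num_kv_groups > 1:
        key = key.repeat_interleave(num_kv_groups, dim=1)
        value = value.repeat_interleave(num_kv_groups, dim=1)

    # Plain torch SDPA stands in here for the ExecuTorch custom_sdpa op.
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, dropout_p=dropout, scale=scaling
    )
    # Transformers expects (batch, seq_len, num_heads, head_dim) back.
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, None

# Register under a name, then select it via attn_implementation.
AttentionInterface.register("my_custom_sdpa", my_sdpa_attention)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",  # illustrative model id
    attn_implementation="my_custom_sdpa",
)
```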

Generally applicable to all causal LMs. For encoder-decoder models, it may apply to the self-attention layers in the decoder; we can experiment with that in a follow-up PR.
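For completeness, a hedged usage sketch with the optimum-executorch Python API; the model id, prompt, and max_seq_len are illustrative, the recipe string follows the repo's XNNPACK examples, and the exact way custom SDPA is enabled may differ once the 0.4.0 gating mentioned above lands:

```python
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "HuggingFaceTB/SmolLM2-135M"  # illustrative model id

# Export on the fly (or load a cached export) with the XNNPACK recipe.
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")

tokenizer = AutoTokenizer.from_pretrained(model_id)
generated = model.text_generation(
    tokenizer=tokenizer,
    prompt="Simply put, custom SDPA",
    max_seq_len=64,
)
print(generated)
```

The same export can be produced ahead of time via `optimum-cli export executorch` as mentioned in the description.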