Describe the bug
The attention slicing function, described in the "Sliced attention for additional memory savings" section of https://huggingface.co/docs/diffusers/optimization/fp16, is very outdated, and we should probably remove that section from the optimization docs.
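For reference, the pattern that doc section promotes looks roughly like the following (a minimal sketch; the model ID and dtype are illustrative, not taken from the doc):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pipeline in half precision (model ID chosen for illustration only).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The doc section in question recommends this call, which trades speed for
# memory by computing attention in slices instead of all at once.
pipe.enable_attention_slicing()

image = pipe("a photo of an astronaut riding a horse").images[0]
```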
Around 90% of users rely on either xformers or PyTorch 2.0 with SDPA. In that case the attention computation is dispatched to flash attention, which is already very memory efficient, so attention slicing:
a) makes no sense, because the attention computation is no longer the memory bottleneck, and
b) can lead to serious slowdowns.
Only in edge cases does it still make sense to use attention slicing (see the sketch below for what most users should run instead).
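For comparison, a sketch of the setup most users have today, where memory-efficient attention comes from SDPA (the default with PyTorch 2.0+) or xformers and attention slicing is left disabled (model ID again illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# With PyTorch 2.0+, diffusers dispatches attention to
# torch.nn.functional.scaled_dot_product_attention by default, so nothing
# extra is needed here. Alternatively, with xformers installed:
# pipe.enable_xformers_memory_efficient_attention()

# Do NOT also call pipe.enable_attention_slicing(): slicing an already
# memory-efficient attention saves nothing and only adds overhead.

image = pipe("a photo of an astronaut riding a horse").images[0]
```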
Reproduction
N/A
Logs
No response
System Info
N/A
Who can help?
@stevhliu @sayakpaul can we remove attention slicing from this highly read doc page and, in general, add a disclaimer to the function that it should not be used in combination with SDPA or xformers?
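One possible shape for that disclaimer, sketched as a hypothetical runtime warning (the helper name and check are illustrative; the real change would live inside `enable_attention_slicing` in diffusers):

```python
import warnings


def warn_if_slicing_with_efficient_attention(uses_sdpa_or_xformers: bool) -> None:
    # Hypothetical helper for illustration only; not the actual diffusers code.
    if uses_sdpa_or_xformers:
        warnings.warn(
            "Attention slicing is unnecessary when attention is already "
            "dispatched to SDPA or xformers flash attention, and it can "
            "cause significant slowdowns.",
            UserWarning,
        )
```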
I'm opening this issue because a user on Discord stumbled upon it.