[Docs] update the PT 2.0 optimization doc with latest findings #3370

Conversation
The documentation is not available anymore as the PR was closed or merged.
> We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and different batch sizes for five of our most used pipelines.
> In the following tables, we report our findings in terms of the number of iterations processed per second.
>
> ### A100 (batch size: 1)
Let's also add RTX 4090 and T4, no? I think people will be quite interested in the home GPUs here.
There is a reason why the PR is in draft mode :-)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
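As background on the hunk quoted above: PyTorch 2.0's efficient attention is the fused `torch.nn.functional.scaled_dot_product_attention` kernel. A minimal sketch of calling it directly (the tensor shapes here are illustrative, not taken from the benchmark):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dim).
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# PyTorch dispatches to a fused implementation (FlashAttention or
# memory-efficient attention) when the inputs allow it.
out = F.scaled_dot_product_attention(q, k, v)
```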
@patrickvonplaten @pcuenca this is now ready for review!
pcuenca left a comment:
Awesome! Do you want me to regenerate some of the plots we used for the blog post? Let me know if you'd like to use any of them here and I'll prepare them.
```diff
-## Using accelerated transformers and torch.compile.
+## Using accelerated transformers and `torch.compile`.
```
I think the backquotes were not properly shown in headings when the docs were generated? Let's just keep an eye on it and remove them if we see anything weird after merging :)
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
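For readers skimming the thread: the section being edited documents compiling the pipeline's UNet with `torch.compile`. A minimal sketch of that pattern (the checkpoint ID and prompt are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# Illustrative checkpoint; any Stable Diffusion model works the same way.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile the UNet, the main compute bottleneck of the denoising loop.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call triggers compilation; subsequent calls reuse the compiled graph.
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```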
Sure, that'd be very helpful!
patrickvonplaten left a comment:
Love it 😍 Super clean and nicely written.
I'm sure the community will love this!
Some graphs could definitely help (maybe we could use those as well to update the official PyTorch blog post, cc @pcuenca).
Co-authored-by: Pedro <pedro@huggingface.co>
For the failing test: #3397 (comment)
pcuenca left a comment:
Amazing, thanks a lot @sayakpaul!
```diff
-Depending on the type of GPU, `compile()` can yield between 2-9% of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
+Depending on the type of GPU, `compile()` can yield between **3% - 56%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
```
It's actually much more than that, especially in the case of IF. I think the percent computation was wrong in the Excel sheet, so I changed it. For example, on A100 (batch size 1), txt2img in the stable version goes from 21.66 it/s to 44.03 it/s, which is a bit more than double, so the percent improvement is 103%. Can you please double-check, @sayakpaul?
We could, for example, say that we get up to twice as many iterations per second, or almost 5x in the case of IF Stage I.
I'm also curious and surprised that the improvement is so big (especially in IF)! Do you have any insight on that?
I computed `100 * (b - a) / b`, following https://docs.google.com/spreadsheets/d/1LrltKSgZyOZiLQ7n7_GvIl_BoED-AHeVOJYBc_QqxXA/edit#gid=0

> I'm also curious and surprised that the improvement is so big (especially in IF)! Do you have any insight on that?

It's probably just better suited for tiling and fusion, which together might improve the overall arithmetic density while minimizing memory transfers. But we'd need to profile to say for sure.
You're actually right. Let me edit and merge.
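To make the discrepancy concrete, a quick sketch contrasting the two formulas with the A100 txt2img numbers quoted above:

```python
baseline = 21.66  # it/s without compilation (quoted in the thread)
compiled = 44.03  # it/s with torch.compile

# Spreadsheet formula: change measured relative to the *new* value.
pct_of_new = 100 * (compiled - baseline) / compiled   # ~50.8%

# Conventional percent improvement: change relative to the *baseline*.
pct_of_old = 100 * (compiled - baseline) / baseline   # ~103.3%

print(f"{pct_of_new:.1f}% vs. {pct_of_old:.1f}%")
```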
We may want to update this section in a follow-up PR: https://huggingface.co/docs/diffusers/stable_diffusion#next-steps. Writing it down here so I don't forget.
Cool, should we merge this one?
Waiting for @pcuenca to clarify: