Conversation

@sayakpaul
Member

No description provided.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 9, 2023

The documentation is not available anymore as the PR was closed or merged.

We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and different batch sizes for five of our most used pipelines.
In the following tables, we report our findings in terms of the number of iterations processed per second.
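
For context, a minimal sketch of the kind of setup being benchmarked. This is illustrative only (the model ID and prompt are placeholders), assuming a recent diffusers release where PyTorch 2.0's scaled dot-product attention is the default attention backend:

```python
import torch
from diffusers import DiffusionPipeline

# Load one of the benchmarked pipelines; fp16 keeps GPU memory in check.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# PyTorch 2.0's efficient (scaled dot-product) attention is picked up
# automatically; torch.compile adds graph-level fusion on top of it.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call pays the compilation cost, so steady-state iterations
# per second should be measured on subsequent calls.
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```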

### A100 (batch size: 1)
Contributor

Let's also add RTX 4090 and T4, no?

Think people will be quite interested in the home GPUs here.

Member Author

There is a reason why the PR is in the draft mode :-)

@sayakpaul sayakpaul marked this pull request as ready for review May 9, 2023 15:23
@sayakpaul
Member Author

@patrickvonplaten @pcuenca this is now ready for review!

Member

@pcuenca pcuenca left a comment

Awesome! Do you want me to regenerate some of the plots we used for the blog post? Let me know if you'd like to use any of them here and I'll prepare them.

```diff
-## Using accelerated transformers and torch.compile.
+## Using accelerated transformers and `torch.compile`.
```
Member

I think the backquotes were not properly shown in headings when docs were generated? Let's just keep an eye on it and remove them if we see anything weird after merging :)

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
@sayakpaul
Member Author

> Do you want me to regenerate some of the plots we used for the blog post?

Sure, that'd be very helpful!

Contributor

@patrickvonplaten patrickvonplaten left a comment

Love it 😍 Super clean and nicely written.
I'm sure the community will love this!

Some graphs could definitely help (maybe we could use those as well to update the official PyTorch blog post, cc @pcuenca).

sayakpaul and others added 2 commits May 12, 2023 09:50
Co-authored-by: Pedro <pedro@huggingface.co>
@sayakpaul sayakpaul requested a review from pcuenca May 12, 2023 04:41
@sayakpaul
Member Author

For the failing test: #3397 (comment)

Member

@pcuenca pcuenca left a comment

Amazing, thanks a lot @sayakpaul!

```diff
-Depending on the type of GPU, `compile()` can yield between 2-9% of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
+Depending on the type of GPU, `compile()` can yield between **3% - 56%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
```
Member

It's actually much more than that, especially in the case of IF. I think the percent computation was wrong in the Excel sheet, so I changed it. For example, on A100 (batch size 1), txt2img in the stable version goes from 21.66 it/s to 44.03 it/s, which is a bit more than double. Hence, the percent improvement is 103%. Can you please double-check, @sayakpaul?

We could for example say that we get up to twice as many iterations per second, or almost 5x in the case of IF Stage I.

I'm also curious and surprised that the improvement is so big (especially in IF)! Do you have any insight on that?

Member Author

I did `100 * (b - a) / b` following https://docs.google.com/spreadsheets/d/1LrltKSgZyOZiLQ7n7_GvIl_BoED-AHeVOJYBc_QqxXA/edit#gid=0

> I'm also curious and surprised that the improvement is so big (especially in IF)! Do you have any insight on that?

It's probably just better suited for tiling and fusion, and collectively those might improve the overall arithmetic intensity while minimizing memory transfers. But we'd need to profile to say for sure.
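
For clarity, a quick numeric check of the two formulas discussed above, using the A100 txt2img numbers quoted earlier (21.66 → 44.03 it/s):

```python
a, b = 21.66, 44.03  # baseline vs. compiled iterations/second (A100, batch size 1)

# What the spreadsheet originally computed: the gap as a fraction of the *new* value.
print(100 * (b - a) / b)  # ~50.8 -> understates the gain

# Relative improvement over the *baseline*, the conventional definition.
print(100 * (b - a) / a)  # ~103.3 -> "a bit more than double"
```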

Member Author

You're actually right. Let me edit and merge.

@pcuenca
Member

pcuenca commented May 12, 2023

We may want to update this section in a follow up PR: https://huggingface.co/docs/diffusers/stable_diffusion#next-steps. Writing it down here so I don't forget.

@patrickvonplaten
Contributor

Cool, should we merge this one?

@sayakpaul
Member Author

Waiting for @pcuenca to clarify:

#3370 (comment)

@sayakpaul sayakpaul merged commit bdefabd into main May 13, 2023
@sayakpaul sayakpaul deleted the update-pt2-docs branch May 13, 2023 09:42