Update Transformers to 4.35 and fix pad_to_multiple_of #4817
Millu merged 4 commits into invoke-ai:main from Malrama:update-transformers
Conversation
I haven't messed with this adding-tokens code myself. From my experience with the outputs of the text encoder, I can imagine that adding that extra padding either:
Do we have any tests that could tell us which? Automated unit tests are always preferred -- but this is in the middle of some LoRA-loading code, and we don't have a test suite that contains a full model and LoRA. If we can't figure out how to factor out that code to an easily-testable function, some manual tests would be reassuring.
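For context, the call under discussion looks roughly like this (a minimal sketch assuming the standard transformers API; the model IDs and variable names are illustrative, not Invoke's actual code):

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Loading a textual-inversion embedding adds its trigger token(s) to the
# tokenizer, growing the vocabulary past the encoder's embedding size...
tokenizer.add_tokens(["<my-ti-token>"])

# ...so the embedding matrix must be resized to match. Since Transformers
# 4.32.0, pad_to_multiple_of rounds the new size up so Tensor Cores can be
# used; omitting it produces the warning quoted later in this thread.
text_encoder.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
```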
I think this should probably be exposed to the user. We set the default to 8 but otherwise provide the option to the user to change it as per their card. In this particular case we are setting a custom value for the `pad_to_multiple_of` parameter. So maybe it should be exposed like @psychedelicious suggested. Chuck it in advanced settings in the UI.
I too would like this to be more resilient than a substring check of the device name. Ideally pytorch would expose some way to query this directly.
I have some sympathy for the "this will be a maintenance burden and people will have to wait for a new release" argument. But practically speaking, this is proposing adding a configuration setting for the hypothetical benefit of someone who's just bought a bleeding-edge $10,000 GPU, for the sake of some small performance optimization during text encoding -- at the expense of everyone else in the world whose Settings screen just got that much longer with an option that's going to be hard to document in a way that means anything to the kind of people who aren't buying new bleeding-edge server hardware.
I'm afraid the support burden of answering all the "hey what's this setting can I change it to make my GTX 1660 go faster" questions would outweigh that hypothetical maintenance burden. Anyone who is dropping that kind of money on server hardware is probably either already running their own fork, or can afford to sponsor a release.
[There's a hole in my argument if someone suddenly comes out with a new CUDA-capable GPU that's budget-priced and needs that number to be something other than 8.]
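As an aside, PyTorch does expose one relevant hook: the device's compute capability. A sketch of a capability-based check (the threshold choices here are assumptions, not anything this PR ships):

```python
import torch

def pick_pad_multiple(device: torch.device) -> int:
    """Choose a pad_to_multiple_of value from compute capability
    instead of substring-matching the device name."""
    if device.type == "cuda":
        major, minor = torch.cuda.get_device_capability(device)
        # sm_80 (A100/A30) is where NVIDIA suggests multiples of 64;
        # everything else falls back to the safe default of 8.
        if (major, minor) == (8, 0):
            return 64
    return 8
```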
To @keturn's point, I don't think this should be in advanced settings. If we're exposing this as a configuration, it should live in the configuration file, which is where the type of person working with this hardware would need to operate anyway. I do think it's important to test this thoroughly - and I do like the cleanliness of addressing/cleaning up the warning/info messages from non-ideal implementations.
If we feel this value won't change for the majority of the cards in the near future, I'd prefer just hardcoding it too. It's a difficult setting for people to know what the right value should be anyway. But if we do expose it, I think it lives in the UI because if it goes in the config file, changing it would mean restarting the backend which in this case is not necessary.
Hardcode it now, re-assess in the future?
I personally think this is a good fix for now, as a lot of stuff will likely get reworked in the future anyway.
Until proven otherwise, this seems like a thing where there is a correct value for your system. Not something you'd want to change regularly. So it makes sense to set it once in the config, requiring a restart. FWIW, there are many other config settings that can be changed at runtime, but which we don't expose in the UI.
Where did the values of 64 for A100 and 8 for everything else come from?
The nvidia docs for current up-to-date software say ...most efficient with multiples of 8; on A100, multiples of 64.
(archive.org link bc the nvidia site is 500ing for me)
Note that's not saying 8 and 64, it's saying somewhat vaguely multiples of those numbers. It goes on:
> We recommend choosing matrix dimensions to be multiples of 16 bytes (8 for FP16 as in Table 1); if this is not possible, choosing multiples of a smaller power of two (such as 8 or 4 bytes) often still helps performance with cuBLAS 11.0 and higher. On A100, choosing multiples of larger powers of two up to 128 bytes (64 for FP16) can further improve efficiency.
Note that now it says multiples of powers of two up to 64.
Then the author of the transformers PR is asked about the correct value for a 4090 and isn't sure, guessing 32. The questioner is referred to the HF forums.
There's no advice anywhere about FP32.
Overall, it's clear as mud which value is the right one.
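To make the candidates concrete, here is the arithmetic for CLIP's 49408-token vocabulary after one textual-inversion token is added (a worked example only; the helper function is hypothetical):

```python
import math

def pad_to_multiple(n: int, multiple: int) -> int:
    # Round n up to the nearest multiple.
    return math.ceil(n / multiple) * multiple

# CLIP's base vocabulary is 49408; adding one trigger token gives 49409.
print(pad_to_multiple(49409, 8))   # 49416
print(pad_to_multiple(49409, 64))  # 49472
```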
We will need to upgrade transformers sooner or later, so we need to figure this out; there just doesn't seem to be much info out there yet.
invokeai/app/services/config/invokeai_config.py would be the place to add a configuration setting for this - probably under the Device category.
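If it were exposed, the setting could look something like the sketch below (the field name, default, and grouping are assumptions modeled on the file's existing pydantic-style settings, not part of this PR):

```python
from pydantic import BaseModel, Field

class DeviceSettings(BaseModel):
    # Hypothetical setting; would live with the other Device-category
    # options in invokeai/app/services/config/invokeai_config.py.
    pad_to_multiple_of: int = Field(
        default=8,
        description=(
            "Pad the resized token-embedding dimension to a multiple of "
            "this value: 8 suits most GPUs, 64 is suggested for A100."
        ),
    )
```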
I would love to meet the user that's running Invoke with an A100. @Malrama would you be able to make the updates @psychedelicious suggested, and then we can get this in as hardcoded optimizations for now with an eye to revisit in the future?
I will try. I have to figure out how I am supposed to do it because GitHub is really new land for me and I get quite confused. Give me a little time. I assume I should only change the comment, but we keep it hardcoded like this?
@Millu Did I do it right?
Alright, sorry for the multiple commits. I think I'm done now unless I did something wrong or more changes need to be done.
@RyanJDick I'm not confident hardcoding this in case some other GPU platform needs a different value. Can you please review the nvidia/transformers docs and advise? It's straightforward to make the setting configurable. Should we do that in this PR or is it not a big deal either way?
Just did my best to catch up on this whole conversation. It feels very hand-wavy right now. There is guidance that suggests that setting the right padding value could improve performance, but there is still lots of uncertainty because we haven't actually profiled it on multiple cards. Questions I still have:
Given all of this confusion, I'd propose:
Updated the PR taking @RyanJDick's points into account.
Is there any benchmarking data for no padding vs padding=8 and other values? It seems like we're arguing over hypotheticals.
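No benchmarks appear in the thread; if anyone wants to gather some, a rough harness could look like this (everything here is an assumption: a CUDA device, an already-loaded text encoder, and prepared `input_ids`):

```python
import time
import torch

def time_text_encoder(text_encoder, input_ids, iters: int = 100) -> float:
    """Average seconds per forward pass; run once per candidate padding."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            text_encoder(input_ids)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```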
psychedelicious left a comment:
Thanks for seeing this through, @Malrama!
Well, that's the thing - we just want a value that works on all platforms. It looks like 8 is that number for now.
What type of PR is this? (check all applicable)
Have you discussed this change with the InvokeAI team?
Have you updated all relevant documentation?
Description
This PR updates Transformers to the most recent version and fixes the value of `pad_to_multiple_of` for `text_encoder.resize_token_embeddings`, which was introduced with huggingface/transformers#25088 in Transformers 4.32.0.
According to the Nvidia documentation:
> Performance is better when equivalent matrix dimensions M, N, and K are aligned to multiples of 8 bytes (or 64 bytes on A100) for FP16
This fixes the following warning that popped up before every invocation starting with Transformers 4.32.0:
> You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be None. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
This is my first "real" fix PR, so I hope this is fine. Please inform me if there is anything wrong with this. I am glad to help.
Have a nice day and thank you!
Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Added/updated tests?
No; tests have not been included.
[optional] Are there any post deployment tasks we need to perform?