Update Transformers to 4.35 and fix pad_to_multiple_of#4817

Merged
Millu merged 4 commits into invoke-ai:main from Malrama:update-transformers
Nov 10, 2023

Conversation

@Malrama
Contributor

@Malrama Malrama commented Oct 6, 2023

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Community Node Submission

Have you discussed this change with the InvokeAI team?

Have you updated all relevant documentation?

  • Yes
  • No

Description

This PR updates Transformers to the most recent version and sets the pad_to_multiple_of value for text_encoder.resize_token_embeddings, a parameter that was introduced with huggingface/transformers#25088 in Transformers 4.32.0.

According to the Nvidia documentation, performance is better when equivalent matrix dimensions M, N, and K are aligned to multiples of 8 bytes (or 64 bytes on A100) for FP16.
This fixes the following warning, which appeared before every invocation starting with Transformers 4.32.0:
"You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be None. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc"
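For illustration, the rounding that pad_to_multiple_of applies can be sketched in a few lines (a minimal sketch of the behavior, not code from this PR; the vocab sizes below are just examples):

```python
def padded_vocab_size(num_tokens: int, pad_to_multiple_of: int) -> int:
    # Round the embedding count up to the nearest multiple, which is what
    # resize_token_embeddings does when pad_to_multiple_of is passed.
    return -(-num_tokens // pad_to_multiple_of) * pad_to_multiple_of

# e.g. CLIP's 49408-token vocab plus 2 textual-inversion tokens:
print(padded_vocab_size(49408 + 2, 8))   # -> 49416
print(padded_vocab_size(49408 + 2, 64))  # -> 49472
```

With padding to a multiple of 8, the embedding matrix rows stay aligned for FP16 Tensor Core matmuls; the extra rows are never indexed by real token ids.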

This is my first "real" fix PR, so I hope this is fine. Please inform me if there is anything wrong with this. I am glad to help.

Have a nice day and thank you!

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Added/updated tests?

  • Yes
  • No : please replace this line with details on why tests
    have not been included

[optional] Are there any post deployment tasks we need to perform?

@keturn
Contributor

keturn commented Oct 6, 2023

I haven't messed with this adding-tokens code myself. From my experience with the outputs of the text encoder, I can imagine that adding that extra padding either:

  1. increases memory usage by a relatively small amount, and then has absolutely zero impact on the results.
  2. or maybe the padding isn't invisible to the following operations and this makes the results different (in a way that probably isn't better because the padding is neither matching up with the model's training nor containing any new information).

Do we have any tests that could tell us which?

Automated unit tests are always preferred -- but this is in the middle of some LoRA-loading code, and we don't have a test suite that contains a full model and LoRA. If we can't figure out how to factor out that code to an easily-testable function, some manual tests would be reassuring.

@blessedcoolant
Collaborator

I think this should probably be exposed to the user. We set the default to 8 but otherwise provide the option to the user to change it as per their card.

In this particular case we are setting a custom value for the A100 as per the docs but when there are other card families that require different settings, then we'll need to constantly keep this updated and a user will have to wait on us to support it.

So maybe it should be exposed like @psychedelicious suggested. Chuck it in advanced settings in the UI.

@keturn
Contributor

keturn commented Oct 6, 2023

then we'll need to constantly keep this updated and a user will have to wait on us to support it.

I too would like this to be more resilient than a substring check of the device name. Ideally pytorch would expose some torch.cuda.tensor_core_alignment_size number or something like that and we'd use that. But since we don't have that, and reading through the referenced nvidia docs didn't give me any bright ideas about how to come up with it...

I have some sympathy for the "this will be a maintenance burden and people will have to wait for a new release" argument. But practically speaking, this is proposing adding a configuration setting for the hypothetical benefit of someone who's just bought a bleeding-edge $10,000 GPU, for the sake of some small performance optimization during text encoding -- at the expense of everyone else in the world whose Settings screen just got that much longer with an option that's going to be hard to document in a way that means anything to the kind of people who aren't buying new bleeding-edge server hardware.

I'm afraid the support burden of answering all the "hey what's this setting can I change it to make my GTX 1660 go faster" would outweigh that hypothetical maintenance burden.

Anyone who is dropping that kind of money on server hardware is probably either already running their own fork, or can afford to sponsor a release.

[There's a hole in my argument if someone suddenly comes out with a new CUDA-capable GPU that's budget-priced and needs that number to be something other than 8, but I don't see that happening anytime soon.]

@hipsterusername
Member

I think this should probably be exposed to the user. We set the default to 8 but otherwise provide the option to the user to change it as per their card.

In this particular case we are setting a custom value for the A100 as per the docs but when there are other card families that require different settings, then we'll need to constantly keep this updated and a user will have to wait on us to support it.

So maybe it should be exposed like @psychedelicious suggested. Chuck it in advanced settings in the UI.

To @keturn 's point, I don't think this should be in advanced settings. If we're exposing this as a configuration, it should live in the configuration file, which is where the type of person that is working w/ this hardware would need to operate anyways.

I do think it's important to test this thoroughly - I do like the cleanliness of addressing/cleaning up the warning/info messages of non-ideal implementations.

@blessedcoolant
Collaborator

If we feel this value won't change for the majority of the cards in the near future, I'd prefer just hardcoding it too. It's a difficult setting for people to know what the right value should be anyway.

But if we do expose it, I think it lives in the UI because if it goes in the config file, changing it would mean restarting the backend which in this case is not necessary.

@hipsterusername
Member

Hardcode it now, re-assess in the future?

@Malrama
Contributor Author

Malrama commented Oct 6, 2023

I personally think this is a good fix for now, as a lot of stuff will likely get reworked in the future anyway.

@psychedelicious
Contributor

Until proven otherwise, this seems like a thing where there is a correct value for your system. Not something you'd want to change regularly. So it makes sense to set it once in the config, requiring a restart.

FWIW, there are many other config settings that can be changed at runtime, but which we don't expose in the UI.

Contributor

@psychedelicious psychedelicious left a comment


Where did the values of 64 from A100 and 8 for everything else come from?

The nvidia docs for current up-to-date software say ...most efficient with multiples of 8; on A100, multiples of 64.

(archive.org link because the nvidia site is 500ing for me)

Note that's not saying 8 and 64, it's saying somewhat vaguely multiples of those numbers. It goes on:

We recommend choosing matrix dimensions to be multiples of 16 bytes (8 for FP16 as in Table 1); if this is not possible, choosing multiples of a smaller power of two (such as 8 or 4 bytes) often still helps performance with cuBLAS 11.0 and higher. On A100, choosing multiples of larger powers of two up to 128 bytes (64 for FP16) can further improve efficiency.

Note that now it says multiples of powers of two up to 64.

Then the author of the transformers PR is asked about the correct value for a 4090 and isn't sure, guessing 32. The questioner is referred to the HF forums.

There's no advice anywhere about FP32.

Overall, it's clear as mud which value is the right one.

We will need to upgrade transformers sooner or later, so we need to figure this out, just doesn't seem to be much info out there yet.

invokeai/app/services/config/invokeai_config.py would be the place to add a configuration setting for this - probably under the Device category.

Comment thread invokeai/backend/model_management/lora.py Outdated
@Millu
Contributor

Millu commented Nov 3, 2023

I would love to meet the user that's running Invoke with an A100.

@Malrama would you be able to make the updates @psychedelicious suggested, and then we can get this in as hardcoded optimizations for now, with an eye to revisiting in the future?

@Malrama
Contributor Author

Malrama commented Nov 4, 2023

I would love to meet the user that's running Invoke with an A100.

@Malrama would you be able to make the updates @psychedelicious suggested, and then we can get this in as hardcoded optimizations for now, with an eye to revisiting in the future?

I will try. I have to figure out how I'm supposed to do it, because GitHub is really new territory for me and I get quite confused. Give me a little time. I assume I should only change the comment, but we keep it hardcoded like this?

@Malrama
Contributor Author

Malrama commented Nov 4, 2023

@Millu Did I do it right?

@Malrama Malrama changed the title from "Update Transformers to 4.34 and fix pad_to_multiple_of" to "Update Transformers to 4.35 and fix pad_to_multiple_of" Nov 4, 2023
@Malrama
Contributor Author

Malrama commented Nov 4, 2023

Alright, sorry for the multiple commits. I think I'm done now unless I did something wrong or more changes need to be done.

@psychedelicious
Contributor

@RyanJDick I'm not confident hardcoding this in case some other GPU platform needs a different value. Can you please review the nvidia/transformers docs and advise?

It's straightforward to make the setting configurable. Should we do that in this PR or is it not a big deal either way?

@RyanJDick
Contributor

@RyanJDick I'm not confident hardcoding this in case some other GPU platform needs a different value. Can you please review the nvidia/transformers docs and advise?

It's straightforward to make the setting configurable. Should we do that in this PR or is it not a big deal either way?

Just did my best to catch up on this whole conversation. It feels very hand-wavy right now. There is guidance that suggests that setting the right padding value could improve performance, but there is still lots of uncertainty because we haven't actually profiled it on multiple cards. Questions I still have:

  • How much speedup does this provide?
  • Does the optimal setting depend on factors other than the card (e.g. driver version or dtype)?
  • Does the time cost of padding counteract the benefit in some cases?

Given all of this confusion, I'd propose:

  1. Set it to 8 in all cases.
  2. Include a comment summarizing all of this confusion.
  3. Only add device-specific configs once we can clearly prove that they are better. (Otherwise we just end up with hard-coded values that people are afraid to change/break in the future.)
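Concretely, points 1 and 2 might look something like this sketch (the helper and constant names are hypothetical, not the actual InvokeAI lora.py code, though resize_token_embeddings and its pad_to_multiple_of parameter are real Transformers API as of 4.32):

```python
# In all cases, pad the resized embedding matrix to a multiple of 8.
# Rationale (summarizing the discussion above): NVIDIA's matmul guide says
# FP16 Tensor Cores want dimensions that are multiples of 8 (larger powers
# of two, up to 64, may help further on A100), but there is no clear
# per-card guidance and we have not profiled it, so 8 is the safe
# universal choice until device-specific values are proven better.
PAD_TO_MULTIPLE_OF = 8

def resize_text_encoder(text_encoder, new_num_tokens: int):
    # text_encoder is assumed to be a transformers PreTrainedModel.
    return text_encoder.resize_token_embeddings(
        new_num_tokens, pad_to_multiple_of=PAD_TO_MULTIPLE_OF
    )
```

Keeping the value in one named constant with the comment attached makes it easy to revisit later without hunting through the LoRA-loading code.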

@Malrama
Contributor Author

Malrama commented Nov 8, 2023

Updated the PR, taking @RyanJDick's points into account.

@lstein
Collaborator

lstein commented Nov 8, 2023

Is there any benchmarking data for no padding vs padding=8 and other values? It seems like we're arguing over hypotheticals.

@psychedelicious psychedelicious self-requested a review November 9, 2023 05:44
Contributor

@psychedelicious psychedelicious left a comment


Thanks for seeing this through @Malrama !

@psychedelicious
Contributor

@lstein

Is there any benchmarking data for no padding vs padding=8 and other values? It seems like we're arguing over hypotheticals.

Well, that's the thing - we just want a value that works on all platforms. It looks like 8 is that number for now.

Collaborator

@lstein lstein left a comment


LGTM

@Millu Millu merged commit d63a614 into invoke-ai:main Nov 10, 2023
@Malrama Malrama deleted the update-transformers branch November 11, 2023 07:46