Update Transformers to 4.35 and fix pad_to_multiple_of #4817
Millu merged 4 commits into invoke-ai:main from Malrama:update-transformers
Conversation
I haven't messed with this adding-tokens code myself. From my experience with the outputs of the text encoder, I can imagine that adding that extra padding either:
Do we have any tests that could tell us which? Automated unit tests are always preferred -- but this is in the middle of some LoRA-loading code, and we don't have a test suite that contains a full model and LoRA. If we can't figure out how to factor out that code to an easily-testable function, some manual tests would be reassuring.
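For context, the call under discussion looks roughly like this (a minimal sketch assuming the standard transformers API; the model IDs and variable names are illustrative, not Invoke's actual code):

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Loading a textual-inversion embedding adds its trigger token(s) to the
# tokenizer, growing the vocabulary past the encoder's embedding size...
tokenizer.add_tokens(["<my-ti-token>"])

# ...so the embedding matrix must be resized to match. Since Transformers
# 4.32.0, pad_to_multiple_of rounds the new size up so Tensor Cores can be
# used; omitting it produces the warning quoted later in this thread.
text_encoder.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
```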
I think this should probably be exposed to the user. We set the default to 8 but otherwise provide the option to the user to change it as per their card. In this particular case we are setting a custom value for the `pad_to_multiple_of` parameter. So maybe it should be exposed like @psychedelicious suggested. Chuck it in advanced settings in the UI.
I too would like this to be more resilient than a substring check of the device name. Ideally pytorch would expose some way to query this directly.
I have some sympathy for the "this will be a maintenance burden and people will have to wait for a new release" argument. But practically speaking, this is proposing adding a configuration setting for the hypothetical benefit of someone who's just bought a bleeding-edge $10,000 GPU, for the sake of some small performance optimization during text encoding -- at the expense of everyone else in the world whose Settings screen just got that much longer with an option that's going to be hard to document in a way that means anything to the kind of people who aren't buying new bleeding-edge server hardware.
I'm afraid the support burden of answering all the "hey what's this setting can I change it to make my GTX 1660 go faster" questions would outweigh that hypothetical maintenance burden. Anyone who is dropping that kind of money on server hardware is probably either already running their own fork, or can afford to sponsor a release.
[There's a hole in my argument if someone suddenly comes out with a new CUDA-capable GPU that's budget-priced and needs that number to be something other than 8.]
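As an aside, PyTorch does expose one relevant hook: the device's compute capability. A sketch of a capability-based check (the threshold choices here are assumptions, not anything this PR ships):

```python
import torch

def pick_pad_multiple(device: torch.device) -> int:
    """Choose a pad_to_multiple_of value from compute capability
    instead of substring-matching the device name."""
    if device.type == "cuda":
        major, minor = torch.cuda.get_device_capability(device)
        # sm_80 (A100/A30) is where NVIDIA suggests multiples of 64;
        # everything else falls back to the safe default of 8.
        if (major, minor) == (8, 0):
            return 64
    return 8
```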
To @keturn's point, I don't think this should be in advanced settings. If we're exposing this as a configuration, it should live in the configuration file, which is where the type of person working with this hardware would need to operate anyway. I do think it's important to test this thoroughly - and I do like the cleanliness of addressing/cleaning up the warning/info messages from non-ideal implementations.
If we feel this value won't change for the majority of the cards in the near future, I'd prefer just hardcoding it too. It's a difficult setting for people to know what the right value should be anyway. But if we do expose it, I think it lives in the UI because if it goes in the config file, changing it would mean restarting the backend which in this case is not necessary.
Hardcode it now, re-assess in the future?
I personally think this is a good fix for now, as a lot of stuff will likely get reworked in the future anyway.
Until proven otherwise, this seems like a thing where there is a correct value for your system. Not something you'd want to change regularly. So it makes sense to set it once in the config, requiring a restart. FWIW, there are many other config settings that can be changed at runtime, but which we don't expose in the UI.
Where did the values of 64 for A100 and 8 for everything else come from?
The nvidia docs for current up-to-date software say ...most efficient with multiples of 8; on A100, multiples of 64.
(archive.org link bc the nvidia site is 500ing for me)
Note that's not saying 8 and 64, it's saying somewhat vaguely multiples of those numbers. It goes on:
> We recommend choosing matrix dimensions to be multiples of 16 bytes (8 for FP16 as in Table 1); if this is not possible, choosing multiples of a smaller power of two (such as 8 or 4 bytes) often still helps performance with cuBLAS 11.0 and higher. On A100, choosing multiples of larger powers of two up to 128 bytes (64 for FP16) can further improve efficiency.
Note that now it says multiples of powers of two up to 64.
Then the author of the transformers PR is asked about the correct value for a 4090 and isn't sure, guessing 32. The questioner is referred to the HF forums.
There's no advice anywhere about FP32.
Overall, it's clear as mud which value is the right one.
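To make the candidates concrete, here is the arithmetic for CLIP's 49408-token vocabulary after one textual-inversion token is added (a worked example only; the helper function is hypothetical):

```python
import math

def pad_to_multiple(n: int, multiple: int) -> int:
    # Round n up to the nearest multiple.
    return math.ceil(n / multiple) * multiple

# CLIP's base vocabulary is 49408; adding one trigger token gives 49409.
print(pad_to_multiple(49409, 8))   # 49416
print(pad_to_multiple(49409, 64))  # 49472
```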
We will need to upgrade transformers sooner or later, so we need to figure this out; there just doesn't seem to be much info out there yet.
invokeai/app/services/config/invokeai_config.py would be the place to add a configuration setting for this - probably under the Device category.
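If it were exposed, the setting could look something like the sketch below (the field name, default, and grouping are assumptions modeled on the file's existing pydantic-style settings, not part of this PR):

```python
from pydantic import BaseModel, Field

class DeviceSettings(BaseModel):
    # Hypothetical setting; would live with the other Device-category
    # options in invokeai/app/services/config/invokeai_config.py.
    pad_to_multiple_of: int = Field(
        default=8,
        description=(
            "Pad the resized token-embedding dimension to a multiple of "
            "this value: 8 suits most GPUs, 64 is suggested for A100."
        ),
    )
```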
I would love to meet the user that's running Invoke with an A100. @Malrama would you be able to make the updates @psychedelicious suggested, and then we can get this in as hardcoded optimizations for now with an eye to revisit in the future?
I will try. I have to figure out how I am supposed to do it because GitHub is really new land for me and I get quite confused. Give me a little time. I assume I should only change the comment, but we keep it hardcoded like this?
@Millu Did I do it right?
Alright, sorry for the multiple commits. I think I'm done now unless I did something wrong or more changes need to be done.
@RyanJDick I'm not confident hardcoding this in case some other GPU platform needs a different value. Can you please review the nvidia/transformers docs and advise? It's straightforward to make the setting configurable. Should we do that in this PR or is it not a big deal either way?
Just did my best to catch up on this whole conversation. It feels very hand-wavy right now. There is guidance that suggests that setting the right padding value could improve performance, but there is still lots of uncertainty because we haven't actually profiled it on multiple cards. Questions I still have:
Given all of this confusion, I'd propose:
Updated the PR taking @RyanJDick's points into account.
Is there any benchmarking data for no padding vs padding=8 and other values? It seems like we're arguing over hypotheticals.
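No benchmarks appear in the thread; if anyone wants to gather some, a rough harness could look like this (everything here is an assumption: a CUDA device, an already-loaded text encoder, and prepared `input_ids`):

```python
import time
import torch

def time_text_encoder(text_encoder, input_ids, iters: int = 100) -> float:
    """Average seconds per forward pass; run once per candidate padding."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            text_encoder(input_ids)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```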
psychedelicious left a comment:
Thanks for seeing this through, @Malrama!
Well, that's the thing - we just want a value that works on all platforms. It looks like 8 is that number for now.
What type of PR is this? (check all applicable)
Have you discussed this change with the InvokeAI team?
Have you updated all relevant documentation?
Description
This PR updates Transformers to the most recent version and fixes the value of `pad_to_multiple_of` for `text_encoder.resize_token_embeddings`, which was introduced with huggingface/transformers#25088 in Transformers 4.32.0.
According to the Nvidia documentation:
> Performance is better when equivalent matrix dimensions M, N, and K are aligned to multiples of 8 bytes (or 64 bytes on A100) for FP16
This fixes the following warning that popped up before every invocation starting with Transformers 4.32.0:
> You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be None. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
This is my first "real" fix PR, so I hope this is fine. Please inform me if there is anything wrong with this. I am glad to help.
Have a nice day and thank you!
Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Added/updated tests?
No; tests have not been included.
[optional] Are there any post deployment tasks we need to perform?