FIX: setting requires_grad on adapter layers #905
Merged
pacman100 merged 9 commits into huggingface:main on Sep 26, 2023
Conversation
This is an alternative to huggingface#900, resolves huggingface#899.

Thanks @passaglia for figuring out the underlying issue.

Description

Currently, we don't handle setting `requires_grad` on adapter layers very well. The main issue is that it can be set to `True` on adapter parameters that are not being used, e.g. the `original_module` in `ModulesToSaveWrapper` or inactive adapters in LoRA.

Normally, this is not a big issue, except maybe if we want to correctly count the number of trainable parameters. However, when training with `DistributedDataParallel`, it results in errors, because PyTorch expects every parameter with `requires_grad=True` to participate in the loss computation, but those parameters don't. For that reason, training with DDP currently fails when using `modules_to_save` or multiple adapters.

Implementation

This turned out to be more complicated than I initially thought. The logic for setting `requires_grad` is spread all over the place; it was hard to encapsulate and I only succeeded partially. As is, this PR is more complex than the one it tries to supersede, huggingface#900, but it is also "more correct".

Tests were added to check whether `requires_grad` is set correctly. There are (so far) no tests for whether DDP itself works; those could be added with multi-GPU. I did, however, test an early stage of this PR with DDP, and setting `requires_grad` correctly does fix the DDP error.

DONE/TODO

- [x] ModulesToSaveWrapper
- [x] LoRA
- [ ] IA³
- [ ] AdaLora

Since some tuners are not implemented yet, tests are expected to fail. Check the new tests at the bottom of test_custom.py; those should pass.
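To make the failure mode concrete, here is a minimal sketch (not code from this PR) of how the stale flags can be observed: after switching to a second adapter, parameters belonging to the now-inactive adapter should no longer require gradients, but before this fix they could still report `requires_grad=True`, which is exactly what trips up DDP. The checkpoint and adapter names are arbitrary choices for illustration.

```python
# Minimal sketch (illustrative only, not code from this PR): list which LoRA
# parameters are still trainable after switching adapters. The checkpoint and
# adapter names are arbitrary.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(task_type=TaskType.CAUSAL_LM), adapter_name="default")
model.add_adapter("other", LoraConfig(task_type=TaskType.CAUSAL_LM))
model.set_adapter("other")  # "default" is now inactive

# Parameters of the inactive "default" adapter should have requires_grad=False.
# If any show up here, DDP expects them to receive gradients and errors out.
stale = [
    name
    for name, param in model.named_parameters()
    if "lora_" in name and ".default." in name and param.requires_grad
]
print(stale)  # expected to be empty once requires_grad is handled correctly
```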
The documentation is not available anymore as the PR was closed or merged.
pacman100 approved these changes on Sep 26, 2023
pacman100 (Contributor) left a comment:
Thank you @BenjaminBossan for fixing this major bug when using DDP/Multiple Adapters with PEFT. LGTM! 🤗
younesbelkada approved these changes on Sep 26, 2023
younesbelkada (Contributor) left a comment:
Thanks a mile @BenjaminBossan !
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request on May 13, 2025:
* [WIP] Fix setting requires_grad on adapter layers
* Refactor: move more requires_grad machinery to ABC
* [skip ci] [WIP] Add requires_grad logic to IA³
* Add AdaLora
* Fix some minor issues
* Make style
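The tests added at the bottom of test_custom.py assert that the `requires_grad` flags end up on the right parameters after operations like `set_adapter`. The snippet below is only a rough approximation of that kind of check, written against an assumed toy module and adapter names; it is not the PR's actual test code.

```python
# Rough approximation of a requires_grad check (not the PR's actual test code).
# The toy module, target_modules and adapter names are assumptions.
import torch.nn as nn
from peft import LoraConfig, get_peft_model


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin0 = nn.Linear(8, 8)
        self.lin1 = nn.Linear(8, 2)

    def forward(self, x):
        return self.lin1(self.lin0(x))


def assert_requires_grad(model, active, inactive):
    # Only the active adapter's parameters should be trainable; the inactive
    # adapter's parameters must stay frozen, otherwise DDP complains.
    for name, param in model.named_parameters():
        if f".{inactive}." in name:
            assert not param.requires_grad, f"{name} should be frozen"
        elif f".{active}." in name:
            assert param.requires_grad, f"{name} should be trainable"


model = get_peft_model(MLP(), LoraConfig(target_modules=["lin0"]), adapter_name="adapter0")
model.add_adapter("adapter1", LoraConfig(target_modules=["lin0"]))

model.set_adapter("adapter1")
assert_requires_grad(model, active="adapter1", inactive="adapter0")

model.set_adapter("adapter0")
assert_requires_grad(model, active="adapter0", inactive="adapter1")
```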