
Trainable Tokens: Support for Weight Tying #2399

Merged
githubnemo merged 21 commits into huggingface:main from githubnemo:feature/custom-token-tuner-weight-tying on Mar 6, 2025

Conversation

Collaborator

@githubnemo commented on Feb 25, 2025

This is a follow-up PR of #2376 to add support for weight-tying. Do not merge before the other one is merged.

What is this

Some models, such as gpt2, tie the weights between the LM head and the input embeddings for various reasons. If we use the trainable tokens adapter, we're changing the result of the forward() of the input embeddings but we do not change the weights (unless we merge()). This means that the changes are not reflected in the tied weights, such as the LM head, leading to wrong results when training.
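To make the tying issue concrete, here is a minimal check (illustrative only; assumes `transformers` is installed and uses gpt2, whose LM head is tied to the input embeddings by default):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

emb = model.get_input_embeddings()    # wte
head = model.get_output_embeddings()  # lm_head

# With weight tying, both modules share the very same parameter storage.
assert emb.weight.data_ptr() == head.weight.data_ptr()

# An adapter that only changes the *output* of emb.forward() leaves emb.weight
# untouched, so the tied lm_head keeps computing logits from the old weights
# until the adapter is merged.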

How it is solved

The current approach searches for tied layers and puts TrainableTokensLayer adapters on them as well, initialized to use the parameters from the embedding layer's TrainableTokensLayer. This is done via the `tied_adapter` argument of `TrainableTokensLayer.__init__()`.
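Roughly, the mechanism can be pictured like this (a conceptual sketch only; apart from the `tied_adapter` argument, the class and attribute names here are illustrative and not PEFT's actual implementation):

import torch.nn as nn

class TiedTokensSketch(nn.Module):
    def __init__(self, base_layer, token_indices, tied_adapter=None):
        super().__init__()
        self.base_layer = base_layer
        self.token_indices = token_indices
        if tied_adapter is None:
            # Primary adapter (on the input embedding): owns the trainable rows.
            self.trainable_tokens = nn.Parameter(
                base_layer.weight[token_indices].detach().clone()
            )
        else:
            # Tied adapter (e.g. on the LM head): reuses the primary adapter's
            # parameter, so both layers always see the same updated token rows.
            self.trainable_tokens = tied_adapter.trainable_tokens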

What needs to be done

  • encoder-decoder model tests
  • support for standalone TrainableTokens adapter
  • more tests

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@githubnemo force-pushed the feature/custom-token-tuner-weight-tying branch from 69948b9 to ac70db6 on February 26, 2025 16:00
nemo added 11 commits February 26, 2025 17:21
Notably we are removing the duplication filter of `named_modules` when searching for
the (tied) target modules since tied weights are by definition duplicates.
It's now possible to let the adapter decide which is the input embedding layer based on the output
of `model.get_input_embeddings()`. If that fails, the default is still `embed_tokens`.
This is probably just a case of model misconfiguration, but there are cases in the tests
where `tie_embedding_weights` is set to true in the config but no `tied_weights_keys` is set on the model.
Before this change only the selection of the module that was supposed to have the queried
attribute was given to the wrapper implementation (via `_{has,get}attr_wrapped`). Now the full
`getattr()` call is done by the implementation.

This change is motivated by the need for access to `embedding.weight` at certain times which,
for `ModulesToSaveWrapper` is not a problem - but it is for `TrainableTokensWrapper` since
the original module's weights differ from the current weights, at least potentially.

What we do now is to merge the weights and return those when `embedding.weight` is accessed.
No other attributes are currently forwarded.
Mixed batch is still broken, though.
Looking at you, stable diffusion
@githubnemo marked this pull request as ready for review on March 3, 2025 15:43
Member

@BenjaminBossan left a comment

Thanks for adding support for trainable tokens with tied embeddings and enhancing the tests. This was more complex than I expected. Good work covering this many edge cases.

I have a couple of comments, but I think there is nothing major.

Comment thread src/peft/peft_model.py
Comment thread src/peft/tuners/lora/config.py Outdated
found, `embed_tokens`). Alternatively, you can specify a dictionary where the key is the name of the
embedding module and the values are the list of token indices, e.g. `{'embed_tokens': [0, 1, ...]}`. Note
that training with FSDP/DeepSpeed might not yet be fully supported with this option enabled. Also note that
models using weight-tying are currently not supported.
Member

Adjust/delete?

Comment thread src/peft/tuners/trainable_tokens/layer.py

def update_layer(self, adapter_name, **kwargs):
    if kwargs.get("tied_adapter", None):
        # in this case we don't have any say, we're just following whatever the tied
Member

Suggested change
- # in this case we don't have any say, we're just following whatever the tied
+ # in this case we don't have any because we're just following whatever the tied

?

Collaborator Author

I meant to express that, as a tied layer, we don't have anything to do but to return. I'll clarify.
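In other words, roughly (a sketch of the intended control flow, not the exact code):

def update_layer(self, adapter_name, **kwargs):
    if kwargs.get("tied_adapter", None):
        # As a tied layer we simply follow whatever the tied adapter holds;
        # there is nothing of our own to initialize, so return early.
        return
    # ... regular (non-tied) initialization continues here ...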

scale_grad_by_freq=self.base_layer.scale_grad_by_freq,
sparse=self.base_layer.sparse,
)
elif isinstance(self.base_layer, torch.nn.Linear):
Member

This would not necessarily work with quantized models, right? I wonder if we can find a more robust way of handling this, but I'm not sure how exactly.

Collaborator Author

I think at least for bnb that isinstance(self.base_layer, torch.nn.Linear) holds true. Or what do you mean?

Member

This is true for some quantization methods but not for others. Just an example:

import torch.nn as nn
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=8, group_size=64)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', quantization_config=quant_config)
isinstance(model.model.decoder.layers[0].self_attn.k_proj, nn.Linear)  # => False

Although in this case, the LM head is a normal nn.Linear layer. I'm not sure if this is always the case.

Collaborator Author

Hmm, is it realistic to address the issue in this PR? It seems more like a separate can of worms.

Member

Okay, let's keep it for later. Then let's add a comment here so that we don't forget.

Comment thread tests/test_trainable_tokens.py Outdated
emb_in = peft_model.model.encoder.embed_tokens(torch.tensor([token_indices]))
emb_out = peft_model.model.lm_head(1 / emb_in)

assert all(torch.diag(emb_out[0]) == torch.tensor([emb_dim] * len(token_indices)))
Member

Same argument as above

[
("model_emb", lambda model: model.emb),
("model_embed_in", lambda model: model.embed_in),
("model", lambda model: model.model.model.embed_tokens),
Member

It could be more prudent to use operator.attrgetter than lambda but maybe it's unproblematic here.

Collaborator Author

In hindsight, yes. I didn't know if there are models where it is a bit more complicated, so I left it as lambdas. I can change it if you want.

Member

I know that lambda has some weird scoping rules, but I can never remember what exactly they are. If it's not relevant here, it's okay to leave it as is.
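The quirk being half-remembered here is late binding: a lambda looks up captured names at call time, which matters when lambdas are built in a loop (it is not an issue for the hand-written lambdas in this parametrization):

import operator

names = ["emb", "embed_in"]

# Late binding: both lambdas resolve `name` only when called, so both end up
# fetching "embed_in", the loop variable's final value.
getters = [lambda model: getattr(model, name) for name in names]

# operator.attrgetter captures the attribute name immediately and avoids this.
getters = [operator.attrgetter(name) for name in names]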

Comment thread src/peft/utils/other.py Outdated
return self.token_adapter.get_base_layer()


def _get_input_embeddings_name(model):
Member

The function could get a default argument that it returns instead of None, like getattr, but no strong opinion.
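For illustration, the getattr-like signature suggested here could look roughly as follows (a sketch; the real helper lives in src/peft/utils/other.py and may differ):

def _get_input_embeddings_name(model, default=None):
    if not hasattr(model, "get_input_embeddings"):
        return default
    input_embeddings = model.get_input_embeddings()
    for name, module in model.named_modules():
        if module is input_embeddings:
            return name
    return default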

Comment thread tests/testing_common.py Outdated
if hasattr(model, "config"): # custom models don't have a config attribute
assert config["base_model_name_or_path"] == model.config.to_dict()["_name_or_path"]

def perturb_trainable_token_weights_if_used(self, model, config_kwargs, adapter_name="default", weight=1.0):
Member

Suggested change
- def perturb_trainable_token_weights_if_used(self, model, config_kwargs, adapter_name="default", weight=1.0):
+ def perturb_trainable_token_weights_if_used(self, model, config_kwargs, adapter_name="default", scale=1.0):

Maybe a more precise name?

Comment thread tests/testing_common.py
githubnemo and others added 8 commits March 4, 2025 17:34
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
…githubnemo/peft into feature/custom-token-tuner-weight-tying
* initialization from buffers was broken since `persistent` flag was set too late
  (update() is called before setting the flag)

* update from other BufferDict was broken since it was assumed that BufferDict was
  a mapping collection object. We cannot simply change it to a Mapping since that
  would then break PyTorch code which assumes that modules are hashable.
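For context, a minimal sketch of the two fixes (the `BufferDictSketch` name and structure are illustrative, not PEFT's actual BufferDict):

import torch.nn as nn

class BufferDictSketch(nn.Module):
    def __init__(self, buffers=None, persistent=False):
        super().__init__()
        # The persistent flag must be known *before* update() registers buffers,
        # because register_buffer() records persistence at registration time.
        self.persistent = persistent
        if buffers is not None:
            self.update(buffers)

    def update(self, buffers):
        # Another BufferDict is an nn.Module, not a Mapping (and must stay hashable),
        # so read its buffers via named_buffers() instead of assuming .items().
        if isinstance(buffers, BufferDictSketch):
            items = list(buffers.named_buffers())
        else:
            items = list(buffers.items())
        for key, tensor in items:
            self.register_buffer(key, tensor, persistent=self.persistent)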
Member

@BenjaminBossan left a comment

Thanks for the last changes and for the fixes to BufferDict. PR LGTM.

scale_grad_by_freq=self.base_layer.scale_grad_by_freq,
sparse=self.base_layer.sparse,
)
elif isinstance(self.base_layer, torch.nn.Linear):
Member

Okay, let's keep it for later. Then let's add a comment here so that we don't forget.

@githubnemo merged commit 461f642 into huggingface:main on Mar 6, 2025
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request May 13, 2025
This is a follow-up PR of huggingface#2376 to add support for weight-tying.

Some models, such as gpt2, tie the weights between the LM head and the input embeddings for various reasons. If we use the trainable tokens adapter, we're changing the result of the forward() of the input embeddings but we do not change the weights (unless we merge()). This means that the changes are not reflected in the tied weights, such as the LM head, leading to wrong results when training.

The current approach searches for tied layers and puts TrainableTokensLayer adapters on them as well, initialized to use the parameters from the embedding layer's TrainableTokensLayer. This is done via the `tied_adapter` argument of `TrainableTokensLayer.__init__()`.

Notable other changes:

* Implement weight-tying for encoder-decoder models

Notably we are removing the duplication filter of `named_modules` when searching for
the (tied) target modules since tied weights are by definition duplicates.

* Implement embedding name inference

It's now possible to let the adapter decide which is the input embedding layer based on the output
of `model.get_input_embeddings()`. If that fails, the default is still `embed_tokens`.

* Refactor getattr in AuxiliaryTrainingWrapper

Before this change only the selection of the module that was supposed to have the queried
attribute was given to the wrapper implementation (via `_{has,get}attr_wrapped`). Now the full
`getattr()` call is done by the implementation.

This change is motivated by the need for access to `embedding.weight` at certain times which,
for `ModulesToSaveWrapper` is not a problem - but it is for `TrainableTokensWrapper` since
the original module's weights differ from the current weights, at least potentially.

What we do now is to merge the weights and return those when `embedding.weight` is accessed.
No other attributes are currently forwarded.

* initialization from buffers was broken since `persistent` flag was set too late
  (update() is called before setting the flag)

* update from other BufferDict was broken since it was assumed that BufferDict was
  a mapping collection object. We cannot simply change it to a Mapping since that
  would then break PyTorch code which assumes that modules are hashable.

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
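The getattr refactor described in the commit message above can be pictured roughly like this (illustrative sketch; method names such as `get_merged_weights` are hypothetical, not PEFT's actual API):

class TrainableTokensWrapperSketch:
    def __init__(self, original_module, token_adapter):
        self.original_module = original_module
        self.token_adapter = token_adapter

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself.
        if name == "weight":
            # Return the *effective* embedding matrix: the original weights with
            # the trainable token rows merged in, since the stored weights may
            # differ from what forward() currently computes.
            return self.token_adapter.get_merged_weights()
        return getattr(self.original_module, name)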
cyyever pushed a commit to cyyever/peft that referenced this pull request Sep 4, 2025
* New type hint structure

* Update type hints

* Delete wrong file

* Remove dict import