[PEFT] Fix prefix tuning #41696
Conversation
run-slow: bert,bart

This comment contains run-slow, running the specified jobs: models: ['models/bart', 'models/bert']
BenjaminBossan left a comment:
Thanks a lot for taking quick care of this regression.
I also tested this PR with the original PEFT unit tests that were failing and they pass now!
Ping @ArthurZucker, could we please get this in before the next release?
Cyrilvallez left a comment:
All right, let's go for it even if it's quite a weird pattern, as both should always be the same in general 🥲
Removed after more offline discussion
In transformers, the bidirectional mask creation (as used e.g. by BERT) does not take into account possible virtual tokens inserted by prefix tuning. This results in an attention mask of the wrong shape. This PR monkey patches the `_preprocess_mask_arguments` function from transformers so that the virtual tokens are taken into account. This is far from ideal, but there are currently no plans to fix this on the transformers side (see huggingface/transformers#41696).
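For illustration only, here is a minimal sketch of the idea behind the patch. The helper name and signature below are hypothetical and do not reproduce the actual `_preprocess_mask_arguments` signature or the PEFT wrapper; it only shows how the 2D padding mask would be extended so its key/value dimension also covers the virtual tokens.

```python
import torch
from typing import Optional


def extend_mask_for_virtual_tokens(
    attention_mask: Optional[torch.Tensor], num_virtual_tokens: int
) -> Optional[torch.Tensor]:
    # Hypothetical helper (not the real PEFT/transformers code): prepend ones so
    # the 2D padding mask also covers the prefix-tuning virtual tokens, which sit
    # at the start of the key/value sequence and are always attended to.
    if attention_mask is None or num_virtual_tokens == 0:
        return attention_mask
    prefix = torch.ones(
        attention_mask.shape[0],
        num_virtual_tokens,
        dtype=attention_mask.dtype,
        device=attention_mask.device,
    )
    return torch.cat([prefix, attention_mask], dim=-1)
```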
Cyrilvallez left a comment:
Alright, after all the discussions it's a bit too hard to fix in peft directly, so let's go for it!
* fix
* simplify
* add my 2 cents
Prefix tuning is a bit of a special case where the q and kv lengths differ even though the attention is bidirectional (encoder-style). We now rely on the attention mask for the kv length (if provided). The previous assumption held because causal masks always have q == kv length. Fixes #38301 (comment)
cc @BenjaminBossan
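For illustration, a minimal sketch of what "rely on the mask for kv length" means. The helper name and signature are hypothetical, not the transformers implementation; it only shows taking the key/value length from the 2D padding mask instead of assuming it equals the query length.

```python
import torch
from typing import Optional


def build_bidirectional_mask(
    attention_mask: Optional[torch.Tensor], batch_size: int, q_length: int
) -> torch.Tensor:
    # Hypothetical helper: derive kv_length from the 2D padding mask when one is
    # provided instead of assuming kv == q. With prefix tuning,
    # kv_length = q_length + num_virtual_tokens, so the old assumption breaks.
    if attention_mask is not None:
        kv_length = attention_mask.shape[-1]
        padding = attention_mask[:, None, None, :].bool()
    else:
        kv_length = q_length  # old assumption, only valid without virtual tokens
        padding = torch.ones(batch_size, 1, 1, kv_length, dtype=torch.bool)
    # Bidirectional (BERT-style) attention: every query position may attend to
    # every non-padded key/value position, so the 4D mask is just the padding
    # mask broadcast over the query dimension.
    return padding.expand(batch_size, 1, q_length, kv_length)
```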