Fixing bug in Voxtral when merging text and audio embeddings #40671
eustlb merged 5 commits into huggingface:main
Conversation
```diff
+ # Enable gradient tracking when inputs_embeds is a leaf tensor
+ if inputs_embeds.is_leaf and inputs_embeds.requires_grad:
+     inputs_embeds = inputs_embeds.clone()
  # replace text-audio token placeholders with audio embeddings
  audio_token_mask = input_ids == self.config.audio_token_id
- inputs_embeds[audio_token_mask] = audio_embeds
+ inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)
```
Not a fan of using a clone tbh, can we use a masked scatter here instead, e.g. see
Should be an out-of-place op then, so I would hope it resolves the in-place operation issue (probably still needs the device movement).
Agree here with @vasqu, let's not forget also to move the mask to the correct device and make it broadcastable!
Can you please change to:

```python
# replace text-audio token placeholders with audio embeddings
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(
    audio_token_mask.to(inputs_embeds.device),
    audio_embeds.to(inputs_embeds.device),
)
```
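For context, here is a minimal, self-contained sketch of why `masked_scatter` sidesteps the in-place problem. This is not the actual Voxtral code; the shapes and the `audio_token_id` value are made up for illustration:

```python
import torch

audio_token_id = 3
input_ids = torch.tensor([[5, 3, 3, 7]])                   # (batch, seq_len)
inputs_embeds = torch.zeros(1, 4, 8, requires_grad=True)   # leaf tensor, e.g. a frozen embedding output
audio_embeds = torch.randn(2, 8)                           # one row per audio placeholder token

# In-place boolean-mask assignment on a leaf tensor that requires grad raises:
#   RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
# inputs_embeds[input_ids == audio_token_id] = audio_embeds

# masked_scatter is out-of-place: it returns a new tensor and leaves the leaf untouched.
# The unsqueezed mask (batch, seq_len, 1) broadcasts over the hidden dimension.
audio_token_mask = (input_ids == audio_token_id).unsqueeze(-1)
merged = inputs_embeds.masked_scatter(
    audio_token_mask.to(inputs_embeds.device),
    audio_embeds.to(inputs_embeds.device),
)

merged.sum().backward()           # gradients flow back to the leaf
print(inputs_embeds.grad.shape)   # torch.Size([1, 4, 8])
```

Because `masked_scatter` returns a fresh tensor rather than mutating the leaf, it resolves the autograd error and, with the explicit `.to(...)` calls, the cross-device assignment as well.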
[For maintainers] Suggested jobs to run (before merge): run-slow: voxtral
Adding a for-patch label here @eustlb? And glad to help ❤️
Actually no, the for-patch label is for what was broken in the last release, which is not the case here ;)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
This PR addresses two issues arising from an in-place operation where audio and text embeddings are merged in the Voxtral model. It fixes issue #40488 and an unreported issue resulting from using `device_map="auto"` with the Voxtral model.

More detail about the two issues and their resolution (a minimal repro sketch follows this list):

1. The `forward` method of `VoxtralForConditionalGeneration` fails when using LoRA. The underlying issue is that when the text embedding layer is frozen, the `inputs_embeds` tensor extracted from the embedding layer is a leaf tensor. When `inputs_embeds` is a leaf tensor that requires gradients, its values cannot be reassigned in place. To address this, `inputs_embeds` is now cloned, to enable gradient tracking, whenever it is a leaf tensor that requires gradients.
2. `device_map="auto"` can distribute the audio and text layers of Voxtral across different devices, in which case `inputs_embeds` and `audio_embeds` may end up on different devices. Assigning values of `audio_embeds` into `inputs_embeds` then fails, since both tensors are expected to be on the same device. To address this, `audio_embeds` is moved to the same device as `inputs_embeds` before `inputs_embeds` is updated.
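As a concrete illustration of the first issue, here is a standalone sketch that reproduces the leaf-tensor error and shows the clone-based fix from this PR. It is not the actual Voxtral code; shapes and values are illustrative:

```python
import torch

# Stand-in for embeddings produced by a frozen embedding layer when something
# like enable_input_require_grads() is active (as in PEFT/LoRA training):
# the resulting tensor is a leaf that requires grad.
inputs_embeds = torch.zeros(1, 4, 8)
inputs_embeds.requires_grad_(True)
assert inputs_embeds.is_leaf and inputs_embeds.requires_grad

audio_token_mask = torch.tensor([[False, True, True, False]])
audio_embeds = torch.randn(2, 8)

try:
    inputs_embeds[audio_token_mask] = audio_embeds  # in-place write on a leaf
except RuntimeError as err:
    print(err)  # a leaf Variable that requires grad is being used in an in-place operation.

# The fix in this PR: clone first, so the write targets a non-leaf copy, and
# move audio_embeds to inputs_embeds' device for the device_map="auto" case.
if inputs_embeds.is_leaf and inputs_embeds.requires_grad:
    inputs_embeds = inputs_embeds.clone()
inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)
inputs_embeds.sum().backward()  # gradients now flow without error
```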
Before submitting

- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. -> Voxtral model fails with LoRA due to in-place operation error #40488
Who can review?
@eustlb