System Info
Description
The Voxtral model crashes during LoRA training with:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
Location of Issue
The error occurs in modeling_voxtral.py at line 512 in the forward method:
```python
if input_features is not None:
    audio_embeds = self.get_audio_embeds(input_features)
    # replace text-audio token placeholders with audio embeddings
    audio_token_mask = input_ids == self.config.audio_token_id
    inputs_embeds[audio_token_mask] = audio_embeds  # <-- This line causes the error
```
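For context, this error is PyTorch's general guard against in-place writes into a leaf tensor that requires gradients; a minimal standalone sketch (independent of Voxtral) reproduces the same `RuntimeError`:

```python
import torch

# A freshly created tensor with requires_grad=True is an autograd "leaf".
x = torch.zeros(3, requires_grad=True)

try:
    # Indexed assignment is an in-place operation, which autograd forbids on leaves.
    x[0] = 1.0
except RuntimeError as e:
    print(e)  # "a leaf Variable that requires grad is being used in an in-place operation."
```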
Environment
- transformers: 4.55.4
- torch: 2.7.1
- mistral_common: 1.8.3
Who can help?
No response
Information
Tasks
Reproduction
```python
# Configuration
from transformers import VoxtralForConditionalGeneration, VoxtralProcessor
from peft import LoraConfig, get_peft_model
import torch
import numpy as np

# Load model and processor
model_name = "mistralai/Voxtral-Mini-3B-2507"
processor = VoxtralProcessor.from_pretrained(model_name)
model = VoxtralForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"],
    bias="none",
)
model = get_peft_model(model, peft_config)

# Create proper audio input using the processor
dummy_audio = np.random.randn(48000)  # 3 seconds of audio at 16kHz (1D array)
audio_inputs = processor.feature_extractor(
    raw_speech=dummy_audio,
    sampling_rate=16000,
    return_tensors="pt",
)

# Prepare sample inputs (both audio features and text tokens are required for transcription)
# This would typically come from your data collator during training
batch = {
    "input_ids": torch.randint(0, 1000, (1, 50)).to(model.device),
    "input_features": audio_inputs.input_features.to(model.device),  # properly generated features
    "attention_mask": torch.ones(1, 50).to(model.device),
    "labels": torch.randint(0, 1000, (1, 50)).to(model.device),
}

print(f"input_features shape: {batch['input_features'].shape}")
print(f"input_ids shape: {batch['input_ids'].shape}")

# This triggers the error when both input_ids and input_features are provided
outputs = model(**batch)
```
Traceback excerpt:

```
    510     # replace text-audio token placeholders with audio embeddings
    511     audio_token_mask = input_ids == self.config.audio_token_id
--> 512     inputs_embeds[audio_token_mask] = audio_embeds
    514     outputs: BaseModelOutputWithPast = self.language_model(
    515         attention_mask=attention_mask,
    516         position_ids=position_ids,
    (...)
    523         **kwargs,
    524     )
    525     return outputs

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
Expected behavior
Expected: the forward/backward pass completes, so the model can be fine-tuned with LoRA.
Actual: the forward pass crashes with the RuntimeError above.
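As a possible direction for a fix (my assumption, not a confirmed patch): the indexed assignment at line 512 could be replaced with the out-of-place `Tensor.masked_scatter`, which returns a new tensor rather than mutating the leaf. A minimal sketch with hypothetical stand-in tensors:

```python
import torch

# Hypothetical stand-ins for the tensors in modeling_voxtral.py's forward:
inputs_embeds = torch.randn(1, 5, 4, requires_grad=True)       # leaf embedding tensor
audio_embeds = torch.randn(2, 4)                               # one row per audio placeholder
audio_token_mask = torch.tensor([[False, True, False, True, False]])

# Out-of-place alternative to `inputs_embeds[audio_token_mask] = audio_embeds`:
# the mask is expanded over the hidden dimension, and source rows are consumed
# in order, matching the semantics of the indexed assignment.
inputs_embeds = inputs_embeds.masked_scatter(
    audio_token_mask.unsqueeze(-1), audio_embeds
)

inputs_embeds.sum().backward()  # backward succeeds: no in-place write on a leaf
```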