
Voxtral model fails with LoRA due to in-place operation error #40488

@juzhxng

System Info

Description

The Voxtral model crashes during LoRA training with:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Location of Issue

The error occurs in modeling_voxtral.py at line 512 in the forward method:

        if input_features is not None:
            audio_embeds = self.get_audio_embeds(input_features)

            # replace text-audio token placeholders with audio embeddings
            audio_token_mask = input_ids == self.config.audio_token_id
            inputs_embeds[audio_token_mask] = audio_embeds  # <-- This line causes the error
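
This line writes in place into inputs_embeds. A plausible mechanism (my reading, not confirmed against the Voxtral code): with the base weights frozen by LoRA and gradient checkpointing enabled, the enable_input_require_grads() forward hook flips requires_grad on the embedding output; since nothing upstream tracks gradients, that tensor is an autograd leaf, and PyTorch forbids in-place writes into a leaf that requires grad. The toy snippet below reproduces the same error without Voxtral or PEFT:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
emb.weight.requires_grad_(False)      # base weights frozen, as under LoRA
out = emb(torch.tensor([1, 2, 3]))    # nothing upstream tracks grad -> leaf tensor
out.requires_grad_(True)              # what the input-require-grads hook does
out[torch.tensor([True, False, True])] = 0.0
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.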

Environment

  • transformers: 4.55.4
  • torch: 2.7.1
  • mistral_common: 1.8.3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# Configuration
from transformers import VoxtralForConditionalGeneration, VoxtralProcessor
from peft import LoraConfig, get_peft_model
import torch
import numpy as np

# Load model and processor
model_name = "mistralai/Voxtral-Mini-3B-2507"
processor = VoxtralProcessor.from_pretrained(model_name)
model = VoxtralForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"],
    bias="none",
)

model = get_peft_model(model, peft_config)

# Create proper audio input using the processor
dummy_audio = np.random.randn(48000)  # 3 seconds of audio at 16kHz (1D array)
audio_inputs = processor.feature_extractor(
    raw_speech=dummy_audio,
    sampling_rate=16000,
    return_tensors="pt"
)

# Prepare sample inputs (both audio features and text tokens required for transcription)
# This would typically come from your data collator during training
batch = {
    "input_ids": torch.randint(0, 1000, (1, 50)).to(model.device),
    "input_features": audio_inputs.input_features.to(model.device),  # Use properly generated features
    "attention_mask": torch.ones(1, 50).to(model.device),
    "labels": torch.randint(0, 1000, (1, 50)).to(model.device)
}

print(f"input_features shape: {batch['input_features'].shape}")
print(f"input_ids shape: {batch['input_ids'].shape}")

# This triggers the error when both input_ids and input_features are provided
outputs = model(**batch)

The relevant frames from the traceback:

    510     # replace text-audio token placeholders with audio embeddings
    511     audio_token_mask = input_ids == self.config.audio_token_id
--> 512     inputs_embeds[audio_token_mask] = audio_embeds
    514 outputs: BaseModelOutputWithPast = self.language_model(
    515     attention_mask=attention_mask,
    516     position_ids=position_ids,
    (...)
    523     **kwargs,
    524 )
    525 return outputs

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
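
One way the modeling code could avoid this (a sketch, not a tested patch) is to build the merged embeddings out of place with masked_scatter, the pattern used by other multimodal models in transformers (e.g. Llava), instead of indexing into the tensor:

# Hedged sketch of an out-of-place version of line 512 in modeling_voxtral.py:
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(
    audio_token_mask.expand_as(inputs_embeds),
    audio_embeds.to(inputs_embeds.device, inputs_embeds.dtype),
)

masked_scatter returns a new tensor, so nothing is written into the leaf, and gradients still flow into audio_embeds.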

Expected behavior

Expected: the forward and backward passes complete so LoRA fine-tuning can proceed.
Actual: the forward pass crashes with the RuntimeError above.
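
Until this is fixed upstream, a hypothetical user-side mitigation (untested; it assumes the leaf really does come from the require-grads hook on the input embeddings) is to register one more forward hook that returns a clone, so the tensor the model writes into is no longer a leaf:

embeddings = model.get_input_embeddings()

def clone_embedding_output(module, args, output):
    output.requires_grad_(True)
    return output.clone()  # the clone is non-leaf, so the in-place write is legal

embeddings.register_forward_hook(clone_embedding_output)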
