
Voxtral model fails with LoRA due to in-place operation error #40488

@juzhxng

System Info

Description

The Voxtral model crashes during LoRA training with:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Location of Issue

The error occurs in modeling_voxtral.py at line 512 in the forward method:

        if input_features is not None:
            audio_embeds = self.get_audio_embeds(input_features)

            # replace text-audio token placeholders with audio embeddings
            audio_token_mask = input_ids == self.config.audio_token_id
            inputs_embeds[audio_token_mask] = audio_embeds  # <-- This line causes the error
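
This line writes in place into inputs_embeds. A plausible mechanism (my reading, not confirmed against the Voxtral code): with the base weights frozen by LoRA and gradient checkpointing enabled, the enable_input_require_grads() forward hook flips requires_grad on the embedding output; since nothing upstream tracks gradients, that tensor is an autograd leaf, and PyTorch forbids in-place writes into a leaf that requires grad. The toy snippet below reproduces the same error without Voxtral or PEFT:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
emb.weight.requires_grad_(False)      # base weights frozen, as under LoRA
out = emb(torch.tensor([1, 2, 3]))    # nothing upstream tracks grad -> leaf tensor
out.requires_grad_(True)              # what the input-require-grads hook does
out[torch.tensor([True, False, True])] = 0.0
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.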

Environment

  • transformers: 4.55.4
  • torch: 2.7.1
  • mistral_common: 1.8.3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# Configuration
from transformers import VoxtralForConditionalGeneration, VoxtralProcessor
from peft import LoraConfig, get_peft_model
import torch
import numpy as np

# Load model and processor
model_name = "mistralai/Voxtral-Mini-3B-2507"
processor = VoxtralProcessor.from_pretrained(model_name)
model = VoxtralForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"],
    bias="none",
)

model = get_peft_model(model, peft_config)

# Create proper audio input using the processor
dummy_audio = np.random.randn(48000)  # 3 seconds of audio at 16kHz (1D array)
audio_inputs = processor.feature_extractor(
    raw_speech=dummy_audio,
    sampling_rate=16000,
    return_tensors="pt"
)

# Prepare sample inputs (both audio features and text tokens required for transcription)
# This would typically come from your data collator during training
batch = {
    "input_ids": torch.randint(0, 1000, (1, 50)).to(model.device),
    "input_features": audio_inputs.input_features.to(model.device),  # Use properly generated features
    "attention_mask": torch.ones(1, 50).to(model.device),
    "labels": torch.randint(0, 1000, (1, 50)).to(model.device)
}

print(f"input_features shape: {batch['input_features'].shape}")
print(f"input_ids shape: {batch['input_ids'].shape}")

# This triggers the error when both input_ids and input_features are provided
outputs = model(**batch)

The relevant frames from the traceback:

    510     # replace text-audio token placeholders with audio embeddings
    511     audio_token_mask = input_ids == self.config.audio_token_id
--> 512     inputs_embeds[audio_token_mask] = audio_embeds
    514 outputs: BaseModelOutputWithPast = self.language_model(
    515     attention_mask=attention_mask,
    516     position_ids=position_ids,
    (...)
    523     **kwargs,
    524 )
    525 return outputs

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
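
One way the modeling code could avoid this (a sketch, not a tested patch) is to build the merged embeddings out of place with masked_scatter, the pattern used by other multimodal models in transformers (e.g. Llava), instead of indexing into the tensor:

# Hedged sketch of an out-of-place version of line 512 in modeling_voxtral.py:
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(
    audio_token_mask.expand_as(inputs_embeds),
    audio_embeds.to(inputs_embeds.device, inputs_embeds.dtype),
)

masked_scatter returns a new tensor, so nothing is written into the leaf, and gradients still flow into audio_embeds.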

Expected behavior

Expected: the forward and backward passes complete so LoRA fine-tuning can proceed.
Actual: the forward pass crashes with the RuntimeError above.
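
Until this is fixed upstream, a hypothetical user-side mitigation (untested; it assumes the leaf really does come from the require-grads hook on the input embeddings) is to register one more forward hook that returns a clone, so the tensor the model writes into is no longer a leaf:

embeddings = model.get_input_embeddings()

def clone_embedding_output(module, args, output):
    output.requires_grad_(True)
    return output.clone()  # the clone is non-leaf, so the in-place write is legal

embeddings.register_forward_hook(clone_embedding_output)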
