{%- elif item['type'] == 'image' -%}
{{- '\n\n<|image|>\n\n' -}} ← has \n\n
{%- elif item['type'] == 'audio' -%}
{{- '<|audio|>' -}} ← missing \n\n
This causes the model to return an empty response (it emits <turn|> immediately) when an image is placed before audio in the message content list.
Reproduction
python
from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch

processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Placeholder paths (hypothetical); replace with a real image and audio file.
IMAGE_PATH = "path/to/image.jpg"
AUDIO_PATH = "path/to/audio.wav"

# ❌ image before audio → model fails
messages_image_first = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": IMAGE_PATH},
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ],
    }
]

# ✅ audio before image → works correctly
messages_audio_first = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "image", "url": IMAGE_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ],
    }
]
Root Cause
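The Jinja template inserts \n\n around image tokens but not around audio tokens. When the audio item is immediately followed by the text item, nothing separates the audio placeholder from the text.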
Image-first token sequence (broken):
<|image>...<image|>\n\n<|audio>...<audio|>Describe the image and audio.
↑ audio token directly concatenated with text, no separator
Audio-first token sequence (correct):
<|audio>...<audio|>\n\n<|image>...<image|>\n\n Describe the image and audio.
↑ ↑ correct \n\n separators
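The two sequences above can be reproduced with a small stand-in for the template's content loop. This is only a sketch: `render`, the placeholder constants, and the assumption that text items are emitted unchanged are illustrative and are not taken from the actual chat_template.jinja.

```python
# Simplified stand-in for the content loop in the chat template (assumption:
# text items are emitted as-is; the real template may trim or wrap them).
IMAGE_PLACEHOLDER = "\n\n<|image|>\n\n"
AUDIO_BUGGY = "<|audio|>"          # current template: no separators
AUDIO_FIXED = "\n\n<|audio|>\n\n"  # proposed fix: same separators as image

IMAGE_FIRST = [
    {"type": "image"},
    {"type": "audio"},
    {"type": "text", "text": "Describe the image and audio."},
]
AUDIO_FIRST = [
    {"type": "audio"},
    {"type": "image"},
    {"type": "text", "text": "Describe the image and audio."},
]


def render(content, audio_placeholder):
    """Concatenate placeholders and text in message order."""
    parts = []
    for item in content:
        if item["type"] == "image":
            parts.append(IMAGE_PLACEHOLDER)
        elif item["type"] == "audio":
            parts.append(audio_placeholder)
        elif item["type"] == "text":
            parts.append(item["text"])
    return "".join(parts)


if __name__ == "__main__":
    for name, content in [("image-first", IMAGE_FIRST), ("audio-first", AUDIO_FIRST)]:
        print(name, "buggy:", repr(render(content, AUDIO_BUGGY)))
        print(name, "fixed:", repr(render(content, AUDIO_FIXED)))
```

With the buggy placeholder, only the image-first ordering ends with the audio placeholder glued to the prompt text; with the fixed placeholder, both orderings put \n\n before the text.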
Note also that before the fix, the two orderings produce different input_ids shapes:
audio-first shape: torch.Size([1, 738])
image-first shape: torch.Size([1, 737]) ← 1 token missing due to missing \n\n
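One way to obtain these shapes is sketched below. The helper names `input_ids_shape` and `shapes_match` are hypothetical; the sketch assumes the `processor` and the two message lists from the reproduction script, and that the processor's `apply_chat_template` accepts the keyword arguments shown.

```python
def input_ids_shape(processor, messages):
    """Tokenize a conversation via the chat template and return the input_ids shape."""
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    return tuple(inputs["input_ids"].shape)


def shapes_match(processor, messages_a, messages_b):
    """True when both conversations tokenize to the same input_ids shape."""
    return input_ids_shape(processor, messages_a) == input_ids_shape(processor, messages_b)
```

Calling `shapes_match(processor, messages_audio_first, messages_image_first)` should return False before the fix, given the shapes reported above.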
Evidence
Top 10 next tokens with image-first (before fix):
'<turn|>': 0.7383 ← model immediately ends the turn
'<eos>': 0.1406
Top 10 next tokens with image-first (after fix):
'这张': 0.6797 ← model correctly starts generating (Chinese for "This [picture]")
'好的': 0.2656 ← Chinese for "OK"
Full generation with audio-first (before fix):
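这张图片展示了一个教室的场景。画面中有一位戴眼镜的女性老师站在讲台后面...
音频中有人在呼喊"Look look look at the girl"...
(English: This picture shows a classroom scene. A female teacher wearing glasses is standing behind the podium... In the audio, someone is shouting "Look look look at the girl"...)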
Full generation with image-first (before fix):
(empty, model outputs <turn|> immediately)
Full generation with image-first (after fix):
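这张图片展示了一个教室的场景,有几位学生和一位老师...
音频内容似乎是孩子们在进行某种对话或游戏...
(English: This picture shows a classroom scene with several students and a teacher... The audio appears to be children having some kind of conversation or playing a game...)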
Fix
In chat_template.jinja, change:
jinja
{%- elif item['type'] == 'audio' -%}
{{- '<|audio|>' -}}
to:
jinja
{%- elif item['type'] == 'audio' -%}
{{- '\n\n<|audio|>\n\n' -}}
After the fix, both orderings produce identical input_ids shapes and correct outputs.
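Until the template is fixed upstream, the same change can be applied to the template string at load time. The sketch below is a workaround under assumptions: `patch_audio_separator` is a hypothetical helper, and it only works if the shipped template contains the audio line with exactly the spacing and quoting quoted in this issue.

```python
# Raw strings keep the backslashes literal, matching the Jinja source text.
BUGGY_AUDIO_LINE = r"{{- '<|audio|>' -}}"
FIXED_AUDIO_LINE = r"{{- '\n\n<|audio|>\n\n' -}}"


def patch_audio_separator(template: str) -> str:
    """Return the template with \\n\\n separators added around the audio placeholder.

    If the buggy line is not found (already patched, or formatted differently),
    the template is returned unchanged.
    """
    return template.replace(BUGGY_AUDIO_LINE, FIXED_AUDIO_LINE)
```

Assuming the processor exposes its template as a string attribute, usage would look like `processor.chat_template = patch_audio_separator(processor.chat_template)`.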
Environment
transformers version: 5.5.0
Model: google/gemma-4-E2B-it
Bug Description
In chat_template.jinja for Gemma4, the image token has \n\n separators but the audio token does not.