[Gemma4] Bug: audio token missing newline separators in chat_template.jinja causes multimodal failure when image precedes audio #45331

@LfWhat

Bug Description

In chat_template.jinja for Gemma4, the image token has \n\n separators
but the audio token does not:

{%- elif item['type'] == 'image' -%}
    {{- '\n\n<|image|>\n\n' -}}   ← has \n\n
{%- elif item['type'] == 'audio' -%}
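    {{- '<|audio|>' -}}            ← missing \n\n

As a result, when an image is placed before audio in the message content list, the audio placeholder is concatenated directly with the following text, and the model ends its turn without generating anything.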
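To make the concatenation concrete, here is a minimal Python sketch that imitates the content loop of the template. It is not the real Jinja template: only the image and audio branches are taken from the snippet above, while the text branch (emit the text unchanged) and the helper name `render_content` are assumptions for illustration.

```python
# Simplified stand-in for the content loop in chat_template.jinja.
# Only the image and audio branches come from the issue; the text branch
# (emit the text unchanged) is an assumption made for illustration.

IMAGE = "\n\n<|image|>\n\n"
AUDIO_BEFORE_FIX = "<|audio|>"
AUDIO_AFTER_FIX = "\n\n<|audio|>\n\n"


def render_content(items, audio_placeholder):
    """Concatenate per-item placeholders in list order, as the template loop does."""
    parts = []
    for item in items:
        if item["type"] == "image":
            parts.append(IMAGE)
        elif item["type"] == "audio":
            parts.append(audio_placeholder)
        elif item["type"] == "text":
            parts.append(item["text"])
    return "".join(parts)


image_first = [
    {"type": "image"},
    {"type": "audio"},
    {"type": "text", "text": "Describe the image and audio."},
]
audio_first = [image_first[1], image_first[0], image_first[2]]

# Before the fix, image-first leaves no separator between audio and text:
# '\n\n<|image|>\n\n<|audio|>Describe the image and audio.'
print(repr(render_content(image_first, AUDIO_BEFORE_FIX)))

# Before the fix, audio-first happens to get a separator from the image branch:
# '<|audio|>\n\n<|image|>\n\nDescribe the image and audio.'
print(repr(render_content(audio_first, AUDIO_BEFORE_FIX)))

# With the proposed audio placeholder, the text is separated in either order.
print(repr(render_content(image_first, AUDIO_AFTER_FIX)))
```

The sketch only shows where the separators land in the rendered string; how the tokenizer splits the resulting newlines is not modelled here.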

Reproduction
python
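from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch

# Placeholder paths (not given in the original report); point these at a local image and audio file.
IMAGE_PATH = "path/to/image.jpg"
AUDIO_PATH = "path/to/audio.wav"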

processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ❌ image before audio → model fails
messages_image_first = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": IMAGE_PATH},
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ]
    }
]

# ✅ audio before image → works correctly
messages_audio_first = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "image", "url": IMAGE_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ]
    }
]

Root Cause
The jinja template inserts \n\n around image tokens but not around audio tokens.

Image-first token sequence (broken):

<|image>...<image|>\n\n<|audio>...<audio|>Describe the image and audio.
                                          ↑ audio token directly concatenated with text, no separator

Audio-first token sequence (correct):

<|audio>...<audio|>\n\n<|image>...<image|>\n\n Describe the image and audio.
                    ↑                      ↑ correct \n\n separators

Note also that before the fix, the two orderings produce different input_ids shapes:

audio-first shape: torch.Size([1, 738])
image-first shape: torch.Size([1, 737])  ← 1 token missing due to missing \n\n
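
The shape comparison above could be reproduced with a small helper along these lines. This is a sketch, not the reporter's script: the helper names are hypothetical, and the choice of `add_generation_prompt=True` is an assumption, since the issue does not say how the inputs were built.

```python
def input_ids_shape(processor, messages):
    """Run one conversation through the chat template and return the input_ids shape."""
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,  # assumption: the report does not state this
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    return tuple(inputs["input_ids"].shape)


def compare_orderings(processor, messages_a, messages_b):
    """Return both shapes and whether the two orderings tokenize to the same length."""
    shape_a = input_ids_shape(processor, messages_a)
    shape_b = input_ids_shape(processor, messages_b)
    return {"a": shape_a, "b": shape_b, "same_length": shape_a == shape_b}
```

With the objects from the reproduction, the call would be `compare_orderings(processor, messages_audio_first, messages_image_first)`.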
Evidence
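Top 10 next tokens with image-first (before fix):

'<turn|>': 0.7383   ← model immediately ends the turn
'<eos>':   0.1406

Top 10 next tokens with image-first (after fix):

'这张': 0.6797      ← model correctly starts generating ("This [picture]")
'好的': 0.2656      ← "OK"

Full generation with audio-first (before fix):

这张图片展示了一个教室的场景。画面中有一位戴眼镜的女性老师站在讲台后面...
音频中有人在呼喊"Look look look at the girl"...
(English: This picture shows a classroom scene. A female teacher wearing glasses is standing behind the lectern... In the audio, someone is shouting "Look look look at the girl"...)

Full generation with image-first (before fix):

(empty, model outputs <turn|> immediately)

Full generation with image-first (after fix):

这张图片展示了一个教室的场景,有几位学生和一位老师...
音频内容似乎是孩子们在进行某种对话或游戏...
(English: This picture shows a classroom scene with several students and a teacher... The audio seems to be children having some kind of conversation or game...)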
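
For anyone wanting to reproduce the next-token probabilities listed in the evidence, a ranking like that can be computed from the logits at the last position. The helper below is a plain-Python sketch with a hypothetical name; it is not taken from the report, which does not show how the numbers were produced.

```python
import math


def top_next_tokens(logits, id_to_token, k=10):
    """Return the k most likely next tokens as (token, probability) pairs.

    logits: list of floats, one per vocabulary id (logits at the last position).
    id_to_token: callable mapping a vocabulary id to its token string.
    """
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)[:k]
    return [(id_to_token(i), exps[i] / total) for i in ranked]
```

In practice the logits would probably come from something like `model(**inputs).logits[0, -1].float().tolist()`, with `processor.tokenizer.convert_ids_to_tokens` as the id-to-token mapping; both are assumptions about the setup rather than details from the issue.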
Fix
In chat_template.jinja, change:

jinja
{%- elif item['type'] == 'audio' -%}
    {{- '<|audio|>' -}}
to:

jinja
{%- elif item['type'] == 'audio' -%}
    {{- '\n\n<|audio|>\n\n' -}}
After the fix, both orderings produce identical input_ids shapes and correct outputs.
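
Until the template is fixed upstream, the same change can be applied to the template string at runtime. The sketch below assumes the audio branch appears in the template exactly as quoted above; the helper name `patch_audio_separator` is hypothetical.

```python
# Jinja source fragments; the backslashes are literal characters in the template text.
OLD_AUDIO = "{{- '<|audio|>' -}}"
NEW_AUDIO = "{{- '\\n\\n<|audio|>\\n\\n' -}}"


def patch_audio_separator(template: str) -> str:
    """Wrap the audio placeholder in blank-line separators if it is not already wrapped."""
    if OLD_AUDIO not in template:
        # Already patched, or the template is written differently than assumed.
        return template
    return template.replace(OLD_AUDIO, NEW_AUDIO)
```

Assuming the processor exposes the template as `processor.chat_template`, the workaround would be `processor.chat_template = patch_audio_separator(processor.chat_template)` before calling `apply_chat_template`.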

Environment
transformers version: 5.5.0
Model: google/gemma-4-E2B-it
