audio tester class #45391
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

run-slow: audioflamingo3, granite_speech, qwen2_audio

This comment contains models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
eustlb
left a comment
This is cool!! 🔥
Models that should be covered by this PR:
- audioflamingo3
- glmasr
- granite_speech
- higgs_audio_v2
- kyutai_speech_to_text
- qwen2_audio
- vibevoice_asr
- voxtral
- voxtral_realtime
- musicflamingo
might:
- gemma3n
- gemma4
- qwen2_5_omni
- qwen3_omni_moe
```python
def get_num_audio_tokens(self, audio_features):
    """Compute number of audio placeholder tokens from features. Override for different subsampling."""
    # Default: 2-stage pooling (common for Whisper-style encoders)
    input_length = (audio_features.shape[-1] - 1) // 2 + 1
    return (input_length - 2) // 2 + 1
```
we shouldn't put whisper defaults here but rather force subclasses to implement this method
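A minimal sketch of that suggestion (class names here are illustrative, not the merged implementation): the base tester raises on the hook, and a Whisper-style subclass supplies the two-stage pooling arithmetic quoted above.

```python
from types import SimpleNamespace

class BaseAudioModelTester:
    def get_num_audio_tokens(self, audio_features):
        """Compute the number of audio placeholder tokens from features."""
        raise NotImplementedError("Subclasses must implement get_num_audio_tokens.")

class WhisperStyleTester(BaseAudioModelTester):
    def get_num_audio_tokens(self, audio_features):
        # Two-stage subsampling: each stage roughly halves the frame count.
        input_length = (audio_features.shape[-1] - 1) // 2 + 1
        return (input_length - 2) // 2 + 1

# Stand-in for an (n_mels, n_frames) feature tensor: 3000 frames -> 750 tokens.
feats = SimpleNamespace(shape=(80, 3000))
```

With this shape, `WhisperStyleTester().get_num_audio_tokens(feats)` yields 750, while calling the base class raises instead of silently applying Whisper math to a non-Whisper encoder.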
This comment contains models: ["models/audioflamingo3", "models/glmasr", "models/granite_speech", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
CI Results / Commit Info
Model CI Report: ❌ 1 new failed test from this PR 😭
tarekziade
left a comment
Just a question about a potential coverage regression. The rest seems OK to me. Deferring to core maintainers, but for my part it's good to go.
```diff
 @require_torch
-class AudioFlamingo3ForConditionalGenerationModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+class AudioFlamingo3ForConditionalGenerationModelTest(ALMModelTest, unittest.TestCase):
```
`test_model_base_model_prefix` is now class-skipped; we might be losing coverage here, since the skip is global in `AudioModelTest`?
these tests were all already skipped for the `test_modeling_xx` files affected by this PR, so we should be good. Also, this will be un-skipped in #45534
```diff
@@ -159,47 +90,10 @@ def test_sdpa_can_compile_dynamic(self):
     def test_sdpa_can_dispatch_on_flash(self):
```
fly-by: is this really about compiling? Maybe the skip reason should be "Flash-attn dispatch path not validated for this model"?
yep that's right, idk why this test was skipped before, but it can be un-skipped
Also can you check the Voxtral failure?

run-slow: audioflamingo3, glmasr, granite_speech, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, vibevoice_asr, voxtral, voxtral_realtime

This comment contains models: ["models/audioflamingo3", "models/glmasr", "models/granite_speech", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
CI Results / Commit Info
Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: voxtral_realtime

This comment contains models: ["models/voxtral_realtime"]
@zucchini-nlp pinging you here for review since I've touched
zucchini-nlp
left a comment
Love the new MultimodalTester, thanks for adding it! Should give us more freedom to combine modalities!
Looks good to me, just a couple of nit comments
```python
class MultiModalModelTester:
    """Shared tester base for VLM (vision-language) and ALM (audio-language).

    Concrete subclasses (e.g. `VLMModelTester`, `ALMModelTester`) supply:
    - the modality-specific sub-config class (`vision_config_class` for VLMs, `audio_config_class` for ALMs, ...),
    - the modality-specific defaults and helper methods,
    - the hooks `_build_modality_sub_configs` and `_prepare_modality_inputs`,
    - optionally an extended `_special_token_ids` and `pipeline_model_mapping`.

    This tester provides shared logic for evaluating and verifying models that combine text with other modalities,
```
```python
# Avoid flaky tests by scrubbing any accidental special tokens produced by ids_tensor.
# Modality placeholder tokens are scrubbed and placed by `_prepare_modality_inputs`.
safe_token_id = self._safe_token_id()
input_ids[input_ids == self.pad_token_id] = safe_token_id
input_ids[input_ids == self.eos_token_id] = safe_token_id
```
maybe `input_ids[input_ids == self._special_token_ids] = safe_token_id`, so we prevent all special tokens from appearing?
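A sketch of what that suggestion could look like, with made-up token ids: an elementwise `==` against a collection of ids won't broadcast the membership test, but `torch.isin` scrubs them all in one shot.

```python
import torch

special_token_ids = torch.tensor([0, 1, 2])  # assumed pad/bos/eos ids
safe_token_id = 4                            # assumed to be outside the special set
input_ids = torch.tensor([[5, 0, 7, 2],
                          [1, 9, 3, 0]])

# Replace every special token in one pass instead of one `==` per token id.
input_ids[torch.isin(input_ids, special_token_ids)] = safe_token_id
```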
```python
kwargs.setdefault("moe_num_shared_experts", 2)
kwargs.setdefault("num_experts_per_tok", 2)
kwargs.setdefault("num_experts", 8)
```
these are still text-related config values, should we store them all in MultimodalTester?
moved to `_TEXT_MODEL_TESTER_DEFAULTS`
```python
class MultiModalModelTester:
    """Shared tester base for VLM (vision-language) and ALM (audio-language).

    Concrete subclasses (e.g. `VLMModelTester`, `ALMModelTester`) supply:
    - the modality-specific sub-config class (`vision_config_class` for VLMs, `audio_config_class` for ALMs, ...),
    - the modality-specific defaults and helper methods,
    - the hooks `_build_modality_sub_configs` and `_prepare_modality_inputs`,
    - optionally an extended `_special_token_ids` and `pipeline_model_mapping`.

    This tester provides shared logic for evaluating and verifying models that combine text with other modalities,
    centering on the needs of vision-language (VLM) and audio-language (ALM) models.
    """
```
looking at how you did the audio and vision testers, maybe we can consider inheriting a multimodal tester from "Text" and overriding input preparation? Or is it too different to inherit?
Almost all the methods are overridden, and the `__init__` in particular diverges, so I'd rather keep them separated. Though I agree that having 2 different sources of truth for the text model params is inconvenient; added this commit for that.
```python
def _prepare_modality_inputs(self, input_ids, config):
    pixel_values = self.create_pixel_values()
    input_ids = self.place_image_tokens(input_ids, config)
    return input_ids, {"pixel_values": pixel_values}, pixel_values
```
looks a bit weird to me, I'd prefer to return a dict and pass it over into `get_additional_inputs`
agree, fixed here; kept returning `input_ids, modality_inputs` (so `input_ids` is not in the returned dict) for clarity
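As a toy, torch-free illustration of the agreed-on shape (names are made up, not the tester's real API): the hook hands back `input_ids` plus a single dict of modality inputs, rather than the earlier three-value tuple.

```python
def prepare_modality_inputs(input_ids, pixel_values):
    # Return the (possibly edited) input_ids separately, and everything
    # modality-specific in one dict that downstream code can merge in.
    modality_inputs = {"pixel_values": pixel_values}
    return input_ids, modality_inputs

ids, extra = prepare_modality_inputs([1, 2, 3], pixel_values=[[0.0, 0.0]])
```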
[For maintainers] Suggested jobs to run (before merge): run-slow: audioflamingo3, esm, gemma3, glmasr, granite_speech, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45391&sha=5e36c9
```python
input_ids = input_ids.clone()
input_ids[input_ids == self.audio_token_id] = self.pad_token_id
for i in range(input_ids.shape[0]):
    n = num_audio_tokens[i].item() if isinstance(num_audio_tokens, torch.Tensor) else num_audio_tokens
    if 1 + int(n) > self.seq_length:
        raise ValueError(
            f"Cannot place {int(n)} audio tokens after BOS in a sequence of length {self.seq_length}. "
            "This likely indicates a mismatch between your feature extraction/configuration and your sequence length. "
            "Please ensure `seq_length` is >= the number of audio embedding positions + 1."
        )
    input_ids[i, 1 : 1 + int(n)] = self.audio_token_id
return input_ids
```
I like it, allows testing different numbers of multimodal data items per sample!
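A standalone sketch of that per-sample placement (token ids and shapes here are assumed, not the tester's real values): position 0 keeps the BOS token, and sample `i` gets `num_audio_tokens[i]` placeholders right after it.

```python
import torch

BOS_ID, TEXT_ID, AUDIO_ID = 1, 5, 9  # made-up ids for illustration

def place_audio_tokens(input_ids, num_audio_tokens, seq_length):
    input_ids = input_ids.clone()
    for i in range(input_ids.shape[0]):
        n = int(num_audio_tokens[i])
        if 1 + n > seq_length:
            raise ValueError(f"Cannot place {n} audio tokens after BOS in length {seq_length}.")
        # Each sample gets its own number of placeholder tokens after BOS.
        input_ids[i, 1 : 1 + n] = AUDIO_ID
    return input_ids

batch = torch.full((2, 6), TEXT_ID)
batch[:, 0] = BOS_ID
placed = place_audio_tokens(batch, torch.tensor([2, 3]), seq_length=6)
```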
```python
        return {self.audio_config_key: self.get_audio_config()}

    def _prepare_modality_inputs(self, input_ids, config):
        # TODO: add a clear diagram that explains input prep ?
```
```python
def __init__(self, parent, **kwargs):
    self.parent = parent
    # Overrides of _TEXT_MODEL_TESTER_DEFAULTS
    kwargs.setdefault("seq_length", 7 + kwargs.get("num_image_tokens", (kwargs.get("image_size", 8) // kwargs.get("patch_size", 4)) ** 2))
```
nit: could we split this into multiple lines for readability?
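One way the nit could be addressed (the surrounding kwargs handling is assumed): pull the intermediate values out into named locals before the `setdefault`.

```python
kwargs = {}  # stand-in for the tester's keyword arguments

image_size = kwargs.get("image_size", 8)
patch_size = kwargs.get("patch_size", 4)
num_image_tokens = kwargs.get("num_image_tokens", (image_size // patch_size) ** 2)
kwargs.setdefault("seq_length", 7 + num_image_tokens)
```

With the defaults above this gives `seq_length = 7 + (8 // 4) ** 2 = 11`, the same value the one-liner computes.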
What does this PR do?
Similarly to the VLM tester, this patch introduces an audio tester class, used in
Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
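The line-count reduction comes from the pattern itself; a toy, torch-free illustration (not the real tester API) of why a concrete tester shrinks: the base class owns input preparation, and only the modality-specific hook remains per model.

```python
class BaseMultimodalTester:
    seq_length = 7

    def prepare_inputs(self):
        # Shared boilerplate (config introspection, input prep, ...) lives here.
        input_ids = [1] * self.seq_length
        return self._prepare_modality_inputs(input_ids)

    def _prepare_modality_inputs(self, input_ids):
        raise NotImplementedError("Subclasses supply modality-specific inputs.")

class ToyAudioTester(BaseMultimodalTester):
    # The handful of per-model lines: just the modality hook.
    def _prepare_modality_inputs(self, input_ids):
        return input_ids, {"input_features": [[0.0] * 4]}

ids, extra = ToyAudioTester().prepare_inputs()
```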