audio tester class #45391

Open
tarekziade wants to merge 31 commits into main from tarekziade-audio-test

Conversation

@tarekziade
Collaborator

What does this PR do?

Similar to the VLM tester, this patch introduces an audio tester class, used in

  • Qwen2Audio
  • AudioFlamingo3
  • GraniteSpeech

Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
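As a rough illustration of the intended shape (all class and method names below are hypothetical stand-ins, not the actual API added by this PR), a per-model tester reduces to a handful of declarations while the base class owns the shared machinery:

```python
# Hypothetical sketch of the pattern this PR introduces: the shared base
# owns the boilerplate, and a new model's tester only declares its specifics.
# None of these names are the real transformers classes.

class ALMModelTesterBase:
    """Stand-in for the shared audio tester base (config introspection,
    input preparation, and common skips would live here)."""

    model_class = None
    audio_config_key = "audio_config"

    def prepare_config_and_inputs(self):
        # Shared input-preparation boilerplate would go here.
        return {self.audio_config_key: {}, "model_class": self.model_class}


class DemoAudioModelTester(ALMModelTesterBase):
    # The whole per-model tester: just the model-specific declarations.
    model_class = "DemoAudioForConditionalGeneration"


tester = DemoAudioModelTester()
print(tester.prepare_config_and_inputs()["model_class"])
```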

@tarekziade tarekziade self-assigned this Apr 13, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade
Collaborator Author

run-slow: audioflamingo3, granite_speech, qwen2_audio

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 5472faa4 workflow commit (merge commit)
PR 0817bdbd branch commit (from PR)
main a5533957 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

Contributor

@eustlb left a comment

This is cool!! 🔥
Models that should be covered by this PR:

  • audioflamingo3
  • glmasr
  • granite_speech
  • higgs_audio_v2
  • kyutai_speech_to_text
  • qwen2_audio
  • vibevoice_asr
  • voxtral
  • voxtral_realtime
  • musicflamingo

might also be covered:

  • gemma3n
  • gemma4
  • qwen2_5_omni
  • qwen3_omni_moe

Comment threads on tests/alm_tester.py (4 outdated, 1 open)
Comment on lines +156 to +160
def get_num_audio_tokens(self, audio_features):
    """Compute number of audio placeholder tokens from features. Override for different subsampling."""
    # Default: 2-stage pooling (common for Whisper-style encoders)
    input_length = (audio_features.shape[-1] - 1) // 2 + 1
    return (input_length - 2) // 2 + 1
Contributor

we shouldn't put Whisper defaults here but rather force subclasses to write this method
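One way to realize that suggestion (a sketch with illustrative class names, not the PR's code): make the base method raise, and move the Whisper-style formula into the subclasses that actually use it.

```python
# Sketch of the suggestion above: no Whisper default in the base class;
# each subclass must spell out its encoder's subsampling. Names illustrative.

class ALMModelTesterBase:
    def get_num_audio_tokens(self, audio_features):
        raise NotImplementedError(
            "Subclasses must implement get_num_audio_tokens for their encoder's subsampling."
        )


class WhisperStyleTester(ALMModelTesterBase):
    def get_num_audio_tokens(self, audio_features):
        # The 2-stage pooling formula from the original default.
        input_length = (audio_features.shape[-1] - 1) // 2 + 1
        return (input_length - 2) // 2 + 1


class FakeFeatures:
    # Minimal stand-in for a feature tensor: only .shape is needed here.
    shape = (1, 80, 3000)


print(WhisperStyleTester().get_num_audio_tokens(FakeFeatures()))  # 750
```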

Comment threads on tests/alm_tester.py (3 outdated)
@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/glmasr", "models/granite_speech", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 07b039e2 workflow commit (merge commit)
PR c9534432 branch commit (from PR)
main 77c0e6e7 base commit (on main)

Model CI Report

1 new failed test from this PR 😭

  • voxtral_realtime:
    tests/models/voxtral_realtime/test_modeling_voxtral_realtime.py::VoxtralRealtimeForConditionalGenerationModelTest::test_mismatching_num_audio_tokens (✅ ⟹ ❌)

Collaborator Author

@tarekziade left a comment

Just a question about a potential coverage regression. The rest looks OK to me; deferring to core maintainers, but for my part it's good to go.


@require_torch
- class AudioFlamingo3ForConditionalGenerationModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+ class AudioFlamingo3ForConditionalGenerationModelTest(ALMModelTest, unittest.TestCase):
Collaborator Author

test_model_base_model_prefix is now class-skipped; we might be losing coverage here, since the skip is global in AudioModelTest?

Contributor

these tests were all already skipped in the test_modeling_xx files affected by this PR, so we should be good. Also, this will be unskipped in #45534

@@ -159,47 +90,10 @@ def test_sdpa_can_compile_dynamic(self):
def test_sdpa_can_dispatch_on_flash(self):
Collaborator Author

Fly-by: is this really about compiling?

Maybe "Flash-attn dispatch path not validated for this model"?

Contributor

yep, that's right. idk why this test was skipped before, but it can be un-skipped

@tarekziade
Collaborator Author

Also can you check the Voxtral failure?

@eustlb
Contributor

eustlb commented Apr 22, 2026

run-slow: audioflamingo3, glmasr, granite_speech, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, vibevoice_asr, voxtral, voxtral_realtime

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/glmasr", "models/granite_speech", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 62000676 workflow commit (merge commit)
PR dde65f61 branch commit (from PR)
main bca7eee6 base commit (on main)

Model CI Report

1 new failed test from this PR 😭

  • voxtral_realtime:
    tests/models/voxtral_realtime/test_modeling_voxtral_realtime.py::VoxtralRealtimeForConditionalGenerationModelTest::test_mismatching_num_audio_tokens (✅ ⟹ ❌)

@tarekziade
Collaborator Author

run-slow: voxtral_realtime

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/voxtral_realtime"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN bbe42be1 workflow commit (merge commit)
PR b47621a9 branch commit (from PR)
main 7187177f base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@eustlb
Contributor

eustlb commented Apr 23, 2026

@zucchini-nlp pinging you here for review since I've touched vlm_tester.py

@eustlb eustlb requested a review from zucchini-nlp April 23, 2026 13:08
Member

@zucchini-nlp left a comment

Love the new MultimodalTester, thanks for adding it! Should give us more freedom to combine modalities!

Looks good to me, just a couple of nit comments

Comment on lines +33 to +42
class MultiModalModelTester:
    """Shared tester base for VLM (vision-language) and ALM (audio-language).

    Concrete subclasses (e.g. `VLMModelTester`, `ALMModelTester`) supply:
    - the modality-specific sub-config class (`vision_config_class` for VLMs, `audio_config_class` for ALMs, ...),
    - the modality-specific defaults and helper methods,
    - the hooks `_build_modality_sub_configs` and `_prepare_modality_inputs`,
    - optionally an extended `_special_token_ids` and `pipeline_model_mapping`.

    This tester provides shared logic for evaluating and verifying models that combine text with other modalities,
Member

niiiice ❤️

Comment on lines +148 to +153
# Avoid flaky tests by scrubbing any accidental special tokens produced by ids_tensor.
# Modality placeholder tokens are scrubbed and placed by `_prepare_modality_inputs`.
safe_token_id = self._safe_token_id()
input_ids[input_ids == self.pad_token_id] = safe_token_id
input_ids[input_ids == self.eos_token_id] = safe_token_id

Member

maybe input_ids[input_ids == self._special_token_ids] = safe_token_id, so we skip all special tokens from appearing?

Contributor

yep much better thanks
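A note on mechanics: a bare `==` against a list of ids won't broadcast the way a single scalar does, so the generalized scrub likely needs something like `torch.isin`. A minimal sketch (the `special_token_ids`-as-a-list shape is an assumption, not the PR's final code):

```python
# Sketch of the generalized scrub discussed above: replace every special
# token id in one pass. torch.isin handles the many-ids membership test
# that `input_ids == [id1, id2, ...]` cannot express directly.
import torch

def scrub_special_tokens(input_ids, special_token_ids, safe_token_id):
    mask = torch.isin(input_ids, torch.tensor(special_token_ids))
    # masked_fill returns a new tensor with masked positions replaced.
    return input_ids.masked_fill(mask, safe_token_id)

ids = torch.tensor([[0, 5, 1, 2, 7]])
print(scrub_special_tokens(ids, [0, 1, 2], safe_token_id=9))  # tensor([[9, 5, 9, 9, 7]])
```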

Comment thread tests/vlm_tester.py Outdated
Comment on lines 60 to 62
kwargs.setdefault("moe_num_shared_experts", 2)
kwargs.setdefault("num_experts_per_tok", 2)
kwargs.setdefault("num_experts", 8)
Member

these are still text-related config values, should we store it all in MultimodalTester?

Contributor

moved to _TEXT_MODEL_TESTER_DEFAULTS

Comment on lines +33 to +44
class MultiModalModelTester:
    """Shared tester base for VLM (vision-language) and ALM (audio-language).

    Concrete subclasses (e.g. `VLMModelTester`, `ALMModelTester`) supply:
    - the modality-specific sub-config class (`vision_config_class` for VLMs, `audio_config_class` for ALMs, ...),
    - the modality-specific defaults and helper methods,
    - the hooks `_build_modality_sub_configs` and `_prepare_modality_inputs`,
    - optionally an extended `_special_token_ids` and `pipeline_model_mapping`.

    This tester provides shared logic for evaluating and verifying models that combine text with other modalities,
    centering on the needs of vision-language (VLM) and audio-language (ALM) models.
    """
Member

looking at how you did the audio and vision testers, maybe we can consider inheriting a multimodal from "Text" and overriding input preparation? Or is it too different to inherit

Contributor

Almost all the methods are overridden, and in particular the init diverges, so I'd rather keep them separate. Though I agree that having 2 different sources of truth for the text model params is inconvenient. Added this commit for that.

Comment thread tests/vlm_tester.py Outdated
def _prepare_modality_inputs(self, input_ids, config):
    pixel_values = self.create_pixel_values()
    input_ids = self.place_image_tokens(input_ids, config)
    return input_ids, {"pixel_values": pixel_values}, pixel_values
Member

looks a bit weird to me, I'd prefer to return a dict and pass it over into get_additional_inputs

Contributor

agree, fixed here; kept returning `input_ids, modality_inputs` (so input_ids is not in the returned dict) for clarity
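For illustration, the agreed-upon hook shape might look like this (a self-contained sketch with stubbed helper methods, not the PR's actual code):

```python
# Sketch of the dict-returning hook shape discussed above. The helpers are
# stubbed with strings; in the real tester they would build tensors.

class VLMModelTesterSketch:
    def create_pixel_values(self):
        return "pixel_values_tensor"  # stub

    def place_image_tokens(self, input_ids, config):
        return input_ids  # stub

    def _prepare_modality_inputs(self, input_ids, config):
        pixel_values = self.create_pixel_values()
        input_ids = self.place_image_tokens(input_ids, config)
        # input_ids returned separately for clarity; everything else in a dict.
        return input_ids, {"pixel_values": pixel_values}

ids, modality_inputs = VLMModelTesterSketch()._prepare_modality_inputs([1, 2], None)
print(modality_inputs)  # {'pixel_values': 'pixel_values_tensor'}
```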

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, esm, gemma3, glmasr, granite_speech, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

@eustlb eustlb requested a review from zucchini-nlp April 27, 2026 07:46
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45391&sha=5e36c9

Member

@zucchini-nlp left a comment

Great work, thanks

Comment thread tests/alm_tester.py
Comment on lines +80 to +91
input_ids = input_ids.clone()
input_ids[input_ids == self.audio_token_id] = self.pad_token_id
for i in range(input_ids.shape[0]):
    n = num_audio_tokens[i].item() if isinstance(num_audio_tokens, torch.Tensor) else num_audio_tokens
    if 1 + int(n) > self.seq_length:
        raise ValueError(
            f"Cannot place {int(n)} audio tokens after BOS in a sequence of length {self.seq_length}. "
            "This likely indicates a mismatch between your feature extraction/configuration and your sequence length. "
            "Please ensure `seq_length` is >= the number of audio embedding positions + 1."
        )
    input_ids[i, 1 : 1 + int(n)] = self.audio_token_id
return input_ids
Member

i like it, allows to test different numbers of multimodal data per sample !

Comment thread tests/alm_tester.py
    return {self.audio_config_key: self.get_audio_config()}

def _prepare_modality_inputs(self, input_ids, config):
    # TODO: add a clear diagram that explains input prep ?
Member

TODO for next PR?

Comment thread tests/vlm_tester.py
def __init__(self, parent, **kwargs):
    self.parent = parent
    # Overrides of _TEXT_MODEL_TESTER_DEFAULTS
    kwargs.setdefault("seq_length", 7 + kwargs.get("num_image_tokens", (kwargs.get("image_size", 8) // kwargs.get("patch_size", 4)) ** 2))
Member

nit: could we split this across multiple lines for readability?
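One possible multi-line split with identical semantics (names as in the quoted snippet; a suggestion, not the committed fix):

```python
# A readability-oriented split of the one-liner quoted above. With an empty
# kwargs dict, the default works out to 7 + (8 // 4) ** 2 == 11.
kwargs = {}  # e.g. no overrides passed by the subclass

image_size = kwargs.get("image_size", 8)
patch_size = kwargs.get("patch_size", 4)
default_num_image_tokens = (image_size // patch_size) ** 2
num_image_tokens = kwargs.get("num_image_tokens", default_num_image_tokens)
kwargs.setdefault("seq_length", 7 + num_image_tokens)

print(kwargs["seq_length"])  # 11
```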

4 participants