audio tester class #45391
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

run-slow: audioflamingo3, granite_speech, qwen2_audio

This comment contains models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
eustlb
left a comment
This is cool!! 🔥
Models that should be covered by this PR:
- audioflamingo3
- glmasr
- granite_speech
- higgs_audio_v2
- kyutai_speech_to_text
- qwen2_audio
- vibevoice_asr
- voxtral
- voxtral_realtime
- musicflamingo
might:
- gemma3n
- gemma4
- qwen2_5_omni
- qwen3_omni_moe
```python
def get_num_audio_tokens(self, audio_features):
    """Compute number of audio placeholder tokens from features. Override for different subsampling."""
    # Default: 2-stage pooling (common for Whisper-style encoders)
    input_length = (audio_features.shape[-1] - 1) // 2 + 1
    return (input_length - 2) // 2 + 1
```
we shouldn't put whisper defaults here but rather force subclasses to implement this method
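A minimal sketch of that suggestion (class names here are illustrative, not the merged implementation): the base tester raises on the hook, and a Whisper-style subclass supplies the two-stage pooling arithmetic quoted above.

```python
from types import SimpleNamespace

class BaseAudioModelTester:
    def get_num_audio_tokens(self, audio_features):
        """Compute the number of audio placeholder tokens from features."""
        raise NotImplementedError("Subclasses must implement get_num_audio_tokens.")

class WhisperStyleTester(BaseAudioModelTester):
    def get_num_audio_tokens(self, audio_features):
        # Two-stage subsampling: each stage roughly halves the frame count.
        input_length = (audio_features.shape[-1] - 1) // 2 + 1
        return (input_length - 2) // 2 + 1

# Stand-in for an (n_mels, n_frames) feature tensor: 3000 frames -> 750 tokens.
feats = SimpleNamespace(shape=(80, 3000))
```

With this shape, `WhisperStyleTester().get_num_audio_tokens(feats)` yields 750, while calling the base class raises instead of silently applying Whisper math to a non-Whisper encoder.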
This comment contains models: ["models/audioflamingo3", "models/glmasr", "models/granite_speech", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
CI Results / Commit Info
Model CI Report: ❌ 1 new failed test from this PR 😭
tarekziade
left a comment
Just a question about a potential coverage regression. The rest seems OK to me. Deferring to core maintainers, but for my part it's good to go.
```diff
 @require_torch
-class AudioFlamingo3ForConditionalGenerationModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+class AudioFlamingo3ForConditionalGenerationModelTest(ALMModelTest, unittest.TestCase):
```
`test_model_base_model_prefix` is now class-skipped; we might be losing coverage here, since the skip is global in `AudioModelTest`?
these tests were all already skipped for the `test_modeling_xx` files affected by this PR, so we should be good. Also, this will be un-skipped in #45534
```diff
@@ -159,47 +90,10 @@ def test_sdpa_can_compile_dynamic(self):
     def test_sdpa_can_dispatch_on_flash(self):
```
fly-by: is this really about compiling? Maybe the skip reason should be "Flash-attn dispatch path not validated for this model"?
yep that's right, idk why this test was skipped before, but it can be un-skipped
Also can you check the Voxtral failure?

run-slow: audioflamingo3, glmasr, granite_speech, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, vibevoice_asr, voxtral, voxtral_realtime

This comment contains models: ["models/audioflamingo3", "models/glmasr", "models/granite_speech", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
CI Results / Commit Info
Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: voxtral_realtime

This comment contains models: ["models/voxtral_realtime"]
@zucchini-nlp pinging you here for review since I've touched
zucchini-nlp
left a comment
Love the new MultimodalTester, thanks for adding it! Should give us more freedom to combine modalities!
Looks good to me, just a couple of nit comments
```python
class MultiModalModelTester:
    """Shared tester base for VLM (vision-language) and ALM (audio-language).

    Concrete subclasses (e.g. `VLMModelTester`, `ALMModelTester`) supply:
    - the modality-specific sub-config class (`vision_config_class` for VLMs, `audio_config_class` for ALMs, ...),
    - the modality-specific defaults and helper methods,
    - the hooks `_build_modality_sub_configs` and `_prepare_modality_inputs`,
    - optionally an extended `_special_token_ids` and `pipeline_model_mapping`.

    This tester provides shared logic for evaluating and verifying models that combine text with other modalities,
```
```python
# Avoid flaky tests by scrubbing any accidental special tokens produced by ids_tensor.
# Modality placeholder tokens are scrubbed and placed by `_prepare_modality_inputs`.
safe_token_id = self._safe_token_id()
input_ids[input_ids == self.pad_token_id] = safe_token_id
input_ids[input_ids == self.eos_token_id] = safe_token_id
```
maybe `input_ids[input_ids == self._special_token_ids] = safe_token_id`, so we prevent all special tokens from appearing?
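A sketch of what that suggestion could look like, with made-up token ids: an elementwise `==` against a collection of ids won't broadcast the membership test, but `torch.isin` scrubs them all in one shot.

```python
import torch

special_token_ids = torch.tensor([0, 1, 2])  # assumed pad/bos/eos ids
safe_token_id = 4                            # assumed to be outside the special set
input_ids = torch.tensor([[5, 0, 7, 2],
                          [1, 9, 3, 0]])

# Replace every special token in one pass instead of one `==` per token id.
input_ids[torch.isin(input_ids, special_token_ids)] = safe_token_id
```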
```python
kwargs.setdefault("moe_num_shared_experts", 2)
kwargs.setdefault("num_experts_per_tok", 2)
kwargs.setdefault("num_experts", 8)
```
these are still text-related config values, should we store them all in MultimodalTester?
moved to `_TEXT_MODEL_TESTER_DEFAULTS`
```python
class MultiModalModelTester:
    """Shared tester base for VLM (vision-language) and ALM (audio-language).

    Concrete subclasses (e.g. `VLMModelTester`, `ALMModelTester`) supply:
    - the modality-specific sub-config class (`vision_config_class` for VLMs, `audio_config_class` for ALMs, ...),
    - the modality-specific defaults and helper methods,
    - the hooks `_build_modality_sub_configs` and `_prepare_modality_inputs`,
    - optionally an extended `_special_token_ids` and `pipeline_model_mapping`.

    This tester provides shared logic for evaluating and verifying models that combine text with other modalities,
    centering on the needs of vision-language (VLM) and audio-language (ALM) models.
    """
```
looking at how you did the audio and vision testers, maybe we can consider inheriting a multimodal tester from "Text" and overriding input preparation? Or is it too different to inherit?
Almost all the methods are overridden, and the `__init__` in particular diverges, so I'd rather keep them separated. Though I agree that having 2 different sources of truth for the text model params is inconvenient; added this commit for that.
```python
def _prepare_modality_inputs(self, input_ids, config):
    pixel_values = self.create_pixel_values()
    input_ids = self.place_image_tokens(input_ids, config)
    return input_ids, {"pixel_values": pixel_values}, pixel_values
```
looks a bit weird to me, I'd prefer to return a dict and pass it over into `get_additional_inputs`
agree, fixed here; kept returning `input_ids, modality_inputs` (so `input_ids` is not in the returned dict) for clarity
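As a toy, torch-free illustration of the agreed-on shape (names are made up, not the tester's real API): the hook hands back `input_ids` plus a single dict of modality inputs, rather than the earlier three-value tuple.

```python
def prepare_modality_inputs(input_ids, pixel_values):
    # Return the (possibly edited) input_ids separately, and everything
    # modality-specific in one dict that downstream code can merge in.
    modality_inputs = {"pixel_values": pixel_values}
    return input_ids, modality_inputs

ids, extra = prepare_modality_inputs([1, 2, 3], pixel_values=[[0.0, 0.0]])
```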
[For maintainers] Suggested jobs to run (before merge): run-slow: audioflamingo3, esm, gemma3, glmasr, granite_speech, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45391&sha=5e36c9
```python
input_ids = input_ids.clone()
input_ids[input_ids == self.audio_token_id] = self.pad_token_id
for i in range(input_ids.shape[0]):
    n = num_audio_tokens[i].item() if isinstance(num_audio_tokens, torch.Tensor) else num_audio_tokens
    if 1 + int(n) > self.seq_length:
        raise ValueError(
            f"Cannot place {int(n)} audio tokens after BOS in a sequence of length {self.seq_length}. "
            "This likely indicates a mismatch between your feature extraction/configuration and your sequence length. "
            "Please ensure `seq_length` is >= the number of audio embedding positions + 1."
        )
    input_ids[i, 1 : 1 + int(n)] = self.audio_token_id
return input_ids
```
I like it, allows testing different numbers of multimodal data items per sample!
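A standalone sketch of that per-sample placement (token ids and shapes here are assumed, not the tester's real values): position 0 keeps the BOS token, and sample `i` gets `num_audio_tokens[i]` placeholders right after it.

```python
import torch

BOS_ID, TEXT_ID, AUDIO_ID = 1, 5, 9  # made-up ids for illustration

def place_audio_tokens(input_ids, num_audio_tokens, seq_length):
    input_ids = input_ids.clone()
    for i in range(input_ids.shape[0]):
        n = int(num_audio_tokens[i])
        if 1 + n > seq_length:
            raise ValueError(f"Cannot place {n} audio tokens after BOS in length {seq_length}.")
        # Each sample gets its own number of placeholder tokens after BOS.
        input_ids[i, 1 : 1 + n] = AUDIO_ID
    return input_ids

batch = torch.full((2, 6), TEXT_ID)
batch[:, 0] = BOS_ID
placed = place_audio_tokens(batch, torch.tensor([2, 3]), seq_length=6)
```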
```python
        return {self.audio_config_key: self.get_audio_config()}

    def _prepare_modality_inputs(self, input_ids, config):
        # TODO: add a clear diagram that explains input prep ?
```
```python
def __init__(self, parent, **kwargs):
    self.parent = parent
    # Overrides of _TEXT_MODEL_TESTER_DEFAULTS
    kwargs.setdefault("seq_length", 7 + kwargs.get("num_image_tokens", (kwargs.get("image_size", 8) // kwargs.get("patch_size", 4)) ** 2))
```
nit: could we split this into multiple lines for readability?
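One way the nit could be addressed (the surrounding kwargs handling is assumed): pull the intermediate values out into named locals before the `setdefault`.

```python
kwargs = {}  # stand-in for the tester's keyword arguments

image_size = kwargs.get("image_size", 8)
patch_size = kwargs.get("patch_size", 4)
num_image_tokens = kwargs.get("num_image_tokens", (image_size // patch_size) ** 2)
kwargs.setdefault("seq_length", 7 + num_image_tokens)
```

With the defaults above this gives `seq_length = 7 + (8 // 4) ** 2 = 11`, the same value the one-liner computes.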
What does this PR do?
Similarly to the VLM tester, this patch introduces an audio tester class, used in
Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
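The line-count reduction comes from the pattern itself; a toy, torch-free illustration (not the real tester API) of why a concrete tester shrinks: the base class owns input preparation, and only the modality-specific hook remains per model.

```python
class BaseMultimodalTester:
    seq_length = 7

    def prepare_inputs(self):
        # Shared boilerplate (config introspection, input prep, ...) lives here.
        input_ids = [1] * self.seq_length
        return self._prepare_modality_inputs(input_ids)

    def _prepare_modality_inputs(self, input_ids):
        raise NotImplementedError("Subclasses supply modality-specific inputs.")

class ToyAudioTester(BaseMultimodalTester):
    # The handful of per-model lines: just the modality hook.
    def _prepare_modality_inputs(self, input_ids):
        return input_ids, {"input_features": [[0.0] * 4]}

ids, extra = ToyAudioTester().prepare_inputs()
```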