Qwen3 ASR and Forced Aligner #43838
@mbtariq82 thanks for opening the PR! We're definitely interested in adding the model and were planning to work on it. Could you go ahead with the rest of the model? I can iterate with you on it. I see you started with a modular file, which is great. Below are some pointers to recent audio LM models that may help you with the other files / give an idea of our conventions. Thanks 🤗
Create tester class and test processor initialization
create methods for common tests
I'm struggling to get `test_apply_chat_template_audio` from `test_processing_common.py` to pass. Specifically, the final part of the test, which calls `apply_chat_template` with `continue_final_message=True`, fails with `ValueError: continue_final_message is set but the final message does not appear in the chat after applying the chat template!...`. I've verified that the `chat_template` is correctly loaded from the model checkpoint Qwen/Qwen3-ASR-0.6B. According to ChatGPT, the chat template provided by Qwen does not correctly render the final assistant message, so I think the only way to solve this is to override the `apply_chat_template` method and add some custom logic before calling `super().apply_chat_template()`?
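For reference, the failing step looks roughly like this (a sketch assuming `processor` is the loaded Qwen3ASRProcessor; the message content is illustrative, not the exact test fixture):

```python
# Rough sketch of the final step of test_apply_chat_template_audio
messages = [
    {"role": "user", "content": [{"type": "audio", "path": "sample.wav"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "The transcription is"}]},
]
# Raises the ValueError above if the rendered chat no longer ends with the
# final assistant text:
prompt = processor.apply_chat_template(messages, continue_final_message=True, tokenize=False)
```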
If it's only about getting the test to pass and the model is behaving as expected, let's avoid overriding. For now you can leave the test failing, and if necessary we can override or even skip the test later on.
When you finish the modeling and the integration tests that produce outputs equivalent to the original (e.g. as done for Audio Flamingo), I can already take a look and give some feedback! We can look at the test after that.
Create integration test
Setup Qwen3ASRModelTester
So between the current version and v4.57.6, the "default" key was removed from `ROPE_INIT_FUNCTIONS`. Qwen3-ASR was built using v4.57.6, and the checkpoint uses the "default" key. I've changed the `rope_type` to "linear" in `Qwen3ASRThinkerTextRotaryEmbedding` for now, but I'm not sure if this is correct. I also changed the "attentions" PyTorch hooks: they were set on `Qwen3ASRThinkerTextAttention`, which is not used at all in the base class (maybe they plan to use it in the future, but I'm not sure), so I've changed them to `Qwen3ASRTextAttention` to get the tests to pass. I've added the entire model and all the tests are passing.
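If a remap turns out to be the right fix, the idea would be something like this (a hedged sketch; the config handling and the choice of "linear" with factor 1.0 are assumptions, not a confirmed equivalence):

```python
# Sketch: remap the legacy rope_type before it is looked up in ROPE_INIT_FUNCTIONS.
rope_scaling = getattr(config, "rope_scaling", None) or {}
rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
if rope_type == "default":
    # assumption: linear scaling with factor 1.0 reproduces the old unscaled "default"
    config.rope_scaling = {"rope_type": "linear", "factor": 1.0}
```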
Add property methods to config
Add base_model_prefix and wrapper method to generation class
…ion weights CLEANUP NEEDED
…n to Qwen3ASRTextAttention, Qwen3ASRThinkerTextAttention is never instantiated and so 'attentions' was not being properly propagated
Fix integration tests
ebezzam
left a comment
Hi @mbtariq82 thanks for working on this integration! I'm doing a small review because I noticed you started a modular file, but aren't making full use of its functionality to generate the configuration, processing, and modeling from existing components in Transformers. I gave some pointers for the configuration and processing but will let you check out the rest for the modeling components.
I encourage reading this page on using modular to contribute models: https://huggingface.co/docs/transformers/en/modular_transformers
And for practical examples you can see other modular files:
- Qwen3OmniMoe, which has a lot of similarity with the ASR model (namely removing the vision modalities): https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py
- A recent Audio LM addition: https://github.com/huggingface/transformers/blob/main/src/transformers/models/glmasr/modular_glmasr.py
Note: I think you will have to use "Asr" instead of "ASR" in your model naming, because the modular script prefers CamelCase.
…gn RoPE position handling with cache_position
Refactor position_ids construction to be fully cache_position-driven and generation-safe.
- Compute batch_size/seq_length from inputs_embeds
- Initialize cache_position when absent
- Build 3D position_ids from cache_position
- Compute rope_deltas once during prefill
- Reuse rope_deltas for subsequent decode steps
Removes the legacy attention_mask-dependent branch that was incompatible with static cache generation. Ensures correct RoPE offsets for multimodal inputs under both dynamic and static cache modes.
I made some big changes in the base model's forward in this commit: 0b3248d. I also removed `get_rope_index`.
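Roughly, the new flow looks like this (a sketch of the steps from the commit message above; the shapes and the `compute_rope_deltas` helper are assumptions for illustration, not the exact PR code):

```python
# cache_position-driven position_ids, safe for both dynamic and static caches
batch_size, seq_length = inputs_embeds.shape[:2]
if cache_position is None:
    past_len = past_key_values.get_seq_length() if past_key_values is not None else 0
    cache_position = torch.arange(past_len, past_len + seq_length, device=inputs_embeds.device)
if position_ids is None:
    # 3D position_ids built from cache_position (one plane per rotary section)
    position_ids = cache_position.view(1, 1, -1).expand(3, batch_size, -1)
    if rope_deltas is None:
        rope_deltas = compute_rope_deltas(...)  # hypothetical helper: computed once during prefill
    # decode steps reuse the prefill offsets
    position_ids = position_ids + rope_deltas.view(1, batch_size, 1)
```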
run-slow: qwen3_asr
ebezzam
left a comment
Self-review on Qwen3 ASR particularities (before potential refactor of Qwen3 audio encoder)
- [bezzam/Qwen3-ASR-1.7B](https://huggingface.co/bezzam/Qwen3-ASR-1.7B)
- [bezzam/Qwen3-ASR-0.6B](https://huggingface.co/bezzam/Qwen3-ASR-0.6B)
- [bezzam/Qwen3-ForcedAligner-0.6B](https://huggingface.co/bezzam/Qwen3-ForcedAligner-0.6B)
TODO: update checkpoints in the end
```python
MODEL_FOR_FORCED_ALIGNMENT_MAPPING_NAMES = OrderedDict(
    [
        ("qwen3_forced_aligner", "Qwen3ASRForForcedAlignment"),
    ]
)
```
How about a new class of forced alignment models?
- Input: audio and text
- Output: timestamps
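If we go that route, the user-facing flow could look something like this (a hedged sketch; the input keys and result format below are assumptions to make the contract concrete, not the final API):

```python
# Hypothetical end-to-end sketch for a forced-alignment task: audio + text in,
# per-word timestamps out.
inputs = processor(text="hello world", audio=waveform, sampling_rate=16000, return_tensors="pt")
outputs = model(**inputs)  # e.g. Qwen3ASRForForcedAlignment
# hypothetical decoded result: one list of word/timestamp dicts per batch item, e.g.
# [[{"word": "hello", "start": 0.12, "end": 0.45}, {"word": "world", "start": 0.50, "end": 0.91}]]
```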
```python
``skip_special_tokens`` is hard-set to ``True`` for ``"parsed"`` and ``"transcription_only"``.
"""
valid_formats = ["raw", "parsed", "transcription_only"]
```
Similar to VibeVoice ASR: different formats for decoding the ASR output.
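A hedged usage sketch of the three formats (the keyword name `output_format` is an assumption based on the snippet above):

```python
# skip_special_tokens is forced to True for the latter two (see docstring above)
raw = processor.batch_decode(generated_ids, output_format="raw")  # full generated text
parsed = processor.batch_decode(generated_ids, output_format="parsed")  # e.g. language + transcription fields
texts = processor.batch_decode(generated_ids, output_format="transcription_only")  # transcription strings only
```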
```python
return [Qwen3ASRProcessor._parse_single_output(raw_text)["transcription"] for raw_text in text]
```

```python
@staticmethod
def _is_cjk_char(char: str) -> bool:
```
From here, I've largely kept many of these methods from the original codebase so that the post-processing leads to equivalent outputs. We can iterate on what we keep, and how, so that it fits Transformers conventions.
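For context, a check like `_is_cjk_char` typically looks as follows (a simplified sketch; the actual method in the PR may cover different Unicode ranges):

```python
@staticmethod
def _is_cjk_char(char: str) -> bool:
    # Simplified sketch covering the most common ranges
    code = ord(char)
    return (
        0x4E00 <= code <= 0x9FFF     # CJK Unified Ideographs
        or 0x3400 <= code <= 0x4DBF  # CJK Extension A
        or 0x3040 <= code <= 0x30FF  # Hiragana / Katakana
        or 0xAC00 <= code <= 0xD7AF  # Hangul syllables
    )
```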
```python
if lang == "japanese":
    try:
        import nagisa
    except ImportError:
        raise ImportError(
            "Japanese forced alignment requires the `nagisa` package. Install it with: pip install nagisa"
        )
    return Qwen3ASRProcessor._clean_tokens(nagisa.tagging(text).words)

if lang == "korean":
    try:
        from soynlp.tokenizer import LTokenizer
    except ImportError:
        raise ImportError(
            "Korean forced alignment requires the `soynlp` package. Install it with: pip install soynlp"
        )
    return Qwen3ASRProcessor._clean_tokens(LTokenizer().tokenize(text))
```
Should we keep such try-imports for Japanese and Korean?
```python
return [int(v) for v in result]
```

```python
def prepare_forced_aligner_inputs(
```
Similar in spirit to `apply_transcription_request`: provide a helper function so the user doesn't need to manually call `apply_chat_template`.
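A hedged usage sketch (the method exists in this PR, but the argument names here are assumptions for illustration):

```python
# Build forced-aligner inputs without manually calling apply_chat_template
inputs = processor.prepare_forced_aligner_inputs(
    text="hello world",
    audio=waveform,  # raw waveform at the processor's expected sampling rate
    return_tensors="pt",
)
outputs = model(**inputs)  # timestamp logits from the forced-aligner head
```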
```python
def decode_forced_alignment(
    self,
    logits,
    input_ids,
    word_lists: list[list[str]],
    timestamp_token_id: int,
    timestamp_segment_time: float | None = None,
) -> list[list[dict]]:
```
Things get a bit unconventional... is it ok to have this separate decode just for forced alignment? Or should forced alignment have its own processor, but then does that mean it should be in its own model folder?
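For reference, a hedged sketch of calling it end to end (following the signature above; the result keys are assumptions):

```python
word_lists = [["hello", "world"]]  # words to align, one list per batch item
alignments = processor.decode_forced_alignment(
    logits=outputs.logits,
    input_ids=inputs["input_ids"],
    word_lists=word_lists,
    timestamp_token_id=timestamp_token_id,  # id of the timestamp placeholder token
)
# alignments: list[list[dict]], e.g. one {"word", "start", "end"}-style dict per word
```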
run-slow: qwen3_asr
Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: qwen3_asr
Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: qwen3_asr
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, qwen3_asr, qwen3_omni_moe
@eustlb ready for review!
- I've defined a new audio encoder for Qwen3 ASR instead of reusing the one from Qwen3OmniMoe. As we saw together, Qwen3OmniMoe's audio encoder had operations which should have been in the feature extractor (and which hurt the torch compile speedup). I've made a new feature extractor object for Qwen3ASR, and as the torch compile example in the doc shows, we now get a speedup of 2.5× 🚀 (when using the encoder from Omni it was 1.7×).
- There are two types of models in this PR: ASR (audio LM approach) and a forced aligner (uses the audio encoder + a classification layer to predict word durations). I'm sure we will iterate on the latter as it's a new type of model 😄 The processor methods can definitely be improved; I left them mostly as-is from the original to get your input on what is Transformers compatible.
Note there are some comments from a previous self-review; you should see them in the "Files changed" tab!
| r""" | ||
| Constructs a Qwen3 ASR feature extractor. | ||
|
|
||
| Extracts 128-bin log-mel features from raw speech, then right-pads the mel time axis to a multiple of ``2 * n_window``. |
Essentially this is the same as Whisper's feature extractor, plus the right-padding and data-dependent ops that were previously done in the audio encoder of Qwen3 Omni MoE.
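The right-padding step itself is simple; a minimal sketch, assuming `mel` has shape `(num_mel_bins, num_frames)`:

```python
import torch.nn.functional as F

def pad_mel_to_multiple(mel, n_window):
    # Right-pad the time axis to a multiple of 2 * n_window, as described above
    multiple = 2 * n_window
    pad = (-mel.shape[-1]) % multiple
    return F.pad(mel, (0, pad)) if pad else mel
```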
```python
@auto_docstring(checkpoint="bezzam/Qwen3-ASR-1.7B")
@strict
class Qwen3ASREncoderConfig(Qwen2_5OmniAudioEncoderConfig):
```
Not using `Qwen3OmniMoeAudioEncoderConfig` because its `conv_chunksize` would be unused now that that logic has moved into the feature extractor.
```python
lengths = torch.where(lengths > 0, (lengths - 1) // 2 + 1, torch.zeros_like(lengths))
return lengths
```

```python
def forward(
```
Override `forward` so it is more compatible with torch compile, and move out what should be in the feature extractor!
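As an aside on the `lengths` formula in the snippet above: `(lengths - 1) // 2 + 1` is just the output length of a stride-2 downsampling step, i.e. `ceil(lengths / 2)`, with `torch.where` keeping zero-length entries at zero:

```python
# (n - 1) // 2 + 1 == ceil(n / 2) for n > 0
for n in (3000, 2999, 2, 1):
    assert (n - 1) // 2 + 1 == -(-n // 2)  # -(-n // 2) is ceil division
```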
```python
super().__init__(config)
self.num_timestamp_bins = config.num_timestamp_bins
self.model = Qwen3ASRModel(config)
self.classifier = nn.Linear(config.text_config.hidden_size, config.num_timestamp_bins, bias=False)
```
Classifier instead of LM head.
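So the head maps hidden states to timestamp-bin logits rather than vocabulary logits; roughly (a hedged sketch, shapes assumed, not the exact PR forward):

```python
hidden_states = self.model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
timestamp_logits = self.classifier(hidden_states)       # (batch, seq_len, num_timestamp_bins)
```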
```python
)
self.layer_idx = layer_idx

self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
```
`Qwen3ASRAudioAttention` requires `bias=True` for `k_proj`, but it's set to `bias=False` here?
Thanks for pointing this out! Looking into it; strange that the integration tests (between Transformers and the original) still produce equivalent outputs 🤔
```bash
# from this branch
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 pytest tests/models/qwen3_asr/test_modeling_qwen3_asr.py::Qwen3ASRForConditionalGenerationIntegrationTest
```
EDIT: actually it makes sense that this doesn't affect the output, because softmax is invariant to adding the same constant to every logit, which is what happens with a key projection bias: for a given query q, a bias b on `k_proj` adds the same q·b to every attention score. Computationally, it's slightly better to have `bias=False` for fewer parameters/memory/operations, but I don't think it makes a big difference.
Right now this line is generated via modular by directly inheriting from Whisper. We'll see during the review process whether we move away from Whisper's definition.
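A quick numerical check of the invariance claimed above (a standalone toy example, not PR code):

```python
import torch

q = torch.randn(8)
keys = torch.randn(5, 8)
b = torch.randn(8)  # a k_proj bias adds the same vector b to every key
scores = keys @ q
scores_biased = (keys + b) @ q  # shifts every score by the same constant (b @ q)
assert torch.allclose(torch.softmax(scores, -1), torch.softmax(scores_biased, -1), atol=1e-5)
```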
If the bias is small enough or the general distribution is not influenced, you could have just gotten lucky :D
What does this PR do?
This PR adds Qwen3-ASR to the Transformers library.
Fixes #43837
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Proposal to add Qwen3-ASR support #43837
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.