Qwen3 ASR and Forced Aligner #43838
@mbtariq82 thanks for opening the PR! We're definitely interested in adding the model and were planning to work on it. Could you go ahead with the rest of the model? I can iterate with you on it. I see you started with a modular file, which is great. Below are some pointers to recent audio LM models that may help you with the other files / give an idea of our conventions. Thanks 🤗
Create tester class and test processor initialization
create methods for common tests
I'm struggling to get `test_apply_chat_template_audio` from `test_processing_common.py` to pass. Specifically, the final part of the test, which calls `apply_chat_template` with `continue_final_message=True`, fails with `ValueError: continue_final_message is set but the final message does not appear in the chat after applying the chat template!...`. I've verified that the `chat_template` is correctly loaded from the model checkpoint Qwen/Qwen3-ASR-0.6B. According to ChatGPT, the chat template provided by Qwen does not correctly render the final assistant message, so I think the only way to solve this is to override the `apply_chat_template` method and add some custom logic before calling `super().apply_chat_template()`?
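For reference, the failing step looks roughly like this (a sketch assuming `processor` is the loaded Qwen3ASRProcessor; the message content is illustrative, not the exact test fixture):

```python
# Rough sketch of the final step of test_apply_chat_template_audio
messages = [
    {"role": "user", "content": [{"type": "audio", "path": "sample.wav"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "The transcription is"}]},
]
# Raises the ValueError above if the rendered chat no longer ends with the
# final assistant text:
prompt = processor.apply_chat_template(messages, continue_final_message=True, tokenize=False)
```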
If it's only about getting the test to pass and the model is behaving as expected, let's avoid overriding. For now you can leave the test failing, and if necessary we can override or even skip the test later on.
When you finish the modeling and the integration tests that produce outputs equivalent to the original (e.g. as done for Audio Flamingo), I can already take a look and give some feedback! We can look at the test after that.
Create integration test
Setup Qwen3ASRModelTester
So between the current version and v4.57.6, the "default" key was removed from `ROPE_INIT_FUNCTIONS`. Qwen3-ASR was built using v4.57.6, and the checkpoint uses the "default" key. I've changed the `rope_type` to "linear" in `Qwen3ASRThinkerTextRotaryEmbedding` for now, but I'm not sure if this is correct. I also changed the "attentions" PyTorch hooks: they were set on `Qwen3ASRThinkerTextAttention`, which is not used at all in the base class (maybe they plan to use it in the future, but I'm not sure), so I've changed them to `Qwen3ASRTextAttention` to get the tests to pass. I've added the entire model and all the tests are passing.
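If a remap turns out to be the right fix, the idea would be something like this (a hedged sketch; the config handling and the choice of "linear" with factor 1.0 are assumptions, not a confirmed equivalence):

```python
# Sketch: remap the legacy rope_type before it is looked up in ROPE_INIT_FUNCTIONS.
rope_scaling = getattr(config, "rope_scaling", None) or {}
rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
if rope_type == "default":
    # assumption: linear scaling with factor 1.0 reproduces the old unscaled "default"
    config.rope_scaling = {"rope_type": "linear", "factor": 1.0}
```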
Add property methods to config
Add base_model_prefix and wrapper method to generation class
…ion weights CLEANUP NEEDED
…n to Qwen3ASRTextAttention, Qwen3ASRThinkerTextAttention is never instantiated and so 'attentions' was not being properly propagated
Fix integration tests
ebezzam
left a comment
Hi @mbtariq82 thanks for working on this integration! I'm doing a small review because I noticed you started a modular file, but aren't making full use of its functionality to generate the configuration, processing, and modeling from existing components in Transformers. I gave some pointers for the configuration and processing but will let you check out the rest for the modeling components.
I encourage reading this page on using modular to contribute models: https://huggingface.co/docs/transformers/en/modular_transformers
And for practical examples you can see other modular files:
- Qwen3OmniMoe, which has a lot of similarity with the ASR model (namely removing the vision modalities): https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py
- A recent Audio LM addition: https://github.com/huggingface/transformers/blob/main/src/transformers/models/glmasr/modular_glmasr.py
Note: I think you will have to use "Asr" instead of "ASR" in your model naming, because the modular script prefers CamelCase.
…gn RoPE position handling with cache_position
Refactor position_ids construction to be fully cache_position-driven and generation-safe.
- Compute batch_size/seq_length from inputs_embeds
- Initialize cache_position when absent
- Build 3D position_ids from cache_position
- Compute rope_deltas once during prefill
- Reuse rope_deltas for subsequent decode steps
Removes the legacy attention_mask-dependent branch that was incompatible with static cache generation. Ensures correct RoPE offsets for multimodal inputs under both dynamic and static cache modes.
I made some big changes in the base model's forward in this commit: 0b3248d. I also removed `get_rope_index`.
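Roughly, the new flow looks like this (a sketch of the steps from the commit message above; the shapes and the `compute_rope_deltas` helper are assumptions for illustration, not the exact PR code):

```python
# cache_position-driven position_ids, safe for both dynamic and static caches
batch_size, seq_length = inputs_embeds.shape[:2]
if cache_position is None:
    past_len = past_key_values.get_seq_length() if past_key_values is not None else 0
    cache_position = torch.arange(past_len, past_len + seq_length, device=inputs_embeds.device)
if position_ids is None:
    # 3D position_ids built from cache_position (one plane per rotary section)
    position_ids = cache_position.view(1, 1, -1).expand(3, batch_size, -1)
    if rope_deltas is None:
        rope_deltas = compute_rope_deltas(...)  # hypothetical helper: computed once during prefill
    # decode steps reuse the prefill offsets
    position_ids = position_ids + rope_deltas.view(1, batch_size, 1)
```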
run-slow: qwen3_asr
ebezzam
left a comment
Self-review on Qwen3 ASR particularities (before potential refactor of Qwen3 audio encoder)
- [bezzam/Qwen3-ASR-1.7B](https://huggingface.co/bezzam/Qwen3-ASR-1.7B)
- [bezzam/Qwen3-ASR-0.6B](https://huggingface.co/bezzam/Qwen3-ASR-0.6B)
- [bezzam/Qwen3-ForcedAligner-0.6B](https://huggingface.co/bezzam/Qwen3-ForcedAligner-0.6B)
TODO: update checkpoints in the end
```python
MODEL_FOR_FORCED_ALIGNMENT_MAPPING_NAMES = OrderedDict(
    [
        ("qwen3_forced_aligner", "Qwen3ASRForForcedAlignment"),
    ]
)
```
How about a new class of forced alignment models?
- Input: audio and text
- Output: timestamps
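If we go that route, the user-facing flow could look something like this (a hedged sketch; the input keys and result format below are assumptions to make the contract concrete, not the final API):

```python
# Hypothetical end-to-end sketch for a forced-alignment task: audio + text in,
# per-word timestamps out.
inputs = processor(text="hello world", audio=waveform, sampling_rate=16000, return_tensors="pt")
outputs = model(**inputs)  # e.g. Qwen3ASRForForcedAlignment
# hypothetical decoded result: one list of word/timestamp dicts per batch item, e.g.
# [[{"word": "hello", "start": 0.12, "end": 0.45}, {"word": "world", "start": 0.50, "end": 0.91}]]
```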
```python
``skip_special_tokens`` is hard-set to ``True`` for ``"parsed"`` and ``"transcription_only"``.
"""
valid_formats = ["raw", "parsed", "transcription_only"]
```
Similar to VibeVoice ASR: different formats for decoding the ASR output.
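A hedged usage sketch of the three formats (the keyword name `output_format` is an assumption based on the snippet above):

```python
# skip_special_tokens is forced to True for the latter two (see docstring above)
raw = processor.batch_decode(generated_ids, output_format="raw")  # full generated text
parsed = processor.batch_decode(generated_ids, output_format="parsed")  # e.g. language + transcription fields
texts = processor.batch_decode(generated_ids, output_format="transcription_only")  # transcription strings only
```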
```python
return [Qwen3ASRProcessor._parse_single_output(raw_text)["transcription"] for raw_text in text]
```

```python
@staticmethod
def _is_cjk_char(char: str) -> bool:
```
From here, I've largely kept many of these methods from the original codebase so that the post-processing leads to equivalent outputs. We can iterate on what we keep, and how, so that it fits Transformers conventions.
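For context, a check like `_is_cjk_char` typically looks as follows (a simplified sketch; the actual method in the PR may cover different Unicode ranges):

```python
@staticmethod
def _is_cjk_char(char: str) -> bool:
    # Simplified sketch covering the most common ranges
    code = ord(char)
    return (
        0x4E00 <= code <= 0x9FFF     # CJK Unified Ideographs
        or 0x3400 <= code <= 0x4DBF  # CJK Extension A
        or 0x3040 <= code <= 0x30FF  # Hiragana / Katakana
        or 0xAC00 <= code <= 0xD7AF  # Hangul syllables
    )
```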
```python
if lang == "japanese":
    try:
        import nagisa
    except ImportError:
        raise ImportError(
            "Japanese forced alignment requires the `nagisa` package. Install it with: pip install nagisa"
        )
    return Qwen3ASRProcessor._clean_tokens(nagisa.tagging(text).words)

if lang == "korean":
    try:
        from soynlp.tokenizer import LTokenizer
    except ImportError:
        raise ImportError(
            "Korean forced alignment requires the `soynlp` package. Install it with: pip install soynlp"
        )
    return Qwen3ASRProcessor._clean_tokens(LTokenizer().tokenize(text))
```
Should we keep such try-imports for Japanese and Korean?
```python
return [int(v) for v in result]
```

```python
def prepare_forced_aligner_inputs(
```
Similar in spirit to `apply_transcription_request`: provide a helper function so the user doesn't need to manually call `apply_chat_template`.
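A hedged usage sketch (the method exists in this PR, but the argument names here are assumptions for illustration):

```python
# Build forced-aligner inputs without manually calling apply_chat_template
inputs = processor.prepare_forced_aligner_inputs(
    text="hello world",
    audio=waveform,  # raw waveform at the processor's expected sampling rate
    return_tensors="pt",
)
outputs = model(**inputs)  # timestamp logits from the forced-aligner head
```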
```python
def decode_forced_alignment(
    self,
    logits,
    input_ids,
    word_lists: list[list[str]],
    timestamp_token_id: int,
    timestamp_segment_time: float | None = None,
) -> list[list[dict]]:
```
Things get a bit unconventional... is it ok to have this separate decode just for forced alignment? Or should forced alignment have its own processor, but then does that mean it should be in its own model folder?
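For reference, a hedged sketch of calling it end to end (following the signature above; the result keys are assumptions):

```python
word_lists = [["hello", "world"]]  # words to align, one list per batch item
alignments = processor.decode_forced_alignment(
    logits=outputs.logits,
    input_ids=inputs["input_ids"],
    word_lists=word_lists,
    timestamp_token_id=timestamp_token_id,  # id of the timestamp placeholder token
)
# alignments: list[list[dict]], e.g. one {"word", "start", "end"}-style dict per word
```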
run-slow: qwen3_asr
Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: qwen3_asr
Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: qwen3_asr
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, qwen3_asr, qwen3_omni_moe
@eustlb ready for review!
- I've defined a new audio encoder for Qwen3 ASR instead of reusing the one from Qwen3OmniMoe. As we saw together, Qwen3OmniMoe's audio encoder had operations which should have been in the feature extractor (and which hurt the torch compile speedup). I've made a new feature extractor object for Qwen3ASR, and as the torch compile example in the doc shows, we now get a speedup of 2.5× 🚀 (when using the encoder from Omni it was 1.7×).
- There are two types of models in this PR: ASR (audio LM approach) and a forced aligner (uses the audio encoder + a classification layer to predict word durations). I'm sure we will iterate on the latter as it's a new type of model 😄 The processor methods can definitely be improved; I left them mostly as-is from the original to get your input on what is Transformers compatible.
Note there are some comments from a previous self-review; you should see them in the "Files changed" tab!
| r""" | ||
| Constructs a Qwen3 ASR feature extractor. | ||
|
|
||
| Extracts 128-bin log-mel features from raw speech, then right-pads the mel time axis to a multiple of ``2 * n_window``. |
Essentially this is the same as Whisper's feature extractor, plus the right-padding and data-dependent ops that were previously done in the audio encoder of Qwen3 Omni MoE.
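The right-padding step itself is simple; a minimal sketch, assuming `mel` has shape `(num_mel_bins, num_frames)`:

```python
import torch.nn.functional as F

def pad_mel_to_multiple(mel, n_window):
    # Right-pad the time axis to a multiple of 2 * n_window, as described above
    multiple = 2 * n_window
    pad = (-mel.shape[-1]) % multiple
    return F.pad(mel, (0, pad)) if pad else mel
```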
```python
@auto_docstring(checkpoint="bezzam/Qwen3-ASR-1.7B")
@strict
class Qwen3ASREncoderConfig(Qwen2_5OmniAudioEncoderConfig):
```
Not using `Qwen3OmniMoeAudioEncoderConfig` because its `conv_chunksize` would be unused now that that logic has moved into the feature extractor.
```python
lengths = torch.where(lengths > 0, (lengths - 1) // 2 + 1, torch.zeros_like(lengths))
return lengths
```

```python
def forward(
```
Override `forward` so it is more compatible with torch compile, and move out what should be in the feature extractor!
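As an aside on the `lengths` formula in the snippet above: `(lengths - 1) // 2 + 1` is just the output length of a stride-2 downsampling step, i.e. `ceil(lengths / 2)`, with `torch.where` keeping zero-length entries at zero:

```python
# (n - 1) // 2 + 1 == ceil(n / 2) for n > 0
for n in (3000, 2999, 2, 1):
    assert (n - 1) // 2 + 1 == -(-n // 2)  # -(-n // 2) is ceil division
```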
```python
super().__init__(config)
self.num_timestamp_bins = config.num_timestamp_bins
self.model = Qwen3ASRModel(config)
self.classifier = nn.Linear(config.text_config.hidden_size, config.num_timestamp_bins, bias=False)
```
Classifier instead of LM head.
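So the head maps hidden states to timestamp-bin logits rather than vocabulary logits; roughly (a hedged sketch, shapes assumed, not the exact PR forward):

```python
hidden_states = self.model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
timestamp_logits = self.classifier(hidden_states)       # (batch, seq_len, num_timestamp_bins)
```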
```python
)
self.layer_idx = layer_idx

self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
```
`Qwen3ASRAudioAttention` requires `bias=True` for `k_proj`, but it's set to `bias=False` here?
Thanks for pointing this out! Looking into it; strange that the integration tests (between Transformers and the original) still produce equivalent outputs 🤔
```bash
# from this branch
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 pytest tests/models/qwen3_asr/test_modeling_qwen3_asr.py::Qwen3ASRForConditionalGenerationIntegrationTest
```
EDIT: actually it makes sense that this doesn't affect the output, because softmax is invariant to adding the same constant to every logit, which is what happens with a key projection bias: for a given query q, a bias b on `k_proj` adds the same q·b to every attention score. Computationally, it's slightly better to have `bias=False` for fewer parameters/memory/operations, but I don't think it makes a big difference.
Right now this line is generated via modular by directly inheriting from Whisper. We'll see during the review process whether we move away from Whisper's definition.
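A quick numerical check of the invariance claimed above (a standalone toy example, not PR code):

```python
import torch

q = torch.randn(8)
keys = torch.randn(5, 8)
b = torch.randn(8)  # a k_proj bias adds the same vector b to every key
scores = keys @ q
scores_biased = (keys + b) @ q  # shifts every score by the same constant (b @ q)
assert torch.allclose(torch.softmax(scores, -1), torch.softmax(scores_biased, -1), atol=1e-5)
```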
If the bias is small enough or the general distribution is not influenced, you could have just gotten lucky :D
What does this PR do?
This PR adds Qwen3-ASR to the Transformers library.
Fixes #43837
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Proposal to add Qwen3-ASR support #43837
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.