
Qwen3 ASR and Forced Aligner#43838

Open
mbtariq82 wants to merge 100 commits into huggingface:main from mbtariq82:qwen3-asr

Conversation

@mbtariq82

@mbtariq82 mbtariq82 commented Feb 8, 2026

What does this PR do?

This PR adds Qwen3-ASR to the Transformers library.

Fixes #43837

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@vasqu
Contributor

vasqu commented Feb 9, 2026

I think this is a good one to add! cc @ebezzam @eustlb

@vasqu vasqu added the Audio label Feb 9, 2026
@ebezzam
Contributor

ebezzam commented Feb 9, 2026

@mbtariq82 thanks for opening the PR! We're definitely interested in adding this model and were planning to work on it ourselves. Could you go ahead with the rest of the model? I can iterate with you on it.

I see you started with a modular file which is great. Below are some pointers of recent audio LM models that may help you with the other files / get an idea of our conventions:

thanks 🤗

@mbtariq82
Author

I'm struggling to get test_apply_chat_template_audio from test_processing_common.py to pass.

Specifically, the final part of the test, where we call `apply_chat_template` with `continue_final_message=True`, fails with `ValueError: continue_final_message is set but the final message does not appear in the chat after applying the chat template! ...`

I've verified that the chat_template is being correctly loaded from the model checkpoint: Qwen/Qwen3-ASR-0.6B.

According to ChatGPT, the chat_template provided by Qwen does not correctly render the final assistant message, so I think the only way to solve this is to override the `apply_chat_template` method and add some custom logic before calling `super().apply_chat_template`?

@ebezzam
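The failing check can be sketched in isolation (a toy stand-in for the Jinja chat template, not the actual Transformers internals): if the template emits anything after the final assistant message, that message is no longer a suffix of the rendered prompt, which is exactly the condition `continue_final_message` relies on.

```python
# Toy stand-in for a chat template; names and token strings are
# illustrative assumptions, not the real Qwen template.
def render(messages, append_eos_after_final=True):
    out = ""
    for m in messages:
        out += f"<|{m['role']}|>{m['content']}"
        # A template that always closes the turn breaks continuation.
        if append_eos_after_final or m is not messages[-1]:
            out += "<|im_end|>"
    return out

messages = [
    {"role": "user", "content": "transcribe this"},
    {"role": "assistant", "content": "partial transcri"},
]

ok = render(messages, append_eos_after_final=False)
bad = render(messages, append_eos_after_final=True)

# continue_final_message needs the final message text at the very end:
print(ok.endswith("partial transcri"))   # True -> continuation possible
print(bad.endswith("partial transcri"))  # False -> the ValueError above
```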

@ebezzam
Contributor

ebezzam commented Feb 11, 2026

@mbtariq82 if it's only about getting the test to pass and the model is behaving as expected, let's avoid overwriting `apply_chat_template`.

For now you can leave the test failing, and if necessary we can override or even skip the test later on.

@ebezzam
Contributor

ebezzam commented Feb 11, 2026

When you finish the modeling, and integration tests that produce outputs equivalent to the original (e.g. as for Audio Flamingo), I can already take a look to give some feedback! And we can revisit the chat-template test after that.

Create integration test

Setup Qwen3ASRModelTester
@mbtariq82
Author

mbtariq82 commented Feb 15, 2026

So between the current version and v4.57.6, the `"default"` key was removed from `ROPE_INIT_FUNCTIONS`. Qwen3-ASR was built using v4.57.6, and the checkpoint uses the `"default"` key. I've changed the `rope_type` to `"linear"` in `Qwen3ASRThinkerTextRotaryEmbedding` for now, but I'm not sure if this is correct.
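For context, a minimal sketch of the relationship between the two rope_type values (assuming Transformers' convention that "linear" scaling divides the default inverse frequencies by a scaling factor, so factor=1.0 reproduces "default"):

```python
# Hedged sketch, not the Transformers source: under the assumed
# convention, "linear" with factor=1.0 is numerically identical
# to the removed "default" initialization.

def default_inv_freq(dim, base=10000.0):
    # inv_freq[i] = base^(-2i/dim) for the first dim//2 frequencies
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def linear_inv_freq(dim, factor, base=10000.0):
    # Linear scaling divides every inverse frequency by `factor`,
    # equivalent to dividing position ids by `factor`.
    return [f / factor for f in default_inv_freq(dim, base)]

dim = 8
print(default_inv_freq(dim) == linear_inv_freq(dim, factor=1.0))  # True
```

If that assumption holds, swapping `"default"` for `"linear"` with the checkpoint's implicit factor of 1.0 would be a no-op numerically.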

I also changed the "attentions" PyTorch hooks: they were set on `Qwen3ASRThinkerTextAttention`, which is never used in the base class (maybe they plan to use it in the future, but I'm not sure). I've changed them to `Qwen3ASRTextAttention` to get the tests to pass.

I've added the entire model. All the tests are passing.

@mbtariq82 mbtariq82 changed the title Proposal to add Qwen3-ASR support Proposal to add Qwen3-ASR support [WIP] Feb 15, 2026
Add property methods to config

Add base_model_prefix and wrapper method to generation class
…n to Qwen3ASRTextAttention, Qwen3ASRThinkerTextAttention is never instantiated and so 'attentions' was not being properly propagated

Fix integration tests
Contributor

@ebezzam ebezzam left a comment


Hi @mbtariq82, thanks for working on this integration! I'm doing a small review because I noticed you started a modular file but aren't making full use of its ability to generate the configuration, processing, and modeling code from existing components in Transformers. I gave some pointers for the configuration and processing, and will let you apply the same approach to the modeling components.

I encourage reading this page on using modular to contribute models: https://huggingface.co/docs/transformers/en/modular_transformers

And for practical examples you can see other modular files:

Note: I think you will have to use "Asr" instead of "ASR" in your model naming because the modular script prefers CamelCase.

Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
…gn RoPE position handling with cache_position

 Refactor position_ids construction to be fully cache_position-driven and generation-safe.
 - Compute batch_size/seq_length from inputs_embeds
 - Initialize cache_position when absent
 - Build 3D position_ids from cache_position
 - Compute rope_deltas once during prefill
 - Reuse rope_deltas for subsequent decode steps
 Removes legacy attention_mask-dependent branch that was incompatible with static cache generation.
 Ensures correct RoPE offsets for multimodal inputs under both dynamic and static cache modes.
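A simplified, text-only sketch of the refactor described above (the real model adds multimodal offsets on top of this; names are illustrative, not from the PR):

```python
# Hedged sketch of cache_position-driven position_ids. In the
# text-only case each RoPE "plane" (e.g. temporal/height/width)
# simply reuses cache_position, with no multimodal deltas.

def build_position_ids(cache_position, n_planes=3):
    return [list(cache_position) for _ in range(n_planes)]

# Prefill: cache_position covers the whole prompt.
prefill = build_position_ids(range(5))
# Decode step: cache_position is a single index past the cached length.
decode = build_position_ids([5])

print(prefill[0])  # [0, 1, 2, 3, 4]
print(decode[0])   # [5]
```

The point of driving everything from cache_position is that the same code path works for dynamic and static caches, where attention_mask-derived lengths can be misleading.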
@mbtariq82
Author

I made some big changes in the base model's forward in this commit: 0b3248d. I also removed get_rope_index.

Comment thread tests/models/qwen3_asr/test_modeling_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
@ebezzam
Contributor

ebezzam commented Apr 22, 2026

run-slow: qwen3_asr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qwen3_asr"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 94f2967f workflow commit (merge commit)
PR 62d80ea4 branch commit (from PR)
main 8a3b7b04 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@ebezzam ebezzam changed the title Proposal to add Qwen3-ASR support [WIP] Qwen3 ASR and Forced Aligner Apr 22, 2026
Contributor

@ebezzam ebezzam left a comment


Self-review on Qwen3 ASR particularities (before potential refactor of Qwen3 audio encoder)

Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment thread src/transformers/models/qwen3_asr/modular_qwen3_asr.py Outdated
Comment on lines +31 to +33
- [bezzam/Qwen3-ASR-1.7B](https://huggingface.co/bezzam/Qwen3-ASR-1.7B)
- [bezzam/Qwen3-ASR-0.6B](https://huggingface.co/bezzam/Qwen3-ASR-0.6B)
- [bezzam/Qwen3-ForcedAligner-0.6B](https://huggingface.co/bezzam/Qwen3-ForcedAligner-0.6B)
Contributor


TODO: update checkpoints in the end

Comment thread src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py Outdated
Comment on lines +1843 to +1847
```python
MODEL_FOR_FORCED_ALIGNMENT_MAPPING_NAMES = OrderedDict(
    [
        ("qwen3_forced_aligner", "Qwen3ASRForForcedAlignment"),
    ]
)
```
Contributor


How about a new class of forced alignment models?

  • Input: audio and text
  • Output: timestamps


```python
``skip_special_tokens`` is hard-set to ``True`` for ``"parsed"`` and ``"transcription_only"``.
"""
valid_formats = ["raw", "parsed", "transcription_only"]
```
Contributor


Similar to VibeVoice ASR, different formats for decoding the ASR output

```python
    return [Qwen3ASRProcessor._parse_single_output(raw_text)["transcription"] for raw_text in text]

@staticmethod
def _is_cjk_char(char: str) -> bool:
```
Contributor


From here on, I've largely kept these methods from the original codebase so that the post-processing leads to equivalent outputs. We can iterate on what we keep and how, so that it fits Transformers conventions.
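As an illustration of the kind of helper kept from the original, here is a hypothetical re-implementation of a CJK character check (the specific Unicode ranges are my assumption, not copied from the PR); such processors typically use it to decide between word-level and character-level splitting:

```python
# Hypothetical sketch: test whether a character falls in common
# CJK Unicode blocks. Ranges chosen for illustration only.
def is_cjk_char(char: str) -> bool:
    code = ord(char)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= code <= 0x4DBF   # CJK Extension A
        or 0x3040 <= code <= 0x30FF   # Hiragana + Katakana
        or 0xAC00 <= code <= 0xD7AF   # Hangul Syllables
    )

print(is_cjk_char("漢"))  # True
print(is_cjk_char("a"))   # False
```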

Comment on lines +372 to +388
```python
if lang == "japanese":
    try:
        import nagisa
    except ImportError:
        raise ImportError(
            "Japanese forced alignment requires the `nagisa` package. Install it with: pip install nagisa"
        )
    return Qwen3ASRProcessor._clean_tokens(nagisa.tagging(text).words)

if lang == "korean":
    try:
        from soynlp.tokenizer import LTokenizer
    except ImportError:
        raise ImportError(
            "Korean forced alignment requires the `soynlp` package. Install it with: pip install soynlp"
        )
    return Qwen3ASRProcessor._clean_tokens(LTokenizer().tokenize(text))
```
Contributor

@ebezzam ebezzam Apr 22, 2026


Should we keep such try-imports for Japanese and Korean?


```python
    return [int(v) for v in result]

def prepare_forced_aligner_inputs(
```
Contributor


Similar in spirit to apply_transcription_request: provide a helper function so the user doesn't need to manually call apply_chat_template

Comment on lines +531 to +538
```python
def decode_forced_alignment(
    self,
    logits,
    input_ids,
    word_lists: list[list[str]],
    timestamp_token_id: int,
    timestamp_segment_time: float | None = None,
) -> list[list[dict]]:
```
Contributor


Things get a bit unconventional... is it ok to have this separate decode just for forced alignment? Or should forced alignment have its own processor, but then does that mean it should be in its own model folder?

@ebezzam
Contributor

ebezzam commented Apr 23, 2026

run-slow: qwen3_asr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qwen3_asr"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 5172a96d workflow commit (merge commit)
PR e0d751e6 branch commit (from PR)
main 5cf79514 base commit (on main)

Model CI Report

1 new failed tests from this PR 😭

  • qwen3_asr:
    tests/models/qwen3_asr/test_modeling_qwen3_asr.py::Qwen3ASRForConditionalGenerationModelTest::test_sdpa_can_dispatch_on_flash (✅ ⟹ ❌)

@ebezzam
Contributor

ebezzam commented Apr 24, 2026

run-slow: qwen3_asr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qwen3_asr"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 91eecc12 workflow commit (merge commit)
PR 9b582c03 branch commit (from PR)
main 5cf79514 base commit (on main)

Model CI Report

1 new failed tests from this PR 😭

  • qwen3_asr:
    tests/models/qwen3_asr/test_modeling_qwen3_asr.py::Qwen3ASRForConditionalGenerationModelTest::test_torch_export (✅ ⟹ ❌)

@ebezzam
Contributor

ebezzam commented Apr 24, 2026

run-slow: qwen3_asr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qwen3_asr"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 75ad79fd workflow commit (merge commit)
PR 81b8bba5 branch commit (from PR)
main a66638d8 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, qwen3_asr, qwen3_omni_moe

Contributor

@ebezzam ebezzam left a comment


@eustlb ready for review!

  1. I've defined a new audio encoder for Qwen3 ASR instead of reusing the one from Qwen3OmniMoe. As we saw together, Qwen3OmniMoe's audio encoder had operations that should have been in the feature extractor (and which hurt the torch compile speedup). I made a new feature extractor for Qwen3ASR, and if you run the torch compile example in the doc, we now get a speedup of 2.5 🚀 (when using the encoder from Omni it was 1.7).
  2. There are two types of model in this PR: ASR (audio LM approach) and a forced aligner (which uses the audio encoder plus a classification layer to predict word durations). I'm sure we will iterate on the latter as it's a new type of model 😄 The processor methods can definitely be improved; I left them mainly as-is from the original to get your input on what is Transformers-compatible.

Note there are some comments from a previous self-review; you should see them in the "Files changed" tab!

```python
r"""
Constructs a Qwen3 ASR feature extractor.

Extracts 128-bin log-mel features from raw speech, then right-pads the mel time axis to a multiple of ``2 * n_window``.
```
Contributor


Essentially this is the same as Whisper's feature extractor, plus the right-padding and data-dependent ops that were previously done in the audio encoder of Qwen3 Omni MoE.
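A sketch of the right-padding arithmetic described in the docstring (assumed semantics: pad the mel time axis up to the next multiple of 2 * n_window; the function name is illustrative):

```python
# Hedged sketch: compute the padded time-axis length so that
# downstream windowed ops see a whole number of 2*n_window blocks.
def padded_length(T: int, n_window: int) -> int:
    block = 2 * n_window
    return T + (-T) % block  # amount of right-padding is (-T) % block

print(padded_length(100, 50))  # 100 (already a multiple of 100)
print(padded_length(101, 50))  # 200 (padded up to next multiple)
```

Doing this statically in the feature extractor, rather than data-dependently in the encoder, is what keeps the encoder friendly to torch.compile.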


```python
@auto_docstring(checkpoint="bezzam/Qwen3-ASR-1.7B")
@strict
class Qwen3ASREncoderConfig(Qwen2_5OmniAudioEncoderConfig):
```
Contributor


Not using `Qwen3OmniMoeAudioEncoderConfig` because it has an unused `conv_chunksize` now that the chunking has moved into the feature extractor.

```python
    lengths = torch.where(lengths > 0, (lengths - 1) // 2 + 1, torch.zeros_like(lengths))
    return lengths

def forward(
```
Contributor


Overwrite `forward` so it is more compatible with torch compile, and move out what belongs in the feature extractor!
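The length formula in the excerpt above can be sanity-checked in isolation (a plain-Python sketch of the assumed stride-2 downsampling semantics):

```python
# Hedged sketch: for n > 0 the formula (n - 1) // 2 + 1 is the
# ceiling of n / 2, i.e. the output length of a stride-2 op that
# keeps a partial final frame; zero-length inputs stay zero.
def downsampled_length(n: int) -> int:
    return (n - 1) // 2 + 1 if n > 0 else 0

for n in [0, 1, 2, 3, 4, 100]:
    print(n, downsampled_length(n))  # 0->0, 1->1, 2->1, 3->2, 4->2, 100->50
```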

```python
super().__init__(config)
self.num_timestamp_bins = config.num_timestamp_bins
self.model = Qwen3ASRModel(config)
self.classifier = nn.Linear(config.text_config.hidden_size, config.num_timestamp_bins, bias=False)
```
Contributor


Classifier instead of lm head

```python
    )
    self.layer_idx = layer_idx

    self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
```


`Qwen3ASRAudioAttention` requires `bias=True` for `k_proj`, but it is set to `bias=False` here?

Contributor

@ebezzam ebezzam Apr 30, 2026


thanks for pointing this out! looking into it, strange that the integration tests (between Transformers and original) still produce equivalent outputs 🤔

```shell
# from this branch
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 pytest tests/models/qwen3_asr/test_modeling_qwen3_asr.py::Qwen3ASRForConditionalGenerationIntegrationTest
```

EDIT: actually it makes sense that it doesn't affect the output, because softmax is invariant to adding the same constant to every logit, which is what happens when a key projection bias is added. Computationally, it's slightly better to have bias=False for fewer parameters/memory/operations, but I don't think it makes a big difference.

Right now this line is generated via modular by directly inheriting from Whisper. We'll see during the review process if we move away from Whisper's definition.
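A quick numerical sketch of the invariance argument: for a fixed query q, a shared key bias b adds the same constant q·b to every attention logit, and softmax ignores constant shifts.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.5, 1.25, -0.75]
shift = 3.1  # plays the role of q·b, identical for every key
shifted = [x + shift for x in logits]

a = softmax(logits)
b = softmax(shifted)
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True
```

Note this only covers the attention weights; a key bias also never reaches the value path, which is why the outputs match end to end.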

Contributor


If the bias is small enough or the general distribution is not influenced, you could have just gotten lucky :D



Development

Successfully merging this pull request may close these issues.

Proposal to add Qwen3-ASR support

8 participants