mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)#19441

Merged
ngxson merged 15 commits into ggml-org:master from ngxson:xsn/qwen3a on Apr 12, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Feb 8, 2026

Status:

  • qwen3-omni-moe working (vision + audio input)
  • qwen3-asr working

@github-actions github-actions Bot added examples python python script changes labels Feb 8, 2026
@samshipengs

Any updates on this?

@yimlin

yimlin commented Mar 16, 2026

any updates?

QuentinFuxa pushed a commit to QuentinFuxa/llama.cpp that referenced this pull request Mar 18, 2026
Add support for Qwen3-ASR-1.7B model (Qwen3ASRForConditionalGeneration):
- New QWEN3A projector type for audio-only ASR models
- Conv2d encoder (3 layers, stride=2 each, 8x time downsampling)
- Whisper-like transformer encoder (24 layers)
- MLP projector: Linear(1024,1024) -> GELU -> Linear(1024,2048)
- Conversion tested: both mmproj and decoder GGUF files work
- Basic inference tested: model loads, encodes audio, generates output

Based on PR ggml-org#19441 by ngxson (WIP qwen3 audio),
adapted for Qwen3-ASR-only architecture (no vision, no deepstack).
Our attention extraction API (llama_set_attn_heads/llama_get_attn_ith) is untouched.
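The shapes described in this commit message (three stride-2 conv layers giving roughly 8x time downsampling, then a 1024 → 1024 → 2048 MLP projector) can be sanity-checked with a small sketch. The layer count and dimensions come from the commit message above; the kernel size, padding, and helper names are illustrative assumptions:

```python
# Sanity-check sketch of the Qwen3-ASR encoder front-end shapes described
# above. Three stride-2 conv layers -> ~8x time downsampling. Kernel=3,
# pad=1 are assumptions; only the stride and layer count come from the
# commit message.

def conv_out_len(n: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Output length of one conv layer over n frames (standard formula)."""
    return (n + 2 * pad - kernel) // stride + 1

def encoder_frames(n_mel_frames: int) -> int:
    """Frames remaining after the three stride-2 conv layers."""
    n = n_mel_frames
    for _ in range(3):
        n = conv_out_len(n)
    return n

# 3000 mel frames (a 30 s chunk at a 10 ms hop) -> 3000 / 8 = 375 frames
print(encoder_frames(3000))  # 375
```

Each encoder output frame then passes through the Linear(1024,1024) → GELU → Linear(1024,2048) projector to match the decoder's embedding width.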
@michoecho

michoecho commented Mar 25, 2026

I wrote working Qwen3-ASR support for my own use at https://github.com/michoecho/llama.cpp/commits/qwen3_asr_support (I successfully used it to transcribe some lectures in Chinese). I don't know if it's good enough for upstreaming, because I wasn't thinking about qwen3-omni at all (I have no idea what "deepstack" is). But you could use it as a working base if you are getting wrong results.

At a glance, what mainly seems to be missing from this PR is:

  • To make my changes work properly, I had to fix a preexisting bug in whisper preprocessing which was causing the last audio chunk to be lost during: michoecho@63b3c1e#diff-a027f93a5e0a3fe643975f0ae176db52a3330a9422857b4f6fd9bfbac134c863R384. (I haven't reported it because I'm not 100% sure it's a bug — maybe I'm not seeing something — but I'm 99% sure).
  • The ggml_permute seems to have channels and frames swapped around.
  • Qwen3-ASR uses <|audio_start|> and <|audio_end|> instead of <|audio_bos|> and <|audio_eos|>.
  • The audio encoder expects windowed (/chunked) attention, with window size between 1s and 8s. If you run the encoder with full attention on a 30s chunk, you will get bogus results. I didn't want to implement windowed attention (because I would have to learn how to do that), so I just solved it in the preprocessing layer by splitting audio into 8s chunks instead of 30s chunks.
  • As a comment in this PR acknowledges, the reference implementation runs the conv2d layers on chunks of length 100. I followed the reference implementation. I don't know either if this chunking is necessary. I didn't test the non-chunked variant.
  • The default chat template of Qwen3-ASR doesn't work with llama.cpp. (It expects the audio to be passed via some special params). (My fork doesn't care about chat templates at all either, because my application constructs the prompt directly anyway. But if you want to implement a chat template, the prompt expected by the model isn't anything fancy, it's basically just chatml with audio used as the user message).
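The preprocessing workaround from the windowed-attention bullet (and the lost-last-chunk bug from the first bullet) can be sketched in isolation. The 8 s window and 16 kHz rate follow the comment above; the function name and everything else is illustrative:

```python
# Split raw audio into fixed-size windows for the encoder, keeping the
# final partial window. The preexisting bug mentioned above was
# equivalent to truncating the input to a multiple of the window size,
# which dropped the trailing chunk. 8 s at 16 kHz per the comment.

def split_audio(samples: list[float], sample_rate: int = 16000,
                window_s: float = 8.0) -> list[list[float]]:
    step = int(sample_rate * window_s)
    # range(..., step) naturally includes the trailing partial chunk
    return [samples[i:i + step] for i in range(0, len(samples), step)]

audio = [0.0] * (16000 * 20)               # 20 s of audio
chunks = split_audio(audio)
print([len(c) // 16000 for c in chunks])   # [8, 8, 4] -- last 4 s kept
```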

By the way, note that Qwen3-ForcedAligner (the timestamp predictor model) has the same architecture as Qwen3-ASR, so if you implement support for the latter, you almost get support for the former too. "Almost" because the ForcedAligner is a non-autoregressive classification model. (You put in the encoded audio and the transcribed text with some <timestamp> tokens mixed in, then you run a single prediction on it, and the logits on the <timestamp> tokens will describe the timestamp at those points in the text). I'm not sure how to integrate something like that with llama.cpp's abstractions. For my private use case (generating subtitles for the Chinese lectures) I added "support" for Qwen3-ForcedAligner too, but it's too hacky to post.
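The single-pass aligner readout described above could look roughly like the following pure-Python mock. The token ids, the `<timestamp>` id, and the argmax-over-time-bins readout are all illustrative assumptions, not the model's actual interface:

```python
# Mock of the non-autoregressive ForcedAligner readout described above:
# one forward pass over text with <timestamp> tokens mixed in, then read
# the logits only at the <timestamp> positions. All ids and values here
# are made up for illustration.

TIMESTAMP_ID = 99  # hypothetical id of the <timestamp> token

def read_timestamps(tokens: list[int],
                    logits: list[list[float]]) -> list[int]:
    """For each <timestamp> token, take the argmax over its logit row
    (each class index standing in for a time bin)."""
    out = []
    for pos, tok in enumerate(tokens):
        if tok == TIMESTAMP_ID:
            row = logits[pos]
            out.append(max(range(len(row)), key=row.__getitem__))
    return out

tokens = [10, 99, 11, 12, 99, 13]        # text with <timestamp> mixed in
logits = [[0.0] * 4 for _ in tokens]
logits[1][2] = 1.0                       # first timestamp -> bin 2
logits[4][3] = 1.0                       # second timestamp -> bin 3
print(read_timestamps(tokens, logits))   # [2, 3]
```

The awkward fit with llama.cpp is that this is a single non-autoregressive prediction over a fixed prompt, not token-by-token decoding.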

@ngxson ngxson changed the title mtmd: (WIP) qwen3 audio support mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) Apr 1, 2026
@ngxson ngxson marked this pull request as ready for review April 1, 2026 23:02
@ngxson ngxson requested review from a team and CISC as code owners April 1, 2026 23:02
@ngxson
Contributor Author

ngxson commented Apr 1, 2026

Both qwen3-omni and qwen3-asr are working with this PR; GGUFs will be uploaded shortly.

@ngxson
Contributor Author

ngxson commented Apr 1, 2026

* To make my changes work properly, I had to fix a preexisting bug in whisper preprocessing which was causing the last audio chunk to be lost during: [michoecho@63b3c1e#diff-a027f93a5e0a3fe643975f0ae176db52a3330a9422857b4f6fd9bfbac134c863R384](https://github.com/michoecho/llama.cpp/commit/63b3c1ec0cb1f73f4cb3a7056ae7356b413452f2#diff-a027f93a5e0a3fe643975f0ae176db52a3330a9422857b4f6fd9bfbac134c863R384). (I haven't reported it because I'm not 100% sure it's a bug — maybe I'm not seeing something — but I'm 99% sure).

Chunking can be implemented in a follow-up PR; this PR processes the input as a single 30 s chunk for simplicity.

* Qwen3-ASR uses `<|audio_start|>` and `<|audio_end|>` instead of `<|audio_bos|>` and `<|audio_eos|>`.

Thanks for pointing that out; that needs to be fixed in this PR.

* The default chat template of Qwen3-ASR doesn't work with llama.cpp. (It expects the audio to be passed via some special params). (My fork doesn't care about chat templates at all either, because my application constructs the prompt directly anyway. But if you want to implement a chat template, the prompt expected by the model isn't anything fancy, it's basically just `chatml` with audio used as the user message).

That was fixed by simply pushing a ChatML Jinja template to the GGUF upon conversion.
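A prompt in that shape is plain ChatML with the audio as the user turn. A minimal sketch, where the `<|audio_start|>`/`<|audio_end|>` markers follow the earlier comment and the placeholder between them is purely illustrative (the real conversion may emit a different marker):

```python
# Minimal ChatML-style ASR prompt per the discussion above. At runtime,
# mtmd substitutes encoded audio embeddings for the audio span; the
# marker strings follow the earlier comment and the placeholder text
# between them is illustrative only.

def asr_prompt(instruction: str = "Transcribe the audio.") -> str:
    return (
        "<|im_start|>user\n"
        "<|audio_start|>...audio...<|audio_end|>\n"
        f"{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(asr_prompt())
```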

By the way, note that Qwen3-ForcedAligner (the timestamp predictor model) has the same architecture as Qwen3-ASR, so if you implement support for the latter, you almost get support for the former too. "Almost" because the ForcedAligner is a non-autoregressive classification model. (You put in the encoded audio and the transcribed text with some <timestamp> tokens mixed in, then you run a single prediction on it, and the logits on the <timestamp> tokens will describe the timestamp at those points in the text). I'm not sure how to integrate something like that with llama.cpp's abstractions. For my private use case (generating subtitles for the Chinese lectures) I added "support" for Qwen3-ForcedAligner too, but it's too hacky to post.

Hmm, yeah, that sounds complicated; we'll see if it's worth implementing, given that another model (Voxtral from Mistral) has somewhat similar logic.

Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py
Comment on lines +5033 to +5036
```python
if "thinker_config" in self.hparams:
    vision_config = self.hparams["thinker_config"].get("vision_config", {})
else:
    vision_config = self.hparams.get("vision_config", {})
```
Member


Instead of handling this everywhere, can't we just merge in all sub-configs in thinker_config here:

```python
if "thinker_config" in config:
    # rename for Qwen2.5-Omni
    config["text_config"] = config["thinker_config"]["text_config"]
```

Contributor Author


Hmm, that can be quite dangerous, because a sub-config may have conflicting keys with the thinker_config.

I think it's fine to keep this as-is (a bit lazy to re-test this). plus, we only have one single place in the whole file that does this.

For reference: normally a text model never has to read the vision config, but this is a specific case for qwen3, to support "deep stack". From qwen3.5 on, they removed the deep stack.

Comment thread convert_hf_to_gguf.py Outdated
ngxson and others added 3 commits April 12, 2026 14:33
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@ngxson ngxson requested a review from CISC April 12, 2026 14:48
Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py
Comment thread convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC
Member

CISC commented Apr 12, 2026

Uhoh, I think \r\n strikes again...

@ngxson
Contributor Author

ngxson commented Apr 12, 2026

hmm seems like it mostly happens when you add a new line in the suggestion, right?

@CISC
Member

CISC commented Apr 12, 2026

hmm seems like it mostly happens when you add a new line in the suggestion, right?

Yes, but only when you (and one other person so far) commit them. :)

@ngxson
Contributor Author

ngxson commented Apr 12, 2026

hmm, the conversion is broken somehow, the omni mmproj now only contains audio tensors, not vision tensors

digging into this...

@ngxson
Contributor Author

ngxson commented Apr 12, 2026

OK, it got a bit messy with multiple inheritance, but I fixed it in the last commit.

CC @pwilkin if you can give the 2nd approval, thanks!

@ngxson
Contributor Author

ngxson commented Apr 12, 2026

pinging @ggml-org/maintainers if someone can give an approval, thanks!

@ngxson ngxson merged commit 21a4933 into ggml-org:master Apr 12, 2026
51 checks passed
@bennmann

no audio output for the nice voices?

crodjer added a commit to crodjer/llama.cpp that referenced this pull request Apr 13, 2026
* origin/master:
  webui: MCP Diagnostics improvements (ggml-org#21803)
  Remove extra conditional check on debug mode. (ggml-org#21798)
  sycl: disable Q1_0 in backend and cleanup unused variables (ggml-org#21807)
  mtmd: fix crash when sending image under 2x2 pixels (ggml-org#21711)
  mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (ggml-org#19441)
  convert : force f16 or f32 on step3-vl conv weights (ggml-org#21646)
  mtmd: add gemma 4 test (vision + audio) [no ci] (ggml-org#21806)
  mtmd: add Gemma 4 audio conformer encoder support (ggml-org#21421)
  fix: Proper messages rendering for "Show raw output" (ggml-org#21672)
  docs: add guide on how to add multimodal support (ggml-org#21778)
@pwilkin
Member

pwilkin commented Apr 13, 2026

@ngxson sorry, somehow missed this one.

HermestoAizales pushed a commit to HermestoAizales/llama.cpp that referenced this pull request Apr 13, 2026
* add qwen3a

* wip

* vision ok

* no more deepstack for audio

* convert ASR model ok

* qwen3 asr working

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix bad merge

* fix multi inheritance

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@erazortt

no audio output for the nice voices?

Yes, that would also be of interest to me. Is audio output planned?

@pablopla

Can you show an example of how to use it? I'm getting a short, low-quality transcription when transcribing a 1-minute audio file.
The qwen-asr online demo works fine.
Should I use a prompt and set specific parameters?
Should I send the request to "/v1/audio/transcriptions" or "/v1/chat/completions"?

ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026