mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)#19441
ngxson merged 15 commits into ggml-org:master
Conversation
Any updates on this?

any updates?
Add support for Qwen3-ASR-1.7B model (Qwen3ASRForConditionalGeneration):

- New QWEN3A projector type for audio-only ASR models
- Conv2d encoder (3 layers, stride=2 each, 8x time downsampling)
- Whisper-like transformer encoder (24 layers)
- MLP projector: Linear(1024,1024) -> GELU -> Linear(1024,2048)
- Conversion tested: both mmproj and decoder GGUF files work
- Basic inference tested: model loads, encodes audio, generates output

Based on PR ggml-org#19441 by ngxson (WIP qwen3 audio), adapted for the Qwen3-ASR-only architecture (no vision, no deepstack). Our attention extraction API (llama_set_attn_heads/llama_get_attn_ith) is untouched.
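The "8x time downsampling" from three stride-2 conv layers can be checked with the standard conv output-length formula. A minimal sketch — the stride-2, 3-layer setup comes from the commit message above; kernel size 3 and padding 1 are assumptions (typical for Whisper-style front ends):

```python
# Sketch of the ~8x time downsampling produced by three stride-2 conv
# layers. kernel=3 and padding=1 are assumptions; only stride=2 x 3
# layers is stated in the PR.

def conv_out_len(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard conv output-length formula."""
    return (n + 2 * padding - kernel) // stride + 1

def encoder_frames(n_mel_frames: int) -> int:
    """Frames remaining after the 3-layer stride-2 conv stack."""
    for _ in range(3):
        n_mel_frames = conv_out_len(n_mel_frames)
    return n_mel_frames

# 30 s of audio at a 100 Hz mel frame rate -> 3000 frames -> 375 after ~8x.
print(encoder_frames(3000))  # -> 375
```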
I wrote working Qwen3-ASR support for my own use at https://github.com/michoecho/llama.cpp/commits/qwen3_asr_support (I successfully used it to transcribe some lectures in Chinese). I don't know if it's good enough for upstreaming, because I wasn't thinking about qwen3-omni at all (I have no idea what "deepstack" is). But you could use it as a working base if you are getting wrong results. At a glance, what mainly seems to be missing from this PR is:
By the way, note that Qwen3-ForcedAligner (the timestamp predictor model) has the same architecture as Qwen3-ASR, so if you implement support for the latter, you almost get support for the former too. "Almost" because the ForcedAligner is a non-autoregressive classification model. (You put in the encoded audio and the transcribed text with some
both qwen3-omni and qwen3-asr are working with this PR, GGUF will be uploaded shortly
Chunking can be implemented via a follow-up PR; this PR processes the input as a single 30s chunk for simplicity
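The follow-up chunking could look roughly like this: split the decoded PCM buffer into fixed 30 s windows and feed each one through the audio encoder in turn. A minimal sketch — the 16 kHz sample rate is an assumption (mtmd's usual audio input rate), and `chunk_audio` is a hypothetical helper, not part of this PR:

```python
# Hypothetical sketch of 30 s chunking for long audio inputs.
# Assumes mono PCM at 16 kHz; the last chunk may be shorter.

def chunk_audio(samples: list, sr: int = 16000, chunk_s: int = 30):
    """Yield consecutive fixed-length chunks of a sample buffer."""
    step = sr * chunk_s
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 75 s of audio -> 30 s, 30 s and 15 s chunks.
lengths = [len(c) for c in chunk_audio([0.0] * (16000 * 75))]
print(lengths)  # -> [480000, 480000, 240000]
```

Each chunk would then be encoded and its embeddings appended to the decoder context, which is the main design question a follow-up PR would need to settle (context carry-over between chunks).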
Thanks for pointing that out, it needs to be fixed in this PR
That was fixed by simply pushing a chatml jinja template to the GGUF upon conversion
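Concretely, "pushing a chatml template" means writing the standard ChatML jinja template into the GGUF metadata key that llama.cpp reads (`tokenizer.chat_template`). A minimal sketch — the dict below stands in for the GGUF KV store; the real conversion script writes this via gguf-py rather than a plain dict:

```python
# Sketch of embedding a ChatML jinja template as GGUF metadata.
# gguf_kv is a stand-in for the converter's metadata writer.

CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

gguf_kv = {}
gguf_kv["tokenizer.chat_template"] = CHATML_TEMPLATE
print("<|im_start|>" in gguf_kv["tokenizer.chat_template"])  # -> True
```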
Hmm yeah that sounds complicated, will see if it's worth implementing, given that another model (voxtral from mistral) has somewhat similar logic
```python
if "thinker_config" in self.hparams:
    vision_config = self.hparams["thinker_config"].get("vision_config", {})
else:
    vision_config = self.hparams.get("vision_config", {})
```
Instead of handling this everywhere, can't we just merge in all sub-configs in thinker_config here:
(llama.cpp/convert_hf_to_gguf.py, lines 974 to 976 at eefcfee)
hmm, that can be quite dangerous because a sub-config may have keys that conflict with the thinker_config
I think it's fine to keep this as-is (a bit lazy to re-test this). Plus, there is only one place in the whole file that does this.
for ref, normally a text model never has to read the vision config, but this is a special case for qwen3 to support "deep stack". From qwen3.5 on, they removed deep stack
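The conflict the review worries about is easy to illustrate: naively flattening thinker_config's sub-configs into the top-level hparams lets a sub-config key silently overwrite the text model's value. A hypothetical example with made-up config values:

```python
# Illustration of the key-conflict risk of merging sub-configs into hparams.
# Values are made up; "hidden_size" exists in both the text model config and
# the vision sub-config, with different meanings.

hparams = {
    "hidden_size": 2048,  # text model
    "thinker_config": {
        "vision_config": {"hidden_size": 1024, "depth": 27},
    },
}

# naive merge: flatten every sub-config into the top level
merged = dict(hparams)
for sub in hparams["thinker_config"].values():
    merged.update(sub)

print(merged["hidden_size"])  # -> 1024, the text model's 2048 was clobbered
```

This is why the conversion code reads `vision_config` explicitly where needed instead of merging everything up front.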
Uhoh, I think \r\n strikes again...

hmm seems like it mostly happens when you add a new line in the suggestion, right?

Yes, but only when you (and one other person so far) commit them. :)
hmm, the conversion is broken somehow, the omni mmproj now only contains audio tensors, not vision tensors. digging into this...
OK it got a bit messy with multi inheritance, but I fixed it in the last commit. CC @pwilkin if you can give the 2nd approval, thanks!
pinging @ggml-org/maintainers if someone can give an approval, thanks!
no audio output for the nice voices?
* origin/master:
  - webui: MCP Diagnostics improvements (ggml-org#21803)
  - Remove extra conditional check on debug mode. (ggml-org#21798)
  - sycl: disable Q1_0 in backend and cleanup unused variables (ggml-org#21807)
  - mtmd: fix crash when sending image under 2x2 pixels (ggml-org#21711)
  - mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (ggml-org#19441)
  - convert : force f16 or f32 on step3-vl conv weights (ggml-org#21646)
  - mtmd: add gemma 4 test (vision + audio) [no ci] (ggml-org#21806)
  - mtmd: add Gemma 4 audio conformer encoder support (ggml-org#21421)
  - fix: Proper messages rendering for "Show raw output" (ggml-org#21672)
  - docs: add guide on how to add multimodal support (ggml-org#21778)
@ngxson sorry, somehow missed this one.
* add qwen3a
* wip
* vision ok
* no more deepstack for audio
* convert ASR model ok
* qwen3 asr working
* Apply suggestions from code review
* nits
* Apply suggestions from code review
* fix bad merge
* fix multi inheritance

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Yes, that would also be of interest to me. Is audio output planned?

Can you show an example of how to use it? I'm getting a short, low-quality transcription when transcribing a 1-minute audio file.