mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)#19441
ngxson merged 15 commits into ggml-org:master
Conversation
Any updates on this?

any updates?
Add support for Qwen3-ASR-1.7B model (Qwen3ASRForConditionalGeneration):

- New QWEN3A projector type for audio-only ASR models
- Conv2d encoder (3 layers, stride=2 each, 8x time downsampling)
- Whisper-like transformer encoder (24 layers)
- MLP projector: Linear(1024,1024) -> GELU -> Linear(1024,2048)
- Conversion tested: both mmproj and decoder GGUF files work
- Basic inference tested: model loads, encodes audio, generates output

Based on PR ggml-org#19441 by ngxson (WIP qwen3 audio), adapted for the Qwen3-ASR-only architecture (no vision, no deepstack). Our attention extraction API (llama_set_attn_heads/llama_get_attn_ith) is untouched.
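The "8x time downsampling" from three stride-2 conv layers can be checked with the standard conv output-length formula. A minimal sketch — the stride-2, 3-layer setup comes from the commit message above; kernel size 3 and padding 1 are assumptions (typical for Whisper-style front ends):

```python
# Sketch of the ~8x time downsampling produced by three stride-2 conv
# layers. kernel=3 and padding=1 are assumptions; only stride=2 x 3
# layers is stated in the PR.

def conv_out_len(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard conv output-length formula."""
    return (n + 2 * padding - kernel) // stride + 1

def encoder_frames(n_mel_frames: int) -> int:
    """Frames remaining after the 3-layer stride-2 conv stack."""
    for _ in range(3):
        n_mel_frames = conv_out_len(n_mel_frames)
    return n_mel_frames

# 30 s of audio at a 100 Hz mel frame rate -> 3000 frames -> 375 after ~8x.
print(encoder_frames(3000))  # -> 375
```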
I wrote working Qwen3-ASR support for my own use at https://github.com/michoecho/llama.cpp/commits/qwen3_asr_support (I successfully used it to transcribe some lectures in Chinese). I don't know if it's good enough for upstreaming, because I wasn't thinking about qwen3-omni at all (I have no idea what "deepstack" is). But you could use it as a working base if you are getting wrong results. At a glance, what mainly seems to be missing from this PR is:
By the way, note that Qwen3-ForcedAligner (the timestamp predictor model) has the same architecture as Qwen3-ASR, so if you implement support for the latter, you almost get support for the former too. "Almost" because the ForcedAligner is a non-autoregressive classification model. (You put in the encoded audio and the transcribed text with some
both qwen3-omni and qwen3-asr are working with this PR, GGUF will be uploaded shortly
Chunking can be implemented via a follow-up PR; this PR processes the input as a single 30s chunk for simplicity
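The follow-up chunking could look roughly like this: split the decoded PCM buffer into fixed 30 s windows and feed each one through the audio encoder in turn. A minimal sketch — the 16 kHz sample rate is an assumption (mtmd's usual audio input rate), and `chunk_audio` is a hypothetical helper, not part of this PR:

```python
# Hypothetical sketch of 30 s chunking for long audio inputs.
# Assumes mono PCM at 16 kHz; the last chunk may be shorter.

def chunk_audio(samples: list, sr: int = 16000, chunk_s: int = 30):
    """Yield consecutive fixed-length chunks of a sample buffer."""
    step = sr * chunk_s
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 75 s of audio -> 30 s, 30 s and 15 s chunks.
lengths = [len(c) for c in chunk_audio([0.0] * (16000 * 75))]
print(lengths)  # -> [480000, 480000, 240000]
```

Each chunk would then be encoded and its embeddings appended to the decoder context, which is the main design question a follow-up PR would need to settle (context carry-over between chunks).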
Thanks for pointing that out, it needs to be fixed in this PR
That was fixed by simply pushing a chatml jinja template to the GGUF upon conversion
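Concretely, "pushing a chatml template" means writing the standard ChatML jinja template into the GGUF metadata key that llama.cpp reads (`tokenizer.chat_template`). A minimal sketch — the dict below stands in for the GGUF KV store; the real conversion script writes this via gguf-py rather than a plain dict:

```python
# Sketch of embedding a ChatML jinja template as GGUF metadata.
# gguf_kv is a stand-in for the converter's metadata writer.

CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

gguf_kv = {}
gguf_kv["tokenizer.chat_template"] = CHATML_TEMPLATE
print("<|im_start|>" in gguf_kv["tokenizer.chat_template"])  # -> True
```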
Hmm yeah that sounds complicated, will see if it's worth implementing, given that another model (voxtral from mistral) has somewhat similar logic
```python
if "thinker_config" in self.hparams:
    vision_config = self.hparams["thinker_config"].get("vision_config", {})
else:
    vision_config = self.hparams.get("vision_config", {})
```
Instead of handling this everywhere, can't we just merge in all sub-configs in thinker_config here:
(llama.cpp/convert_hf_to_gguf.py, lines 974 to 976 at eefcfee)
hmm, that can be quite dangerous because a sub-config may have keys that conflict with the thinker_config
I think it's fine to keep this as-is (a bit lazy to re-test this). Plus, there is only one place in the whole file that does this.
for ref, normally a text model never has to read the vision config, but this is a special case for qwen3 to support "deep stack". From qwen3.5 on, they removed deep stack
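The conflict the review worries about is easy to illustrate: naively flattening thinker_config's sub-configs into the top-level hparams lets a sub-config key silently overwrite the text model's value. A hypothetical example with made-up config values:

```python
# Illustration of the key-conflict risk of merging sub-configs into hparams.
# Values are made up; "hidden_size" exists in both the text model config and
# the vision sub-config, with different meanings.

hparams = {
    "hidden_size": 2048,  # text model
    "thinker_config": {
        "vision_config": {"hidden_size": 1024, "depth": 27},
    },
}

# naive merge: flatten every sub-config into the top level
merged = dict(hparams)
for sub in hparams["thinker_config"].values():
    merged.update(sub)

print(merged["hidden_size"])  # -> 1024, the text model's 2048 was clobbered
```

This is why the conversion code reads `vision_config` explicitly where needed instead of merging everything up front.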
Uhoh, I think \r\n strikes again...

hmm seems like it mostly happens when you add a new line in the suggestion, right?

Yes, but only when you (and one other person so far) commit them. :)
hmm, the conversion is broken somehow, the omni mmproj now only contains audio tensors, not vision tensors. digging into this...
OK it got a bit messy with multi inheritance, but I fixed it in the last commit. CC @pwilkin if you can give the 2nd approval, thanks!
pinging @ggml-org/maintainers if someone can give an approval, thanks!
no audio output for the nice voices?
* origin/master:
  - webui: MCP Diagnostics improvements (ggml-org#21803)
  - Remove extra conditional check on debug mode. (ggml-org#21798)
  - sycl: disable Q1_0 in backend and cleanup unused variables (ggml-org#21807)
  - mtmd: fix crash when sending image under 2x2 pixels (ggml-org#21711)
  - mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (ggml-org#19441)
  - convert : force f16 or f32 on step3-vl conv weights (ggml-org#21646)
  - mtmd: add gemma 4 test (vision + audio) [no ci] (ggml-org#21806)
  - mtmd: add Gemma 4 audio conformer encoder support (ggml-org#21421)
  - fix: Proper messages rendering for "Show raw output" (ggml-org#21672)
  - docs: add guide on how to add multimodal support (ggml-org#21778)
@ngxson sorry, somehow missed this one.
* add qwen3a
* wip
* vision ok
* no more deepstack for audio
* convert ASR model ok
* qwen3 asr working
* Apply suggestions from code review
* nits
* Apply suggestions from code review
* fix bad merge
* fix multi inheritance

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Yes, that would also be of interest to me. Is audio output planned?

Can you show an example of how to use it? I'm getting a short, low-quality transcription when transcribing a 1-minute audio file.