mtmd: add Gemma 4 audio conformer encoder support #21421
ngxson merged 19 commits into ggml-org:master
Conversation
Hi @stephencox-ict, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Nice, seems to work but it's not 100% correct (using E4B, F16):
However, the correct transcription should be:
I haven't yet implemented chunked local self-attention. I'm focused on the testing side now and will come back to this.
Force-pushed from 6bf9d4a to 9729486
Force-pushed from 83d1f37 to 13e9f5e
Force-pushed from 29dd32e to 7435a59
JohannesGaessler left a comment:
The changes to test-llama-archs.cpp LGTM. In some of the other files, though, I see that you are adding code comments with em dashes. Please stick to ASCII unless there is a good reason not to.
Force-pushed from 9a5b23a to 1cbecb4
Fixed.
Instead of loading tensors into the wrong fields and swapping afterwards, load them directly into the correct fields by using the reversed GGUF tensor names at the loading site. This is cleaner and removes the need for the post-load swap loop. Addresses review comment from ngxson on 2026-04-11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
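For illustration, the load-site reversal described above amounts to something like the following sketch. The field names, tensor-name pattern, and the get_tensor/string_format helpers are hypothetical, not the PR's exact code; only the conv_norm/norm_conv naming comes from the PR itself.

```cpp
// Hypothetical sketch: the upstream converter writes each norm under the
// other's GGUF name, so we read each field from the *reversed* name instead
// of swapping after loading. Names and helpers below are illustrative only.
layer.conv_norm = get_tensor(string_format("a.blk.%d.norm_conv.weight", il)); // reversed on purpose
layer.norm_conv = get_tensor(string_format("a.blk.%d.conv_norm.weight", il)); // reversed on purpose
```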
Replace ggml_roll operations in the Gemma 4 audio conformer with equivalent ggml_view + ggml_concat sequences. The ROLL op has no Metal kernel, causing 73 graph splits and CPU fallbacks on Apple Silicon that likely cause the repetitive output reported by ngxson. With this change, all conformer ops run on a single backend (graph splits reduced from 73 to 1). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I don't think replacing ggml_roll with something else is a valid solution. Unsupported ops fall back to the CPU implementation. If your impl works on CUDA but fails on CPU (via …
@ngxson CUDA output gist: https://gist.github.com/stephencox/9ccab7b860d5be9c7f8df97b9e9f9525

I investigated the repetition issue and found the root cause: …

Fix (1389eea): replaced both …

Results: …
Could you try the latest commit and see if the repetitions are resolved on your M5 Max?
Also a reminder to use the latest Unsloth GGUFs (the earlier conversions had issues):
Fair point. I tested with … The view+concat replacement still has value as a performance improvement (it eliminates 73 graph splits on Metal, keeping the entire conformer on one backend), but you are right that it does not explain the repetitions if the CPU fallback works correctly. Could the repetition be related to the model/GGUF version? I have not been able to reproduce it on E2B BF16 (Unsloth). Which model and mmproj are you using?
Restore the target_arch filter that was accidentally removed when adding per-arch skip lists. Also remove redundant LLM_ARCH_UNKNOWN check that was already handled above. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hmm ok, thanks for the pointer. I tried the Unsloth version (Q4_K_M text + BF16 mmproj) and it is indeed working without repetition. I downloaded a fresh copy of https://huggingface.co/google/gemma-4-E4B-it and re-converted it. It turns out the mmproj is very sensitive to quantization:
So I think for now, the only way is to keep BF16 for the mmproj. I hope that will also fix some problems with image input (to be tested).
Thanks for confirming. This matches what we found during validation: the Gemma 4 audio conformer uses … So a BF16 mmproj is required for now. The Unsloth GGUFs ship with a BF16 mmproj, which is why they work. For the PR, should I add a note/warning about this in the code or docs?
I think adding a note on top of this PR should be fine, no need to add it to the docs or a code comment. Something like:

Important
It is recommended to use BF16 mmproj. Other quantizations are known to have degraded performance; ref comment: #21421 (comment)
Keep only the gemma4-specific fixture params and skip entries. The other arch skip lists (CLIP, GPTJ, CHAMELEON, RWKV, BERT, PLM, WAVTOKENIZER_DEC, etc.) are unrelated to this PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ity" This reverts commit 1389eea.
For ref, the repetition seems to be due to causal attention being set incorrectly on the vision model. It should be fixed in #21824; I tested with the Q8_0 mmproj and it works correctly now.
* mtmd: add Gemma 4 audio conformer encoder support

Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends. Transcribes:
"Glad to see things are going well and business is starting to pick up"
(matching ground truth).

Ref: ggml-org#21325
* origin/master:
  webui: MCP Diagnostics improvements (ggml-org#21803)
  Remove extra conditional check on debug mode. (ggml-org#21798)
  sycl: disable Q1_0 in backend and cleanup unused variables (ggml-org#21807)
  mtmd: fix crash when sending image under 2x2 pixels (ggml-org#21711)
  mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (ggml-org#19441)
  convert : force f16 or f32 on step3-vl conv weights (ggml-org#21646)
  mtmd: add gemma 4 test (vision + audio) [no ci] (ggml-org#21806)
  mtmd: add Gemma 4 audio conformer encoder support (ggml-org#21421)
  fix: Proper messages rendering for "Show raw output" (ggml-org#21672)
  docs: add guide on how to add multimodal support (ggml-org#21778)
Important
It is recommended to use BF16 mmproj. Other quantizations are known to have degraded performance; ref comment: #21421 (comment)
Overview
Add audio processing support for Gemma 4 models via a USM-style Conformer encoder.
Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D (stride 2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping (a minimal softcapping sketch follows this list)
- Output: 1024 → 1536 → RMSNorm → multimodal embedder
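As a side note, the softcapping step is the standard gemma-style pattern; a minimal ggml sketch, assuming the attention logits live in a tensor named kq (the variable name is illustrative, not the PR's code):

```cpp
// x = cap * tanh(x / cap), with cap = 50.0f per the description above
const float cap = 50.0f;
kq = ggml_scale(ctx, kq, 1.0f/cap);
kq = ggml_tanh (ctx, kq);
kq = ggml_scale(ctx, kq, cap);
```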
Chunked local attention (matching PyTorch reference):
- ggml_view_4d with stride 12
- dist < left_window_size condition

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula; see the formulas after this list)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998
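For reference, the HTK mel scale and the PyTorch unfold-style frame count mentioned above are the standard formulas (frame_length = 320 samples as listed; hop_length as configured by the preprocessor):

$$m(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right), \qquad n_{\text{frames}} = \left\lfloor\frac{n_{\text{samples}} - \text{frame\_length}}{\text{hop\_length}}\right\rfloor + 1$$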
Fixes (beyond the initial encoder):
- Contiguous sigmoid input (gemma4a.cpp): wrap the GLU gate view in ggml_cont() before ggml_sigmoid(). The non-contiguous view caused CUDA/Vulkan to fall back to CPU for sigmoid, creating 25 graph splits and numerical divergence on longer audio.
- Conv norm swap at load site (clip.cpp): the upstream tensor_mapping.py maps the gemma4 audio lconv1d norms with swapped GGUF names (conv_norm ↔ norm_conv). The loader now loads tensors in reverse order at the load site to correct this, rather than swapping post-load. Verified by element-wise comparison against Python transformers safetensors weights.
- Replace ggml_roll with ggml_view + ggml_concat (gemma4a.cpp): the ggml_roll op has no Metal kernel, causing 73 graph splits and CPU fallbacks on Apple Silicon. Replaced with equivalent view+concat sequences that are supported on all backends; audio encoder graph splits reduced from 73 to 1 on Metal. A sketch of the pattern follows this list.
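For illustration, a minimal sketch of the view+concat pattern for a wraparound roll along the row dimension of a 2D tensor. The helper name roll_rows and the 2D shape are hypothetical assumptions, not the PR's exact code:

```cpp
#include "ggml.h"

// Emulate rolling the rows of a 2D tensor x by `shift` (with wraparound)
// using only view + cont + concat, which have kernels on all backends.
static struct ggml_tensor * roll_rows(struct ggml_context * ctx,
                                      struct ggml_tensor  * x,   // 2D: [ne0, ne1]
                                      int64_t               shift) {
    const int64_t nr = x->ne[1];
    // last `shift` rows ...
    struct ggml_tensor * tail = ggml_view_2d(ctx, x, x->ne[0], shift,
                                             x->nb[1], (nr - shift)*x->nb[1]);
    // ... followed by the first nr - shift rows
    struct ggml_tensor * head = ggml_view_2d(ctx, x, x->ne[0], nr - shift,
                                             x->nb[1], 0);
    // concatenate along dim 1 to form the rolled tensor
    return ggml_concat(ctx, ggml_cont(ctx, tail), ggml_cont(ctx, head), 1);
}
```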
Usage:

```
llama-mtmd-cli \
  -m gemma-4-E2B-it-Q6_K.gguf \
  --mmproj mmproj-BF16.gguf \
  --audio sample.wav \
  -p "Transcribe this audio exactly." \
  --temp 1.0 --top-k 64 --top-p 0.95 \
  -ngl 99 --jinja
```

Audio transcription results (E2B, best across CPU/Vulkan/CUDA):
Short audio (5.9s LibriSpeech):
Long audio (17.4s, moon landing narration):
E2B short audio: 14/14 PASS. All quantizations correctly transcribe across all backends.
⏳ = the model's thinking block consumed all tokens before outputting the transcription. Higher -n values resolve this.

E4B also tested: 19/21 short PASS, 20/21 long PASS/PARTIAL. Only ultra-low 2-bit quants (UD-IQ2_M, UD-IQ3_XXS) fail. See the PR comment for the full E4B matrix.
Resampling note: For best audio quality, provide input already at 16kHz. Audio at other sample rates will be resampled using miniaudio's linear resampler regardless of container format (WAV, MP3, FLAC).
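For illustration only, a generic linear-interpolation resampler sketch; the PR itself relies on miniaudio's resampler, and this standalone function just shows the idea:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Resample mono float PCM from sr_in to sr_out by linear interpolation.
// Illustrative sketch only; llama.cpp uses miniaudio's resampler instead.
static std::vector<float> resample_linear(const std::vector<float> & in,
                                          int sr_in, int sr_out) {
    if (sr_in == sr_out || in.empty()) return in;
    const double ratio = (double) sr_in / sr_out;
    const size_t n_out = (size_t)(in.size() / ratio);
    std::vector<float> out(n_out);
    for (size_t i = 0; i < n_out; ++i) {
        const double pos  = i * ratio;             // fractional source index
        const size_t i0   = (size_t) pos;
        const size_t i1   = std::min(i0 + 1, in.size() - 1);
        const float  frac = (float)(pos - i0);
        out[i] = in[i0] + frac * (in[i1] - in[i0]); // lerp between neighbors
    }
    return out;
}
```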
Generation parameters (from the model's generation_config.json): --temp 1.0 --top-k 64 --top-p 0.95

Additional information
Test plan:
- test-mtmd-c-api passes
- test-llama-archs passes

Dependency: #21625 (per-layer embedding scale for multimodal path) improves transcription reliability on longer audio (~17s+).
Ref: #21325
Related: #21599 (quantization fix: force Q6_K minimum for Gemma4 tied embeddings)
Related: #21612 (merged: perform per-layer projections in the first layer)
Related: #21625 (dependency: per-layer embedding scale for multimodal path)
Requirements