
keeping turboquant up to date v4 #25

Merged
InfernalDread merged 5 commits into InfernalDread:turboquant_kv_cache_updated_v4 from ggml-org:master on Apr 12, 2026

Conversation

@InfernalDread (Owner)

No description provided.

ngxson and others added 5 commits April 12, 2026 13:02
* docs: add guide on how to add multimodal support

* nits
* mtmd: add Gemma 4 audio conformer encoder support

Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder
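The logit softcapping listed above can be sketched as a scaled tanh that squashes attention logits into (-cap, cap). This is the common formulation (an assumption here; the commit only states the cap value of 50.0), and the function name is illustrative:

```cpp
#include <cmath>

// Soft-cap an attention logit into (-cap, cap) via a scaled tanh.
// Sketch only: the 50.0 default matches the commit message; the actual
// kernel-side implementation may differ.
inline float softcap_logit(float logit, float cap = 50.0f) {
    return cap * std::tanh(logit / cap);
}
```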

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998
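The HTK mel scale used by the preprocessor follows the standard HTK conversion, m = 2595 * log10(1 + f/700). A minimal sketch of the forward and inverse mappings (function names are illustrative, not the preprocessor's actual API):

```cpp
#include <cmath>

// HTK mel scale: m = 2595 * log10(1 + f / 700)
inline float hz_to_mel_htk(float hz) {
    return 2595.0f * std::log10(1.0f + hz / 700.0f);
}

// Inverse mapping: f = 700 * (10^(m / 2595) - 1)
inline float mel_to_hz_htk(float mel) {
    return 700.0f * (std::pow(10.0f, mel / 2595.0f) - 1.0f);
}
```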

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.
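The tensor-loading dedup fix above relies on `std::set::insert` reporting whether the name was already seen. A hypothetical sketch of the guard pattern (the struct and member names are illustrative, not the actual mtmd loader code):

```cpp
#include <set>
#include <string>
#include <vector>

// Sketch of the std::set guard: skip tensor names that were already
// loaded so repeated get_tensor() calls cannot create duplicate
// entries in ctx_data.
struct tensor_loader {
    std::set<std::string>    loaded;   // names already materialized
    std::vector<std::string> ctx_data; // stand-in for the real context

    // Returns true if the tensor was newly loaded, false if deduped.
    bool get_tensor(const std::string & name) {
        if (!loaded.insert(name).second) {
            return false; // already present: do not load again
        }
        ctx_data.push_back(name);
        return true;
    }
};
```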

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325
* mtmd: add gemma 4 test (vision + audio)

* add to docs
@InfernalDread InfernalDread merged commit d569ad0 into InfernalDread:turboquant_kv_cache_updated_v4 on Apr 12, 2026
80 of 105 checks passed
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
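The relaxed mixed-type guard can be sketched as a symmetric check that both the K and V cache types fall in the allowed set. This is a simplified stand-in (enum and function names are illustrative; the real check lives in get_best_fattn_kernel):

```cpp
// Sketch: any pairing of turbo2, turbo3 and q8_0 K/V cache types is
// eligible for the CUDA VEC flash-attention kernels.
enum kv_cache_type { KV_TURBO2, KV_TURBO3, KV_Q8_0, KV_OTHER };

inline bool vec_fattn_supported(kv_cache_type type_k, kv_cache_type type_v) {
    auto allowed = [](kv_cache_type t) {
        return t == KV_TURBO2 || t == KV_TURBO3 || t == KV_Q8_0;
    };
    return allowed(type_k) && allowed(type_v);
}
```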

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
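Since the 4-bit format stores one of 16 centroid indices per element, two indices pack into a single byte. A host-side sketch of that nibble packing (the warp-shuffle packing in k_set_rows_turbo4 is the GPU analogue; these helper names are illustrative):

```cpp
#include <cstdint>

// Pack two 4-bit centroid indices (0..15) into one byte:
// low nibble = first index, high nibble = second index.
inline uint8_t pack_nibbles(uint8_t lo, uint8_t hi) {
    return (uint8_t) ((lo & 0x0F) | ((hi & 0x0F) << 4));
}

inline uint8_t unpack_lo(uint8_t b) { return b & 0x0F; }
inline uint8_t unpack_hi(uint8_t b) { return b >> 4; }
```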

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

5 participants