Sync with ggml.org master #73
Closed
gehasia wants to merge 277 commits into TheTom:master from
Conversation
…TheTom#31 TheTom#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q has ne[0] = 256 (GQA concatenated heads) while the rotation matrix has ne[0] = 128
2. the mctx dynamic_cast failed for MoE hybrid memory

FIX: put the inverse WHT rotation back in dequantize_full_block. This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16: 6.121
- q8_0: 6.111
- q4_0: 6.142
- turbo3: 6.194 (+1.4% vs q8_0) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented to work with the GQA head layout and hybrid memory types.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quality confirmed: PPL 6.194 (+1.4% vs q8_0)
Speed: 10.7 tok/s (inverse rotation in dequant, no pre-rotate-queries)

Previous speed claims (51-77 tok/s) were invalid: they measured the throughput of garbage output. Key lessons documented for future reference.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prefill speed: 739 → 1074 tok/s (0.40x q8_0, was 0.27x)
Quality: PPL 6.195 (unchanged from the 6.194 baseline, +1.4% vs q8_0)

Metal shader changes:
- turbo3_dequantize_full_block: the WHT butterfly now runs in fp16 (half). Centroids fit in fp16 (max |val| = 0.19) and the butterfly add/sub stays in range. 2x throughput on Apple Silicon Metal fp16 ALUs.
- dequantize_turbo3_0_t4: cooperative SIMD dequant for flash_attn_ext_vec. All 32 SIMD lanes work on the same block: each unpacks only its 4 elements, and the WHT butterfly runs across lanes via simd_shuffle. Eliminates 31/32 redundant full-block dequants.

Graph changes:
- Removed the broken pre-rotate-queries code (WHT and RoPE do not commute: the KV cache stores WHT(RoPE(K)) but the graph rotation gave RoPE(WHT(Q)))
- Added TODO comments documenting the root cause and fix path

KV cache changes:
- Fixed the rotation matrix storage comments (R vs R^T, after ggml layout analysis)
- Fixed clear(true) zeroing the rotation tensors without reinitializing them (Codex catch)
- Corrected ggml_backend_tensor_set to store R/R^T in the correct orientation

Docs:
- quality-benchmarks.md: top-of-tree quality+speed table
- turbo-speed-investigation.md: fp16 WHT results, RoPE/WHT commutativity
- pre-rotate-queries-investigation.md: full debugging log (20+ builds)
- turbo-quality-gate.sh: pre-push perplexity check script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
These docs belong in our project, not in a fork of someone else's repo. Moved to https://github.com/TheTom/turboquant_plus/tree/main/docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Prefill: 1411 tok/s (0.52x q8_0, was 0.40x)
PPL: 6.195 (unchanged, within 0.001 of baseline)

Metal shader: turbo3_dequantize_full_block
- The WHT butterfly now uses 32 x half4 vectors instead of 128 x half scalars. Stages h=1,2: intra-vector swizzle (half4 constructor reorder). Stages h=4..64: inter-vector butterfly with a computed stride. A plain C++ simulation of the staged layout follows.
- Centroid lookup processes natural byte boundaries (4 elements per qs byte)
- Sign application and norm scaling use vectorized half4/float4

Codex review: no correctness bugs. Butterfly pairing, centroid unpacking, and sign application all verified correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
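For readers following along, here is a minimal C++ simulation of the staged layout described above. It is a sketch, not the kernel: `vec4` stands in for Metal's `half4`, everything runs in fp32, and the overall normalization is omitted. Elements are packed 4 per vector, so the butterfly partner rule `index ^ h` becomes an in-vector reorder for h=1,2 and a whole-vector pairing at stride h/4 for h=4..64.

```cpp
// Sketch only: vec4 stands in for Metal half4; fp32 instead of fp16.
struct vec4 { float x[4]; };

static void wht_128_staged(vec4 v[32]) {
    for (int g = 0; g < 32; ++g) {        // stage h=1: pairs (0,1), (2,3) in-vector
        const vec4 a = v[g];
        v[g] = vec4{ a.x[0] + a.x[1], a.x[0] - a.x[1],
                     a.x[2] + a.x[3], a.x[2] - a.x[3] };
    }
    for (int g = 0; g < 32; ++g) {        // stage h=2: pairs (0,2), (1,3) in-vector
        const vec4 a = v[g];
        v[g] = vec4{ a.x[0] + a.x[2], a.x[1] + a.x[3],
                     a.x[0] - a.x[2], a.x[1] - a.x[3] };
    }
    for (int h = 4; h < 128; h <<= 1) {   // stages h=4..64: inter-vector butterfly
        const int s = h / 4;              // computed stride in vector units
        for (int g = 0; g < 32; ++g) {
            if (g & s) continue;          // visit each pair once, from the low side
            const vec4 a = v[g], b = v[g ^ s];
            for (int e = 0; e < 4; ++e) {
                v[g].x[e]     = a.x[e] + b.x[e];
                v[g ^ s].x[e] = a.x[e] - b.x[e];
            }
        }
    }
    // 1/sqrt(128) normalization omitted here for brevity.
}
```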
Pre-computed turbo_wht_signs1_h4[32] and turbo_wht_signs2_h4[32] as constant half4 arrays. This eliminates the per-element float→half conversion and reduces constant memory reads from 4 per half4 to 1.

Marginal improvement (~1%): the Metal compiler had already optimized the constant reads. But the code is cleaner and consistent with the half4 WHT.

PPL: 6.195 (unchanged)
Codex: no issues (included in the Exp1 review scope)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
THE BIG WIN: moved the WHT rotation from per-block dequant to graph-level ggml_mul_mat ops. 47% speedup over the previous best.

Prefill: 2095 tok/s (0.78x q8_0, was 1424 = 0.53x)
PPL: 6.201 (within 0.01 of the 6.195 baseline)
Compression: 4.9x (unchanged)

Key insight: applying the WHT in build_attn (after RoPE, before build_attn_mha) matches the K quantize pipeline exactly. K stores WHT(RoPE(K)) from SET_ROWS, and Q becomes WHT(RoPE(Q)) from the graph mul_mat. Because the rotation is orthogonal, dot products are preserved (see the sketch below).

Changes:
- llama-graph.cpp: Q forward rotation (R @ q) and V un-rotation (R^T @ cur) in the llm_graph_input_attn_kv build_attn overload
- ggml-metal.metal: stripped the WHT from turbo3_dequantize_full_block (it returns centroid * norm in rotated space; the graph handles un-rotation)

Codex review: pipeline point correct, reshape dims correct, lifecycle OK. Noted: only covers one build_attn overload (sufficient for Qwen3MoE).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
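A self-contained demo of the invariant this commit relies on (not the PR's code): the normalized WHT is orthogonal, so ⟨WHT(q), WHT(k)⟩ equals ⟨q, k⟩. Rotating Q in the graph while the KV cache stores WHT(RoPE(K)) therefore leaves the attention logits unchanged.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

constexpr int d = 128;

// In-place fast Walsh-Hadamard transform with orthonormal scaling.
static void wht(float *x) {
    for (int h = 1; h < d; h <<= 1)            // butterfly stages h = 1..64
        for (int i = 0; i < d; i += 2 * h)
            for (int j = i; j < i + h; ++j) {
                const float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
    for (int j = 0; j < d; ++j)
        x[j] /= std::sqrt((float) d);          // 1/sqrt(128): makes WHT orthogonal
}

int main() {
    float q[d], k[d];
    std::mt19937 rng(0);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    for (int i = 0; i < d; ++i) { q[i] = gauss(rng); k[i] = gauss(rng); }

    double before = 0, after = 0;
    for (int i = 0; i < d; ++i) before += (double) q[i] * k[i];
    wht(q);
    wht(k);
    for (int i = 0; i < d; ++i) after += (double) q[i] * k[i];
    std::printf("<q,k> = %.6f, <WHT q, WHT k> = %.6f\n", before, after);
    return 0;
}
```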
THE BREAKTHROUGH: block-32 with graph-side WHT rotation reaches q8_0 parity.

Prefill: 2747 tok/s (1.02x q8_0, was 0.78x with block-128)
PPL: 5.460 (32-chunk) / 6.193 (8-chunk), within noise of baseline
Compression: 4.6x (slightly less than 4.9x due to per-block norm overhead)

Changes:
- QK_TURBO3: 128 → 32 (matches the q4_0 block size for GPU parallelism)
- dequantize_turbo3_0: simple centroid lookup + norm scale (no WHT, no full-block pass)
- dequantize_turbo3_0_t4: same simple path (no SIMD shuffle needed)
- Flash attention nl: 8 → 2 (non-vec), 32 → 8 (vec), matching the new block size

Why this works: with graph-side WHT rotation, dequant no longer needs the 128-element WHT butterfly, so each 32-element block can be decoded independently. Smaller blocks mean more GPU parallelism and faster flash attention. A sketch of the simplified per-block path follows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
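A minimal C++ sketch of the simplified block-32 path. The field names, bit layout, and centroid values here are assumptions for illustration, not the exact block_turbo3 struct: the point is that with the WHT handled at graph level, dequant is just a centroid lookup, a sign flip, and a per-block norm scale, with no cross-element dependencies.

```cpp
#include <cstdint>

// Placeholder centroid magnitudes (the real table is learned).
static const float k_centroids[4] = {0.03f, 0.08f, 0.13f, 0.19f};

static void dequant_block32(const uint8_t qs[8],    // assumed: 2-bit indices, 4 per byte
                            const uint8_t signs[4], // assumed: 1 sign bit per element
                            float norm,             // per-block norm, already fp16 -> fp32
                            float out[32]) {
    for (int i = 0; i < 32; ++i) {
        const int   idx = (qs[i / 4] >> (2 * (i % 4))) & 0x3;
        const bool  neg = (signs[i / 8] >> (i % 8)) & 0x1;
        const float c   = k_centroids[idx];
        out[i] = (neg ? -c : c) * norm;   // no WHT here: the graph un-rotates
    }
}
```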
Added a TURBO_LAYER_ADAPTIVE env var for per-layer cache type selection:
0 = uniform (default)
1 = q8_0 for the first and last 4 layers, turbo3 for the middle 32
2 = q8_0 for the last 8 layers, turbo3 for the first 32

Results (Qwen3.5-35B-A3B, 8 chunks):
uniform turbo3: PPL = 6.193 (+1.3% vs q8_0)
mode 1: PPL = 6.185 (+1.2% vs q8_0)
mode 2: PPL = 6.110 (+0.0% vs q8_0!!!)

Mode 2 achieves q8_0 quality (PPL 6.110 vs 6.111) while compressing 32 of 40 layers at turbo3 (4.6x). Only the last 8 layers use q8_0. Effective compression: ~3.5x overall vs 2.0x for uniform q8_0. A sketch of the selection logic follows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
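A sketch of the per-layer selection, assuming the mode semantics above (function and enum names are illustrative, not the fork's actual code):

```cpp
#include <cstdlib>

enum cache_type { CACHE_TURBO3, CACHE_Q8_0 };

// Pick the KV cache type for layer il out of n_layer total layers.
static cache_type layer_cache_type(int il, int n_layer) {
    const char *env  = std::getenv("TURBO_LAYER_ADAPTIVE");
    const int   mode = env ? std::atoi(env) : 0;
    switch (mode) {
        case 1: // q8_0 for the first and last 4 layers
            return (il < 4 || il >= n_layer - 4) ? CACHE_Q8_0 : CACHE_TURBO3;
        case 2: // q8_0 for the last 8 layers only
            return (il >= n_layer - 8) ? CACHE_Q8_0 : CACHE_TURBO3;
        default: // mode 0: uniform turbo3
            return CACHE_TURBO3;
    }
}
```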
…ow guard

1. Thread-safe static init via a C++ lambda (there was a data race on a static int)
2. Guard n_layer >= 8 to prevent unsigned underflow on small models
3. Use const locals for n_layer and the is_turbo check

PPL verified: mode 2 still gives 6.1095 (matching the q8_0 baseline). A sketch of the init pattern and the guard follows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
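A sketch of the two fixes (names illustrative). C++11 guarantees thread-safe initialization of function-local statics, so an immediately-invoked lambda replaces the racy `static int mode = -1; if (mode < 0) ...` idiom; the explicit size check avoids `n_layer - 8` wrapping around when n_layer is unsigned and smaller than 8.

```cpp
#include <cstdlib>

static int turbo_adaptive_mode() {
    // Initialized exactly once, even under concurrent first calls.
    static const int mode = [] {
        const char *env = std::getenv("TURBO_LAYER_ADAPTIVE");
        return env ? std::atoi(env) : 0;
    }();
    return mode;
}

static bool layer_uses_q8(unsigned il, unsigned n_layer) {
    if (n_layer < 8) return false;   // guard: avoids unsigned underflow below
    return turbo_adaptive_mode() == 2 && il >= n_layer - 8;
}
```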
…n data

Part of TheTom#32: turbo3 prefill degrades relative to q8_0 as context length grows.

Changes so far:
- Skip ggml_cont when tensors are already contiguous (+1%, minimal)
- Generated 32x32 rotation matrices (turbo-rotation-data-32.h) for the reduced-group-size approach (16x less matmul compute)
- Fixed V un-rotation to check v->type, not k->type

Next: update QK_TURBO3_GROUP, the Metal WHT kernel, and the KV cache for d=32.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Reducing the WHT rotation group from 128 to 32 elements degrades quality. The Python kurtosis test showed 3.06 (good) on random data, but real Qwen3.5 KV tensors need 128-element groups for proper Gaussianization. Group-32 also didn't help speed; it was actually slower at all context sizes. This approach is a dead end.

Next: a custom GGML_OP_TURBO_WHT for O(d log d) rotation without a dense matmul.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Adds a new ggml operation for applying the WHT rotation to 128-element groups, replacing the previous dense ggml_mul_mat(128x128, ...) approach.

Implementation:
- ggml.h: new op enum + ggml_turbo_wht(tensor, direction) API
- ggml.c: constructor with the direction param in op_params
- ggml-cpu/ops.cpp: CPU impl (fp32 butterfly, parallel over groups; see the sketch below)
- ggml-metal.metal: Metal kernel (fp16 half4 vectorized butterfly)
- ggml-metal-device: pipeline getter, supports_op
- ggml-metal-ops: dispatch with a threadgroup-per-group layout
- llama-graph.cpp: uses ggml_turbo_wht instead of mul_mat + reshape

Results:
- PPL: 6.211 (within tolerance of the 6.19 baseline)
- Context scaling: same as the dense matmul (~8% gap at 4k vs q8_0)
- The matmul was NOT the bottleneck; per-KV-position dequant is

The custom op is still valuable: it eliminates the rotation tensor storage, cleans up the graph (no reshape/cont), and has the correct O(d log d) complexity. The context scaling regression comes from flash attention dequant cost, not the graph rotation.

Codex review: fixed a missing OP_NAME table entry. Noted the CPU fp32 vs Metal fp16 precision difference (acceptable; Metal is the target).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
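A sketch of the CPU path, assuming ggml's usual ith/nth threading convention (names are illustrative; the real implementation lives in ggml-cpu/ops.cpp). Groups are striped across threads and each 128-element group gets an independent O(d log d) butterfly instead of a dense 128x128 matmul.

```cpp
#include <cstdint>

// data: n_groups contiguous groups of 128 floats.
// ith/nth: this thread's index and the total thread count.
static void turbo_wht_f32(float *data, int64_t n_groups, int ith, int nth) {
    constexpr int d = 128;
    for (int64_t g = ith; g < n_groups; g += nth) {   // stripe groups over threads
        float *x = data + g * d;
        for (int h = 1; h < d; h <<= 1)               // O(d log d) butterfly
            for (int i = 0; i < d; i += 2 * h)
                for (int j = i; j < i + h; ++j) {
                    const float a = x[j], b = x[j + h];
                    x[j]     = a + b;
                    x[j + h] = a - b;
                }
        for (int j = 0; j < d; ++j)
            x[j] *= 0.08838834764831845f;             // 1/sqrt(128); direction handling omitted
    }
}
```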
Unrolled dequant with batched byte reads: each 4-element group reads its qs and signs bytes once instead of per-element. Codex-verified bit indexing. A sketch of the unrolled read follows.

Context scaling results:
ctx=1024: 0.981x q8_0 (was 0.976x)
ctx=2048: 0.989x q8_0 (was 0.960x)
ctx=4096: 0.981x q8_0 (was 0.921x)

The ratio now stays FLAT at ~98% of q8_0 across all context sizes. The previous 7.9% gap at 4k context is reduced to 1.9%.

PPL: 6.211 (within tolerance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
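A C++ rendering of the batched-read unroll, using the same assumed bit layout as the earlier block-32 sketch (2-bit magnitude indices, 4 per qs byte, separate sign bits; only the low 4 sign bits are used here). Reading each byte once and extracting all four fields in registers replaces four separate device loads per 4-element group.

```cpp
#include <cstdint>

static void dequant4(uint8_t qs_byte,     // one read: four 2-bit indices
                     uint8_t sign_bits,   // one read: low 4 bits used (assumed)
                     float norm,
                     const float lut[4],
                     float out[4]) {
    const int i0 =  qs_byte       & 0x3;
    const int i1 = (qs_byte >> 2) & 0x3;
    const int i2 = (qs_byte >> 4) & 0x3;
    const int i3 = (qs_byte >> 6) & 0x3;
    out[0] = ((sign_bits & 1) ? -lut[i0] : lut[i0]) * norm;
    out[1] = ((sign_bits & 2) ? -lut[i1] : lut[i1]) * norm;
    out[2] = ((sign_bits & 4) ? -lut[i2] : lut[i2]) * norm;
    out[3] = ((sign_bits & 8) ? -lut[i3] : lut[i3]) * norm;
}
```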
Checks both:
1. PPL within 5% of the q8_0 baseline (8-chunk wikitext-2)
2. Context scaling ratio > 0.95 at 4K context

Both must pass. Run: bash scripts/turbo-quality-gate.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Half-precision centroid table in the vec flash attention dequant. Reduces constant cache pressure at high access volumes.

Decode improvements:
Short: 75.3 → 77.2 tok/s (+2.5%)
8K: 59.2 → 67.3 tok/s (+13.7%)
48K (Mario PDF): 36.7 → 39.0 tok/s (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Half LUT for cache pressure + float4 * scalar norm (one vector multiply instead of four scalar multiplies). Verified on main: PPL 6.211, decode 78.4 tok/s short / 68.3 tok/s at 8K.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
llama-bench had a hardcoded ggml_type_from_name() that didn't include the turbo types. Now turbo3 and turbo4 work with the -ctk/-ctv flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Replace the single 8-entry constant half LUT with two 4-entry LUTs (one for positive and one for negative centroids). Each lookup now has only 4 possible constant addresses instead of 8, reducing the divergent constant cache access that causes a 10x decode slowdown on M1 hardware. A sketch of the split lookup follows.

Codex review caught a sign-mapping bug in the initial magnitude+sign approach: the sorted centroid LUT has reversed magnitude order for negative values. The split LUT avoids this by keeping the original index mapping within each half.

PPL: 6.2109 (identical to main)
Decode on M5: 74.0 tok/s (vs 77.4 on main, a 4.4% regression on M5)
Target: significant improvement on M1, where the constant cache is the bottleneck

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
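A sketch of the split lookup (centroid values are placeholders; the real tables live in Metal constant memory). The sign bit picks one of two 4-entry tables, so a lookup can touch at most 4 distinct constant addresses, and each half keeps its original index order, avoiding the reversed-magnitude mapping bug Codex caught.

```cpp
#include <cstdint>

// Placeholder values: positive and negative halves of the original
// 8-entry table, each kept in its original index order.
static const float k_lut_pos[4] = {  0.03f,  0.08f,  0.13f,  0.19f };
static const float k_lut_neg[4] = { -0.03f, -0.08f, -0.13f, -0.19f };

static inline float centroid_lookup(uint8_t idx /* 0..3 */, bool negative) {
    return negative ? k_lut_neg[idx] : k_lut_pos[idx];
}
```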
Signs can mix per element within a thread's 4-element dequant: each element independently selects from the positive or negative LUT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Port of @spiritbuun's norm correction from CUDA to the Metal SET_ROWS path. After quantizing all 128 elements in a group, compute the L2 norm of the centroid reconstruction vector and store

corrected_norm = original_norm / ||centroid_vector||

instead of the raw original_norm. This corrects the systematic norm shrinkage from codebook quantization. Zero decode cost: the dequant code is unchanged and simply reads a better stored norm value. It only adds 128 FMAs to the quantizer (not a hot path). A sketch follows.

Results (Qwen3.5-35B-A3B, wikitext-2):
Before: PPL 6.2109 (8-chunk), 5.4714 (32-chunk), +1.6% vs q8_0
After: PPL 6.1756 (8-chunk), 5.4451 (32-chunk), +1.1% vs q8_0
q8_0: PPL 6.1109 (8-chunk), 5.4145 (32-chunk)

A 0.5% quality improvement at literally zero speed cost.

Original CUDA implementation: github.com/spiritbuun/llama-cpp-turboquant-cuda (commit 721880c)

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
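A sketch of the quantizer-side correction (loop structure illustrative, not the Metal SET_ROWS code): after picking centroids for a 128-element group, rescale the stored norm by the L2 norm of the centroid reconstruction so that dequant restores the original magnitude on average. The dequant side is untouched.

```cpp
#include <cmath>

// recon: the 128-element centroid reconstruction of one group.
// Returns the corrected norm to store in place of original_norm.
static float corrected_norm(const float recon[128], float original_norm) {
    double sum_sq = 0.0;
    for (int i = 0; i < 128; ++i)
        sum_sq += (double) recon[i] * recon[i];   // the extra 128 FMAs
    const float len = (float) std::sqrt(sum_sq);
    return len > 0.0f ? original_norm / len : original_norm;
}
```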
…text overflow

Two bugs that caused turbo3 to silently fail on pre-M5 Apple Silicon:

1. turbo3/turbo4 require flash attention for the dequant path, but llama-bench defaults to flash_attn=disabled. Auto-enable FA when turbo cache types are detected, with a warning log message (see the sketch below). This fixes context creation failures on M2 Pro/Max and similar hardware.

2. The KV cache ggml context was sized for exactly the K/V tensors per layer, but the turbo types add 2 rotation matrix tensors (turbo_rotation and turbo_rotation_inv) that weren't accounted for. Add a +2 tensor overhead to prevent the GGML_ASSERT(obj_new) failure.

Tested on M5 Max (Apple9/has_tensor=true) and M2 Pro (Apple8/has_tensor=false).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
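A minimal sketch of the auto-enable logic; the struct and checks here are local stand-ins for the fork's real llama-bench parameters and type test, not its actual API.

```cpp
#include <cstdio>

struct kv_params {
    bool flash_attn;
    bool type_k_is_turbo;   // stand-in for a GGML_TYPE_TURBO3/4 check
    bool type_v_is_turbo;
};

static void fixup_flash_attn(kv_params &p) {
    if ((p.type_k_is_turbo || p.type_v_is_turbo) && !p.flash_attn) {
        std::fprintf(stderr,
            "warning: turbo KV cache types require flash attention, enabling it\n");
        p.flash_attn = true;   // otherwise context creation fails on pre-M5 GPUs
    }
}
```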
Ported @spiritbuun's register centroid×norm LUT from CUDA to Metal.

On CUDA: 96-97% of q8_0 decode (a big win).
On Metal: 75.2 tok/s vs 77.4 on main (SLOWER, due to register spill).

The cn[8] float array spills to device memory on Metal's smaller register file, making it slower than constant memory access. Reverted to the proven constant half LUT + float norm broadcast.

This is a fundamental Metal vs CUDA architecture difference:
- CUDA: 255 registers per thread; cn[8] fits easily
- Metal: smaller register file; 8 floats cause a spill

The split-LUT approach (2x4 half entries) was also tested earlier and showed a similar regression (74.0 tok/s). A constant half[8] LUT with float norm broadcast remains the fastest vec dequant on Apple Silicon.

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…ling (Issue TheTom#29)

Three bugs from the block-size-32 refactor:

1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4. Split into separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels; turbo4 now correctly does 3-bit PolarQuant + QJL residual correction.

2. Integer division in n_groups = nk0 / blocks_per_group silently dropped tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling division with tail-group bounds checking in turbo3, and a GGML_ASSERT in the WHT dispatch to catch non-128-aligned tensors (see the sketch below).

3. The TURBO_D constant was semantically coupled to QK_TURBO4. Replaced it with TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added a static_assert that QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift.

Closes TheTom#29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
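A sketch of the tail-safe group loop from fix 2 (names follow the commit message; the surrounding kernel code is assumed). Ceiling division keeps the final partial group for sizes that are not a multiple of the group width, and the inner bound clamps the tail group's block range.

```cpp
#include <algorithm>

static void process_groups(int nk0, int blocks_per_group) {
    // ceil division: was nk0 / blocks_per_group, which dropped the tail
    const int n_groups = (nk0 + blocks_per_group - 1) / blocks_per_group;
    for (int g = 0; g < n_groups; ++g) {
        const int b0 = g * blocks_per_group;
        const int b1 = std::min(b0 + blocks_per_group, nk0); // tail-group bound
        for (int b = b0; b < b1; ++b) {
            // ... dequantize block b ...
        }
    }
}
```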
…stack

turbo_init_rotation() allocated a 128x128 float array (64KB) on the stack to generate the random Gaussian matrix, then memcpy'd it into the static turbo_rotation[]. llama.cpp worker threads have reduced stack sizes, causing a segfault on the first turbo4 quantize call.

Fix: generate directly into the static turbo_rotation[] array, eliminating the intermediate stack allocation entirely. The Gram-Schmidt QR decomposition already runs in-place on turbo_rotation[]. A sketch follows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
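A simplified sketch of the fix (the seed and RNG choice here are illustrative, and the QR step is omitted): write the Gaussian samples straight into the static array instead of staging a 64 KB buffer on a worker thread's reduced stack.

```cpp
#include <random>

static float turbo_rotation[128 * 128];   // static storage (.bss), not the stack

static void turbo_init_rotation(void) {
    std::mt19937 rng(42);                 // seed choice is illustrative
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    for (int i = 0; i < 128 * 128; ++i)
        turbo_rotation[i] = gauss(rng);   // was: 64 KB local array + memcpy
    // Gram-Schmidt QR then orthonormalizes turbo_rotation[] in place.
}
```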
Unroll the qs/signs extraction into separate variables before the centroid lookup. This helps the Metal compiler schedule device reads ahead of ALU work. Ported from spiritbuun's CUDA batched-load pattern.

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
TURBO_PROFILE_MODE env var (0-4):
0 = full dequant (batched extract, production)
1 = no-op (zeros): decode ceiling without any dequant cost
2 = norm only: isolate norm read overhead
3 = norm + qs, skip signs: isolate the signs byte cost
4 = full read, constant centroid: isolate LUT indexing cost

Set at runtime: TURBO_PROFILE_MODE=1 ./build/bin/llama-bench ...

A C++ rendering of the mode ladder follows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
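A C++ rendering of the mode ladder (the real switch lives in the Metal kernel; names and the constant 0.1f are illustrative). Each mode re-adds one cost source, so the delta between adjacent modes isolates that source's overhead.

```cpp
#include <cstdint>

static float profiled_dequant(int mode, float norm, uint8_t qs_idx, bool neg,
                              const float lut[4]) {
    switch (mode) {
        case 1: return 0.0f;                          // ceiling: no dequant work at all
        case 2: return norm;                          // + norm read only
        case 3: return lut[qs_idx] * norm;            // + qs read, signs skipped
        case 4: return (neg ? -0.1f : 0.1f) * norm;   // + signs read, constant centroid
        default:                                      // mode 0: full production path
            return (neg ? -lut[qs_idx] : lut[qs_idx]) * norm;
    }
}
```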
…20797)

* use integer dot product for quantized KV flash attention
* small improvements
* fix SHMEM_STAGING indexing
* add missing KV type quants
* fixes
* add supported quants to FA tests
* readd fast paths for <8bit quants
* fix mmq gate and shmem checks
winkay2000 pushed a commit to winkay2000/llama-cpp-turboquant that referenced this pull request on Apr 14, 2026
Replace the serial pre-Viterbi (load, InnerQ, norm, FWHT) and post-Viterbi (find-min, recon norm) sections with parallel implementations using all threads. Backtrack and bitpack remain serial (inherently sequential).

3-bit: 512 threads, 16-warp argmin, t>=2 parallel recon (3 shifts of 3 = 9 bits)
2-bit: 256 threads, 8-warp argmin, t>=3 parallel recon (4 shifts of 2 = 8 bits)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
winkay2000 pushed a commit to winkay2000/llama-cpp-turboquant that referenced this pull request on Apr 14, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>