Merged
Changes from all commits · 170 commits
9b041de
WIP: add TurboQuant KV cache types (turbo3, turbo4)
TheTom Mar 25, 2026
9f3771a
feat: Metal kernels for TurboQuant KV cache (turbo3, turbo4) #21
TheTom Mar 25, 2026
70a313e
feat: full TurboQuant with rotation matrices in Metal kernels #21
TheTom Mar 25, 2026
dcd15a1
feat: inline rotation matrices in Metal shader + C round-trip test #21
TheTom Mar 25, 2026
793f157
fix: remove thread static from Metal dequantize, fix stale code #23
TheTom Mar 25, 2026
d4ee5b4
feat: replace dense 128x128 matvec with Fast Walsh-Hadamard rotation #26
TheTom Mar 25, 2026
8997c00
docs: detailed speed investigation plan for TurboQuant Metal shader #23
TheTom Mar 25, 2026
283441c
docs: log simd_broadcast attempt — no speed improvement #23
TheTom Mar 25, 2026
aede1bb
docs: log threadgroup attempt — no speed improvement, rethinking #23
TheTom Mar 25, 2026
1f1f8f2
docs: CRITICAL — dequant is NOT the bottleneck, no-op still 2.4 tok/s…
TheTom Mar 25, 2026
b73d683
fix: inline turbo-wht.h — was causing CPU fallback, not Metal! #23
TheTom Mar 25, 2026
b8410b3
docs: real Metal benchmarks after #include fix — 8× gap not 35× #23
TheTom Mar 25, 2026
0c9bada
docs: final investigation summary + upstream tracking #23 #27
TheTom Mar 25, 2026
456574c
docs: upstream competitive intel — pre-rotate-queries is the key #28
TheTom Mar 25, 2026
290732b
docs: speed ceiling test — 49 tok/s without dequant rotation (4.6× ga…
TheTom Mar 25, 2026
9549963
docs: pre-rotate-queries implementation plan + speed ceiling 49 tok/s
TheTom Mar 25, 2026
d27a4a5
feat: pre-rotate-queries optimization — 51.4 tok/s (5× speedup) #23
TheTom Mar 25, 2026
1659689
docs: final investigation summary — 2.4 → 51.4 tok/s journey complete…
TheTom Mar 25, 2026
f9841e0
feat: MSE-only mode — drop QJL, all 3 bits to PolarQuant #23
TheTom Mar 25, 2026
18f4241
docs: Change 2 not needed — Q rotation overhead is negligible
TheTom Mar 25, 2026
a1230c8
docs: block size is the bottleneck — q4_0 at block 32 = 100% of q8_0
TheTom Mar 25, 2026
aed7a94
feat: block size 32 — 77.7 tok/s MoE (91% of q8_0), 17.0 Qwopus (97%) 🎉
TheTom Mar 25, 2026
29786b2
fix: TURBO_D=128 independent of QK_TURBO3, file turbo4 bugs #29
TheTom Mar 25, 2026
c3a7afd
docs: final investigation log — 77.7 tok/s, 91% of q8_0
TheTom Mar 25, 2026
93c3e63
CRITICAL: turbo3 perplexity is 165.6 vs q8_0 6.1 — quality broken #30
TheTom Mar 25, 2026
753b872
CRITICAL: found TWO root causes for PPL=165 #30
TheTom Mar 25, 2026
3d9bcc7
docs: bisect confirms block size innocent, rotation access is the bug…
TheTom Mar 25, 2026
cf6270c
fix: restore inverse rotation in dequant — PPL 6.19 (1.2% of q8_0) #3…
TheTom Mar 25, 2026
d9b9725
docs: perplexity 6.194 confirmed — 1.4% of q8_0 #30
TheTom Mar 25, 2026
ded3e94
docs: complete quality benchmark summary + lessons learned #30
TheTom Mar 25, 2026
810b4b2
perf: fp16 WHT dequant + SIMD cooperative dequant — 45% speedup
TheTom Mar 25, 2026
b097f15
chore: move turboquant docs to turboquant_plus repo
TheTom Mar 25, 2026
ea35e51
perf: vectorized half4 WHT butterfly — 31% speedup (1074 → 1411 tok/s)
TheTom Mar 25, 2026
01fc3cd
perf: pre-packed half4 sign arrays — minor speedup (1411 → 1424 tok/s)
TheTom Mar 25, 2026
c76b717
perf: graph-side WHT rotation — 2095 tok/s (0.78x q8_0, was 0.53x)
TheTom Mar 25, 2026
316f88f
perf: block-32 + graph WHT — 2747 tok/s (1.02x q8_0!!!)
TheTom Mar 25, 2026
ccd1232
feat: layer-adaptive KV cache — q8_0 quality with 80% turbo3 compression
TheTom Mar 25, 2026
63c8d6a
fix: address Codex review on layer-adaptive — thread safety + underfl…
TheTom Mar 25, 2026
48e46bb
wip: context scaling fix — skip unnecessary ggml_cont + 32x32 rotatio…
TheTom Mar 26, 2026
99489db
experiment: group-32 rotation FAILED — PPL 7.06 (target 6.19)
TheTom Mar 26, 2026
8bf235b
feat: add GGML_OP_TURBO_WHT — custom O(d log d) Walsh-Hadamard Transform
TheTom Mar 26, 2026
2157f04
perf: optimized turbo3 dequant — eliminates context scaling regression
TheTom Mar 26, 2026
bc8ae28
ci: quality+speed gate script — PPL + context scaling check before push
TheTom Mar 26, 2026
7dd0af1
perf: fp16 centroid LUT — decode +6-14% at long context (#33)
TheTom Mar 26, 2026
abc6e88
perf: float norm broadcast in vec dequant — decode +2-3% over fp16 LUT
TheTom Mar 26, 2026
05412b3
fix: add turbo3/turbo4 cache types to llama-bench arg parser
TheTom Mar 26, 2026
a9ef409
experiment: split 2x4-entry constant LUT for M1 decode fix
TheTom Mar 26, 2026
9087f91
fix: Metal shader comment accuracy per Codex review
TheTom Mar 26, 2026
99da38b
cleanup: remove stray diagnostic output files
TheTom Mar 26, 2026
02268fc
feat: turbo3 norm correction — PPL 6.211 → 6.176 (free quality win)
TheTom Mar 26, 2026
929b8ba
fix: auto-enable flash attention for turbo cache types + fix ggml con…
TheTom Mar 26, 2026
5811aa5
experiment: register centroid LUT tested — register spill on Metal
TheTom Mar 26, 2026
b2a5a88
feat: CUDA port of TurboQuant3 KV cache compression (RTX 5090 / SM 12.0)
signalnine Mar 26, 2026
eb9a589
perf: enable MMA/TILE flash attention for turbo3 — 0.97x q8_0 prefill
signalnine Mar 26, 2026
8b36e47
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant — +31% decod…
signalnine Mar 27, 2026
9f23354
experiment: batched byte extraction + explicit bit field pre-extract
TheTom Mar 27, 2026
4b0918e
experiment: profiling modes for turbo3 decode bottleneck isolation
TheTom Mar 27, 2026
65ed372
fix: turbo4 SET_ROWS corruption, tail-block truncation, constant coup…
seanrasch Mar 27, 2026
830d76b
experiment: 4-entry magnitude LUT + branchless sign (XOR trick)
TheTom Mar 27, 2026
d602c8e
experiment: force non-vec FA path for turbo3 (nl=2 vs nl=8)
TheTom Mar 27, 2026
80430e3
fix: stack overflow in turbo4 CPU init — 64KB array on worker thread …
seanrasch Mar 27, 2026
1406691
experiment: zero-LUT select chain — 2-level ternary, no constant memory
TheTom Mar 27, 2026
3dfd54c
feat: auto-detect hardware, use 4-mag LUT on pre-M5 (+38-45% decode)
TheTom Mar 27, 2026
78fac6c
experiment: 2-pair half2 LUT — only 2 constant addresses per lookup
TheTom Mar 27, 2026
edfff21
experiment: deferred norm multiply (batch float4 * norm at end)
TheTom Mar 27, 2026
f29b8bb
revert to proven 4-mag + per-element norm (deferred norm was slower)
TheTom Mar 27, 2026
199d619
experiment: named-register centroid×norm — 4 constant reads upfront, …
TheTom Mar 27, 2026
39ba0c0
revert to 4-mag LUT (proven best), document all findings
TheTom Mar 27, 2026
bc41397
experiment: inline block processing — bypass template dequant in FA i…
TheTom Mar 27, 2026
ac9e3a7
experiment: inline block WORSE on M2 (-10-15%), reverted to 4-mag
TheTom Mar 27, 2026
f0c4c79
experiment: FULLY BRANCHLESS FMA decode — zero ternary, zero memory, …
TheTom Mar 27, 2026
6063cf4
final: 12 approaches tested, 4-mag LUT is the hardware limit
TheTom Mar 27, 2026
8888205
experiment: SIMD SHUFFLE magnitude select — cross-lane LUT replacement
TheTom Mar 27, 2026
d01a000
experiment: simd_shuffle 14.7 at 8K — close to 4-mag (15.1) but not b…
TheTom Mar 27, 2026
fd9538c
experiment: fused block dot — per-centroid Q accumulation, 4 constant…
TheTom Mar 27, 2026
e9d06b0
experiment: fused block dot 8.1 at 8K — worst result, 64 comparisons …
TheTom Mar 27, 2026
61a03d9
experiment: 4-mag helps M5 at 16K (+2.4%) but hurts at 32K (-7.3%)
TheTom Mar 27, 2026
7673d48
experiment: M5 LUT cost grows to 34% at 32K context
TheTom Mar 27, 2026
00a5423
feat: sparse V dequant — +12% decode at 32K, zero quality loss
TheTom Mar 27, 2026
7d1bd95
feat: sparse V dequant — +22% decode at 32K on M5, auto-enabled
TheTom Mar 27, 2026
7b885a5
Merge remote-tracking branch 'upstream/feature/turboquant-kv-cache' i…
seanrasch Mar 27, 2026
4c91451
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
signalnine Mar 27, 2026
065ef53
Merge pull request #4 from seanrasch/feature/turboquant-kv-cache
TheTom Mar 27, 2026
a52586e
Revert "Merge pull request #4 from seanrasch/feature/turboquant-kv-ca…
TheTom Mar 27, 2026
0a6078c
experiment: dedicated turbo4 SET_ROWS kernel + prefill FA kernels
TheTom Mar 28, 2026
972c76e
fix: graceful fallback for turbo3 with non-128-aligned head dims (iss…
signalnine Mar 28, 2026
9cdb872
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue…
signalnine Mar 28, 2026
f89c4f2
experiment: turbo4 2+1 bit packing — +33% decode, drop QJL
TheTom Mar 28, 2026
f284cc0
experiment: direct-extract turbo4 dequant — matches turbo3 speed
TheTom Mar 28, 2026
75e2769
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
signalnine Mar 28, 2026
eddfff7
experiment: 4-bit half-precision centroid LUT for turbo4 vec path
TheTom Mar 28, 2026
fef2832
experiment: fix turbo4 struct for 4-bit — Codex-caught OOB bug
TheTom Mar 28, 2026
d0d37b3
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vic…
signalnine Mar 28, 2026
c168011
experiment: 8-mag LUT tested, reverted — direct 16-LUT faster on M5
TheTom Mar 28, 2026
661794f
experiment: add turbo4_dequant_f16 compute shader (prefill prep)
TheTom Mar 28, 2026
53f1298
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
signalnine Mar 28, 2026
d2ca3c9
feat: TURBO4_USE_4BIT ifdef for ABI compatibility
TheTom Mar 28, 2026
3ef4d98
feat: complete 4-bit C reference for turbo4 — quantize + dequant
TheTom Mar 28, 2026
ca25246
Merge turbo4 4-bit PolarQuant into main
TheTom Mar 28, 2026
da6b0fd
feat: GGML_TYPE_TURBO2_0 — 2-bit TurboQuant KV cache (6.4x compression)
signalnine Mar 28, 2026
00ecbbe
fix: MLA inverse WHT group_size derived from K (not V) — fixes GLM-4.7
signalnine Mar 28, 2026
6fb85a6
feat: InnerQ per-channel equalization + turbo2 64-group fallback
signalnine Mar 28, 2026
4cf7145
fix: add turbo WHT rotation to ISWA build_attn — fixes Gemma 2
TheTom Mar 28, 2026
a5efe54
perf: sparse V dequant — skip negligible attention weights in VEC kernel
signalnine Mar 28, 2026
4c4511c
fix: require head_dim % 128 for turbo KV — fall back to q8_0 otherwise
signalnine Mar 29, 2026
172fc85
Merge signalnine/feature/turboquant-kv-cache (PR #3) — CUDA port
TheTom Mar 29, 2026
3380d3c
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
TheTom Mar 29, 2026
43f7d3d
feat: asymmetric K/V quant support for Metal flash attention
TheTom Mar 29, 2026
c1d9b34
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fall…
signalnine Mar 29, 2026
d46ac77
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37→192 t/s)
signalnine Mar 29, 2026
05b7fe3
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
signalnine Mar 29, 2026
b90b5e0
feat: CUDA port of turbo4 (4-bit, 3.8x compression) — fixes issue #25…
signalnine Mar 29, 2026
965a6ca
feat: asymmetric K/V support + q8_0 × turbo FA kernel instantiations
TheTom Mar 29, 2026
ae70214
fix: turbo4 on GLM-4.7 — context init check accounts for zero-padding…
signalnine Mar 29, 2026
2dd602a
Merge branch 'pr-24' into codex/pr24-integration
TheTom Mar 29, 2026
70b35c7
feat: Boundary V (experimental) — layer-aware V compression
TheTom Mar 29, 2026
89d267c
fix: KV state serialization uses padded tensor width (issue #28 follo…
signalnine Mar 29, 2026
1b7165f
Merge PR #30: KV state serialization fix for padded tensor widths
TheTom Mar 29, 2026
58d51a6
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative ke…
Tuklus Mar 29, 2026
64dd362
Merge PR #31: HIP/ROCm support for turbo3/turbo2 (7900 XTX)
TheTom Mar 30, 2026
adac2c6
Increase turbo3/turbo2 block size from 32 to 128
TheTom Mar 30, 2026
aca4594
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
Mar 30, 2026
7b75078
Merge pull request #32 from HyperionMS2040/fix/cuda-block-size-128
TheTom Mar 30, 2026
b8eccf5
WIP: Add PlanarQuant (planar3) KV cache type — 2D Givens rotation
johndpope Mar 31, 2026
abfb7c8
Add vec flash attention templates for planar3 — decode now works
johndpope Mar 31, 2026
406bfbb
Add IsoQuant (iso3) cache type — quaternion 4D rotation, best quality
johndpope Mar 31, 2026
e11d7e2
Add iso4 and planar4 (4-bit) cache types
johndpope Mar 31, 2026
b345da0
planar4: real Givens rotation (not turbo4 alias). PPL 5085 — poor qua…
johndpope Mar 31, 2026
c301c13
iso4: real quaternion 4D rotation (not turbo4 alias). PPL 74.
johndpope Mar 31, 2026
2c9d286
Use turbo centroids for iso/planar, calibration shows centroids not t…
johndpope Mar 31, 2026
3dee460
Register iso/planar types in llama-context and llama-kv-cache
johndpope Mar 31, 2026
bcfd846
Deferred quantization: allocate K cache as F16 for iso/planar types
johndpope Mar 31, 2026
26c90d6
Add CUDA F16→quantized conversion kernels for planar3/4 and iso3/4
johndpope Mar 31, 2026
25f896f
Double-buffer deferred quantization with CUDA conversion kernels
johndpope Mar 31, 2026
0971ed5
Fix ggml context size for double-buffer, disable conversion (schedule…
johndpope Mar 31, 2026
1ed0453
Add CUDA set_rows kernels for planar3/iso3/planar4/iso4
johndpope Mar 31, 2026
b69ae13
Fix k_stream views, disable conversion (missing CUDA FA dequantize)
johndpope Mar 31, 2026
a75b16f
Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4
johndpope Mar 31, 2026
9d4ece5
COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA
johndpope Mar 31, 2026
e7bde1f
Guard deferred conversion behind GGML_USE_CUDA
johndpope Mar 31, 2026
79da661
Add asymmetric FA kernels: q8_0 K + iso3/planar3 V (and reverse)
johndpope Mar 31, 2026
b719b2e
Fix FA dispatch: static constants, V=f16 check, asymmetric support
johndpope Mar 31, 2026
985fd96
Fix planar3/q8_0 asymmetric: add F16+Q8_0 VEC template for deferred p…
johndpope Mar 31, 2026
b83a09f
All 8 K/V configs working: real Givens/quaternion rotation for planar…
johndpope Mar 31, 2026
a730624
planar3/turbo3: 5x total compression, PPL 10.19 (vs Tom's 3.5x at 10.14)
johndpope Apr 1, 2026
6e5a4aa
Fix symmetric V=planar3/iso3: add inverse rotation to V dequant
johndpope Apr 1, 2026
326f7fb
Add inverse rotation V dequant for planar4/iso4
johndpope Apr 1, 2026
20efe75
Add symmetric planar4/iso4: V dequant, template instances, FA dispatch
johndpope Apr 1, 2026
86d111d
Merge remote-tracking branch 'planarquant/feature/planarquant-kv-cach…
Addy-ad Apr 9, 2026
05355ab
Fix Windows MSVC linker symbols and M_PI compatibility for PlanarQuan…
Addy-ad Apr 9, 2026
700bf5f
M_PI problem in Windows
Addy-ad Apr 9, 2026
fc60e17
Missed a "q" in g_innerq_scale_inv_host
Addy-ad Apr 10, 2026
01a9708
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 10, 2026
32ac93c
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 10, 2026
a6094b0
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 10, 2026
7ecdecb
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 11, 2026
61fba64
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 12, 2026
ef6ea10
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 12, 2026
06781fd
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 12, 2026
e550990
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 13, 2026
8d3756e
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 14, 2026
4088b9a
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 15, 2026
1624323
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 15, 2026
9bade9c
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 16, 2026
941b03b
Type check fix
Addy-ad Apr 16, 2026
75c8890
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 17, 2026
dac3a82
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 18, 2026
d5f8666
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 19, 2026
03c91d8
For hip.h conflicts, turboquant code was kept
Addy-ad Apr 21, 2026
78ebf35
Merge remote-tracking branch 'upstream/master' into addyad-latest
Addy-ad Apr 21, 2026
362 changes: 362 additions & 0 deletions bench-smem-m5-baseline.txt

Large diffs are not rendered by default.

413 changes: 413 additions & 0 deletions bench-smem-m5-smem.txt

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions common/arg.cpp
@@ -391,6 +391,13 @@ const std::vector<ggml_type> kv_cache_types = {
GGML_TYPE_IQ4_NL,
GGML_TYPE_Q5_0,
GGML_TYPE_Q5_1,
GGML_TYPE_TURBO2_0,
GGML_TYPE_TURBO3_0,
GGML_TYPE_TURBO4_0,
GGML_TYPE_PLANAR3_0,
GGML_TYPE_ISO3_0,
GGML_TYPE_PLANAR4_0,
GGML_TYPE_ISO4_0,
};

static ggml_type kv_cache_type_from_str(const std::string & s) {
7 changes: 5 additions & 2 deletions convert_hf_to_gguf.py
@@ -10910,7 +10910,10 @@ def set_vocab(self):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)


if tokenizer is None:
raise RuntimeError(f"Failed to load tokenizer from {self.dir_model}")

# Pad vocab size (from Mamba2Model/GraniteHybridModel)
self.hparams["pad_vocab_size_multiple"] = 8 # Setting this here since GraniteHybridModel.set_vocab() isn't being invoked now.
# From Mamba2Model.set_vocab():
@@ -10922,7 +10925,7 @@ def set_vocab(self):

assert max(tokenizer.vocab.values()) < vocab_size # ty: ignore[unresolved-attribute]

Check warning (GitHub Actions / python type-check): convert_hf_to_gguf.py:10926:60 — unused-ignore-comment: Unused `ty: ignore` directive (remove the unused suppression comment)

tokpre = self.get_vocab_base_pre(tokenizer)
tokpre = self.get_vocab_base_pre(tokenizer) # type: ignore

Check warning (GitHub Actions / python type-check): convert_hf_to_gguf.py:10928:53 — unused-type-ignore-comment: Unused blanket `type: ignore` directive (remove the unused suppression comment)

reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()} # ty: ignore[unresolved-attribute]

Check warning (GitHub Actions / python type-check): convert_hf_to_gguf.py:10930:93 — unused-ignore-comment: Unused `ty: ignore` directive (remove the unused suppression comment)
added_vocab = tokenizer.get_added_vocab() # ty: ignore[unresolved-attribute]

Check warning (GitHub Actions / python type-check): convert_hf_to_gguf.py:10931:52 — unused-ignore-comment: Unused `ty: ignore` directive (remove the unused suppression comment)
22 changes: 20 additions & 2 deletions ggml/include/ggml.h
@@ -428,8 +428,15 @@ extern "C" {
// GGML_TYPE_IQ4_NL_8_8 = 38,
GGML_TYPE_MXFP4 = 39, // MXFP4 (1 block)
GGML_TYPE_NVFP4 = 40, // NVFP4 (4 blocks, E4M3 scale)
GGML_TYPE_Q1_0 = 41,
GGML_TYPE_COUNT = 42,
GGML_TYPE_Q1_0 = 41,
GGML_TYPE_TURBO3_0 = 42, // TurboQuant 3-bit KV cache: 2-bit PolarQuant + 1-bit QJL
GGML_TYPE_TURBO4_0 = 43, // TurboQuant 4-bit KV cache: 3-bit PolarQuant + 1-bit QJL
GGML_TYPE_TURBO2_0 = 44, // TurboQuant 2-bit KV cache: 2-bit PolarQuant (no QJL)
GGML_TYPE_PLANAR3_0 = 45, // PlanarQuant 3-bit KV cache: 2D Givens rotation + 2-bit scalar + 1-bit QJL
GGML_TYPE_ISO3_0 = 46, // IsoQuant 3-bit KV cache: quaternion 4D rotation + 2-bit scalar + 1-bit QJL
GGML_TYPE_PLANAR4_0 = 47, // PlanarQuant 4-bit KV cache: 2D Givens rotation + 3-bit scalar + 1-bit QJL
GGML_TYPE_ISO4_0 = 48, // IsoQuant 4-bit KV cache: quaternion 4D rotation + 3-bit scalar + 1-bit QJL
GGML_TYPE_COUNT = 49,
};

// precision
@@ -561,6 +568,7 @@ extern "C" {
GGML_OP_RWKV_WKV7,
GGML_OP_SOLVE_TRI,
GGML_OP_GATED_DELTA_NET,
GGML_OP_TURBO_WHT,

GGML_OP_UNARY,

@@ -2539,6 +2547,16 @@
struct ggml_tensor * beta,
struct ggml_tensor * state);

// TurboQuant Walsh-Hadamard Transform (O(d log d) rotation for KV cache compression)
// Applies WHT rotation to 128-element groups along ne[0]: sign1 → butterfly → sign2 → normalize
// direction: 0 = forward (signs1 → WHT → signs2), 1 = inverse (signs2 → WHT → signs1)
GGML_API struct ggml_tensor * ggml_turbo_wht(
struct ggml_context * ctx,
struct ggml_tensor * a,
int direction,
int group_size, // 0 = auto (64 or 128 from ne[0])
struct ggml_tensor * scale); // NULL = no InnerQ scaling

// custom operators

typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
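
For readers unfamiliar with the op, the header comment above fully determines the shape of the transform: a ±1 sign flip, an O(d log d) Walsh-Hadamard butterfly over each 128-element group, a second sign flip, and orthonormal scaling. Below is a minimal CPU-side sketch of that pipeline — not the PR's Metal/CUDA kernels; the sign vectors s1/s2 and the helper name are assumptions for illustration.

#include <math.h>

// Sketch only: one 128-element (or 64-element) group, applying
// sign1 -> Walsh-Hadamard butterfly -> sign2 -> 1/sqrt(d) normalization,
// as described in the ggml_turbo_wht() comment above.
static void turbo_wht_group_sketch(float * x, const signed char * s1,
                                   const signed char * s2, int d) {
    for (int i = 0; i < d; i++) {
        x[i] *= s1[i];                      // first sign flip
    }
    for (int len = 1; len < d; len <<= 1) { // log2(d) butterfly stages
        for (int i = 0; i < d; i += 2 * len) {
            for (int j = i; j < i + len; j++) {
                const float a = x[j];
                const float b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    const float inv = 1.0f / sqrtf((float) d);
    for (int i = 0; i < d; i++) {
        x[i] = x[i] * s2[i] * inv;          // second sign flip + normalization
    }
}

A graph-side caller (for example the pre-rotate-queries path mentioned in the commit log) would presumably go through the declaration above, roughly ggml_turbo_wht(ctx, q, /*direction=*/0, /*group_size=*/0, /*scale=*/NULL).
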
5 changes: 5 additions & 0 deletions ggml/src/CMakeLists.txt
@@ -206,6 +206,11 @@ add_library(ggml-base
ggml-threading.h
ggml-quants.c
ggml-quants.h
ggml-turbo-quant.c
ggml-planar-quant.c
ggml-iso-quant.c
ggml-planar4-quant.c
ggml-iso4-quant.c
gguf.cpp)

set_target_properties(ggml-base PROPERTIES
103 changes: 103 additions & 0 deletions ggml/src/ggml-common.h
@@ -277,6 +277,109 @@ typedef struct {
} block_tq2_0;
static_assert(sizeof(block_tq2_0) == sizeof(ggml_half) + QK_K / 4, "wrong tq2_0 block size/padding");

// TurboQuant 3-bit MSE-only: 3-bit PolarQuant indices (no QJL)
// Storage block size = 128 (one block per rotation group)
// Transform group size = 128 (head_dim, for rotation Gaussianization)
// Per block: norm(fp16) + 2-bit indices (32 bytes) + 1-bit extra (16 bytes) = 50 bytes per 128 values
// = 3.125 bits/value → 5.1× compression vs fp16
// The 3-bit index is split: lower 2 bits in qs[], upper 1 bit in signs[]
#define QK_TURBO3 128 // Block size 128: one block per rotation group, eliminates redundant norms
#define QK_TURBO3_GROUP 128 // rotation group size = head_dim
// Derived: FA template nl parameters (auto-scale with block size)
#define NL_TURBO3 (QK_TURBO3 / 16) // non-vec FA iterations per block
#define NL_TURBO3_VEC (QK_TURBO3 / 4) // vec FA iterations per block
typedef struct {
ggml_half norm; // 2 bytes: vector L2 norm (for rescaling)
uint8_t qs[QK_TURBO3 / 4]; // 32 bytes: lower 2-bit indices (4 per byte)
uint8_t signs[QK_TURBO3 / 8]; // 16 bytes: upper 1-bit of 3-bit index (8 per byte)
} block_turbo3_0; // 50 bytes total
static_assert(sizeof(block_turbo3_0) == sizeof(ggml_half) + QK_TURBO3/4 + QK_TURBO3/8, "wrong turbo3_0 block size/padding");
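
To make the 2+1 bit split above concrete, here is an illustrative dequantization of a single element of a block_turbo3_0. The bit ordering within qs[]/signs[] and the 8-entry centroid table are assumptions for illustration only — the PR's calibrated PolarQuant centroids and its exact packing live in the quantization sources; GGML_FP16_TO_FP32 is ggml's usual fp16 helper.

// Illustrative only: reassemble the 3-bit index (low 2 bits from qs[],
// high bit from signs[]) and scale the selected centroid by the block norm.
// Centroid values and bit ordering are placeholders, not the PR's tables.
static inline float turbo3_dequant_one_sketch(const block_turbo3_0 * b, int i) {
    static const float centroids[8] = {
        -0.20f, -0.12f, -0.06f, -0.02f, 0.02f, 0.06f, 0.12f, 0.20f, // placeholders
    };
    const int lo  = (b->qs[i >> 2] >> ((i & 3) * 2)) & 0x3; // lower 2 bits, 4 per byte
    const int hi  = (b->signs[i >> 3] >> (i & 7)) & 0x1;    // upper bit, 8 per byte
    const int idx = (hi << 2) | lo;                         // 3-bit centroid index
    return GGML_FP16_TO_FP32(b->norm) * centroids[idx];
}
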

// TurboQuant 4-bit: 4-bit PolarQuant indices (default) or 3-bit PolarQuant + 1-bit QJL signs (legacy)
// TURBO4_USE_4BIT: switch between 4-bit PolarQuant (new) and 3-bit+QJL (legacy)
// Default: 4-bit on all backends (Metal + CUDA validated)
#ifndef TURBO4_USE_4BIT
# define TURBO4_USE_4BIT 1
#endif

#define QK_TURBO4 128

#if TURBO4_USE_4BIT
// 4-bit PolarQuant: 16 optimal centroids, nibble packed, no QJL
// Per block: norm(fp16) + rnorm(fp16, reserved) + 4-bit indices (64 bytes)
// = 68 bytes per 128 values = 4.25 bits/value → 3.8× compression vs fp16
typedef struct {
ggml_half norm; // 2 bytes
ggml_half rnorm; // 2 bytes (reserved, unused in 4-bit mode)
uint8_t qs[QK_TURBO4 / 2]; // 64 bytes: 4-bit PolarQuant indices (nibble packed)
} block_turbo4_0; // 68 bytes total
static_assert(sizeof(block_turbo4_0) == 68, "wrong turbo4_0 block size");
#else
// Legacy 3-bit PolarQuant + 1-bit QJL (original paper design)
// Per block: norm(fp16) + rnorm(fp16) + 3-bit indices (48 bytes) + 1-bit QJL signs (16 bytes)
// = 68 bytes per 128 values = 4.25 bits/value → 3.8× compression vs fp16
typedef struct {
ggml_half norm; // 2 bytes
ggml_half rnorm; // 2 bytes: residual norm for QJL scale
uint8_t qs[QK_TURBO4 * 3 / 8]; // 48 bytes: 3-bit PolarQuant indices
uint8_t signs[QK_TURBO4 / 8]; // 16 bytes: 1-bit QJL signs
} block_turbo4_0; // 68 bytes total
static_assert(sizeof(block_turbo4_0) == 2*sizeof(ggml_half) + QK_TURBO4*3/8 + QK_TURBO4/8, "wrong turbo4_0 block size");
#endif

static_assert(QK_TURBO4 == 128, "turbo4 kernels assume QK_TURBO4 == 128");
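
In the default TURBO4_USE_4BIT layout, two PolarQuant indices share each qs[] byte. A tiny sketch of the nibble extraction follows; low-nibble-first ordering is an assumption for illustration, and the 16-entry centroid table is not shown.

// Sketch: extract the i-th 4-bit PolarQuant index from a block_turbo4_0
// in the default 4-bit mode. Nibble order is assumed low-first.
static inline int turbo4_index_sketch(const block_turbo4_0 * b, int i) {
    const uint8_t byte = b->qs[i >> 1];
    return (i & 1) ? (byte >> 4) : (byte & 0x0F);
}
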

// TurboQuant 2-bit: 2-bit PolarQuant indices only (no QJL)
// Per block: norm(fp16) + 2-bit indices (32 bytes) = 34 bytes per 128 values
// = 2.125 bits/value → 7.5× compression vs fp16
// 4 centroids (Lloyd-Max for N(0, 1/128)): {-0.133462, -0.039994, 0.039994, 0.133462}
#define QK_TURBO2 128 // Block size 128: one block per rotation group
#define QK_TURBO2_GROUP 128 // rotation group size = head_dim
// Derived: FA template nl parameters (auto-scale with block size)
#define NL_TURBO2 (QK_TURBO2 / 16) // non-vec FA iterations per block
#define NL_TURBO2_VEC (QK_TURBO2 / 4) // vec FA iterations per block
typedef struct {
ggml_half norm; // 2 bytes: corrected L2 norm
uint8_t qs[QK_TURBO2 / 4]; // 32 bytes: 2-bit indices (4 per byte)
} block_turbo2_0; // 34 bytes total
static_assert(sizeof(block_turbo2_0) == sizeof(ggml_half) + QK_TURBO2/4, "wrong turbo2_0 block size/padding");
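
Because the four Lloyd-Max centroids are spelled out in the comment above, a turbo2 block can be dequantized with nothing more than a table lookup and the block norm. A hedged sketch (the intra-byte index order is an assumption):

// Sketch: dequantize one block_turbo2_0 using the centroids listed above.
// Assumes 4 indices per qs[] byte, LSB-first; scaling follows the block layout.
static void turbo2_dequant_block_sketch(const block_turbo2_0 * b, float * y) {
    static const float centroids[4] = { -0.133462f, -0.039994f, 0.039994f, 0.133462f };
    const float norm = GGML_FP16_TO_FP32(b->norm);
    for (int i = 0; i < QK_TURBO2; i++) {
        const int idx = (b->qs[i >> 2] >> ((i & 3) * 2)) & 0x3;
        y[i] = norm * centroids[idx];
    }
}
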

// PlanarQuant 3-bit: 2D Givens rotation + 2-bit quantized + 1-bit QJL
// Same block layout as turbo3 (norm + 2-bit indices + 1-bit signs)
// but uses cos/sin pair rotation instead of WHT
#define QK_PLANAR3 128
#define NL_PLANAR3 (QK_PLANAR3 / 16)
#define NL_PLANAR3_VEC (QK_PLANAR3 / 4)
typedef struct {
ggml_half norm;
uint8_t qs[QK_PLANAR3 / 4];
uint8_t signs[QK_PLANAR3 / 8];
} block_planar3_0;
static_assert(sizeof(block_planar3_0) == sizeof(ggml_half) + QK_PLANAR3/4 + QK_PLANAR3/8, "wrong planar3_0 block size/padding");

#define QK_ISO3 128
#define NL_ISO3 (QK_ISO3 / 16)
#define NL_ISO3_VEC (QK_ISO3 / 4)
typedef struct {
ggml_half norm;
uint8_t qs[QK_ISO3 / 4];
uint8_t signs[QK_ISO3 / 8];
} block_iso3_0;
static_assert(sizeof(block_iso3_0) == sizeof(ggml_half) + QK_ISO3/4 + QK_ISO3/8, "wrong iso3_0 block size/padding");

// PlanarQuant 4-bit and IsoQuant 4-bit: same block layout as turbo4
// norm + rnorm + nibble-packed indices (or 3-bit indices + 1-bit QJL signs in legacy TURBO4_USE_4BIT=0 mode)
#define QK_PLANAR4 128
#define NL_PLANAR4 8
#define NL_PLANAR4_VEC 32
#define QK_ISO4 128
#define NL_ISO4 8
#define NL_ISO4_VEC 32
// Reuse block_turbo4_0 layout: these are typedef aliases
typedef block_turbo4_0 block_planar4_0;
typedef block_turbo4_0 block_iso4_0;


//
// Super-block quantization structures
//
98 changes: 98 additions & 0 deletions ggml/src/ggml-cpu/ggml-cpu.c
@@ -7,6 +7,7 @@
#include "ggml-cpu-impl.h"
#include "ggml-impl.h"
#include "quants.h"
#include "ggml-quants.h"
#include "ggml-threading.h"
#include "unary-ops.h"
#include "binary-ops.h"
@@ -204,6 +205,17 @@ typedef pthread_t ggml_thread_t;
#include <TargetConditionals.h>
#endif

// Forward declarations — defined below, after utility functions
static void ggml_vec_dot_turbo3_0_f32(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc);
static void ggml_vec_dot_turbo2_0_f32(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc);
static void ggml_vec_dot_turbo4_0_f32(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc);

static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
[GGML_TYPE_F32] = {
.from_float = (ggml_from_float_t) ggml_cpu_fp32_to_fp32,
@@ -399,6 +411,24 @@ static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
[GGML_TYPE_I32] = {
.from_float = (ggml_from_float_t) ggml_cpu_fp32_to_i32,
},
[GGML_TYPE_TURBO3_0] = {
.from_float = (ggml_from_float_t) quantize_row_turbo3_0_ref,
.vec_dot = (ggml_vec_dot_t) ggml_vec_dot_turbo3_0_f32,
.vec_dot_type = GGML_TYPE_F32,
.nrows = 1,
},
[GGML_TYPE_TURBO2_0] = {
.from_float = (ggml_from_float_t) quantize_row_turbo2_0_ref,
.vec_dot = (ggml_vec_dot_t) ggml_vec_dot_turbo2_0_f32,
.vec_dot_type = GGML_TYPE_F32,
.nrows = 1,
},
[GGML_TYPE_TURBO4_0] = {
.from_float = (ggml_from_float_t) quantize_row_turbo4_0_ref,
.vec_dot = (ggml_vec_dot_t) ggml_vec_dot_turbo4_0_f32,
.vec_dot_type = GGML_TYPE_F32,
.nrows = 1,
},
};

const struct ggml_type_traits_cpu * ggml_get_type_traits_cpu(enum ggml_type type) {
@@ -2037,6 +2067,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_gated_delta_net(params, tensor);
} break;
case GGML_OP_TURBO_WHT:
{
ggml_compute_forward_turbo_wht(params, tensor);
} break;
case GGML_OP_MAP_CUSTOM1:
{
ggml_compute_forward_map_custom1(params, tensor);
@@ -2217,6 +2251,7 @@ static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads) {
case GGML_OP_COUNT_EQUAL:
case GGML_OP_SOLVE_TRI:
case GGML_OP_GATED_DELTA_NET:
case GGML_OP_TURBO_WHT:
{
n_tasks = n_threads;
} break;
@@ -2935,6 +2970,10 @@ struct ggml_cplan ggml_graph_plan(
const int64_t S_v = node->src[2]->ne[0];
cur = S_v * sizeof(float) * n_tasks;
} break;
case GGML_OP_TURBO_WHT:
{
cur = 0; // no extra workspace needed
} break;
case GGML_OP_COUNT:
{
GGML_ABORT("fatal error");
@@ -3319,6 +3358,65 @@ enum ggml_status ggml_graph_compute_with_ctx(struct ggml_context * ctx, struct g
return ggml_graph_compute(cgraph, &cplan);
}

// TurboQuant3 vec_dot: dequantize turbo3 block to f32, then dot with f32 operand.
// Used by CPU flash attention for models with D not supported by CUDA FA (e.g. D=192).
static void ggml_vec_dot_turbo3_0_f32(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc) {
GGML_ASSERT(nrc == 1);
GGML_UNUSED(bs); GGML_UNUSED(bx); GGML_UNUSED(by); GGML_UNUSED(nrc);

// Dequantize turbo3 to f32 temp buffer, then dot
float tmp[4096]; // max head_dim
GGML_ASSERT(n <= 4096);
ggml_get_type_traits(GGML_TYPE_TURBO3_0)->to_float(vx, tmp, n);

const float * y = (const float *)vy;
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += tmp[i] * y[i];
}
*s = sum;
}

// TurboQuant2 vec_dot: dequantize turbo2 block to f32, then dot with f32 operand.
static void ggml_vec_dot_turbo2_0_f32(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc) {
GGML_ASSERT(nrc == 1);
GGML_UNUSED(bs); GGML_UNUSED(bx); GGML_UNUSED(by); GGML_UNUSED(nrc);

float tmp[4096];
GGML_ASSERT(n <= 4096);
ggml_get_type_traits(GGML_TYPE_TURBO2_0)->to_float(vx, tmp, n);

const float * y = (const float *)vy;
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += tmp[i] * y[i];
}
*s = sum;
}

// TurboQuant4 vec_dot: dequantize turbo4 block to f32, then dot with f32 operand.
static void ggml_vec_dot_turbo4_0_f32(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc) {
GGML_ASSERT(nrc == 1);
GGML_UNUSED(bs); GGML_UNUSED(bx); GGML_UNUSED(by); GGML_UNUSED(nrc);

float tmp[4096];
GGML_ASSERT(n <= 4096);
ggml_get_type_traits(GGML_TYPE_TURBO4_0)->to_float(vx, tmp, n);

const float * y = (const float *)vy;
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += tmp[i] * y[i];
}
*s = sum;
}

void ggml_cpu_fp32_to_fp32(const float * x, float * y, int64_t n) {
memcpy(y, x, n * sizeof(float));
}