
ggml : add CPU TurboQuant KV cache types (TBQ3_0 / TBQ4_0) #21089

Open

elusznik wants to merge 3 commits into ggml-org:master from elusznik:turboquant-cpu-tbq-pr

Conversation

elusznik commented Mar 27, 2026

Summary

This PR adds CPU-only TurboQuant KV-cache support for two new cache types:

  • tbq3_0
  • tbq4_0

The scope is intentionally narrow for the first PR:

  • CPU-only
  • KV-cache types only
  • TBQ only (TBQP / Q-prod is left for follow-up work)

That keeps the initial landing aligned with the contributor guidance for new features and new ggml_type additions: start with CPU support first, keep the PR reviewable, and add backend support in follow-up PRs.

What changed

  • add GGML_TYPE_TBQ3_0 and GGML_TYPE_TBQ4_0
  • add block layouts and CPU quantize / dequantize support
  • add CPU vec_dot support so CPU flash attention can consume the new KV types (a toy sketch of the dequantize-then-dot pattern follows this list)
  • wire the new types into ggml type traits and quantization entry points
  • allow tbq3_0 / tbq4_0 in CLI KV-cache arguments
  • add llama-bench and quantize support for the new types
  • add CPU regression coverage in test-quantize-fns
  • add backend-op coverage for GET_ROWS, SET_ROWS, CPY, and FLASH_ATTN_EXT
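
To give a feel for the CPU vec_dot path, here is a toy sketch of the dequantize-then-dot pattern: decode one block into a scratch buffer, then take an ordinary dot product against the activation row. It is not the PR's actual kernel; the real code packs the 3-/4-bit indices, applies the TurboQuant rotation, and dots against a quantized activation row rather than raw f32. All names, the unpacked index layout, and the placeholder codebook below are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

enum { TOY_QK = 256 };                   // elements per block, as in QK_K

typedef struct {
    float   d;                           // per-block scale (toy: plain float)
    uint8_t qs[TOY_QK];                  // toy: one unpacked codebook index per element
} toy_block_tbq;

static const float toy_codebook3[8] = {  // placeholder 3-bit Lloyd-Max levels
    -1.75f, -1.25f, -0.75f, -0.25f, 0.25f, 0.75f, 1.25f, 1.75f,
};

// dequantize-then-dot: y is the f32 activation row, x holds n/TOY_QK blocks
static float toy_vec_dot_tbq(size_t n, const toy_block_tbq * x, const float * y) {
    float tmp[TOY_QK];
    float sumf = 0.0f;
    for (size_t ib = 0; ib < n / TOY_QK; ++ib) {
        for (size_t j = 0; j < TOY_QK; ++j) {   // 1) decode one block into scratch
            tmp[j] = x[ib].d * toy_codebook3[x[ib].qs[j] & 7];
        }
        for (size_t j = 0; j < TOY_QK; ++j) {   // 2) plain dot product against the row
            sumf += tmp[j] * y[ib * TOY_QK + j];
        }
    }
    return sumf;
}
```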

Why this scope

I started from a broader TurboQuant implementation, but for the first upstream PR I cut the surface down to the part that is strongest on the current CPU-only evaluation:

  • tbq4_0 is the best-balanced TurboQuant point here
  • tbq3_0 is the memory-first option
  • the wider TBQP / split-outlier path is better handled as follow-up work after the plain TBQ CPU base lands

Block layout

  • tbq3_0: 98 bytes / 256 elements = 3.0625 bits / element
  • tbq4_0: 130 bytes / 256 elements = 4.0625 bits / element
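
A minimal sketch of how those sizes decompose, assuming QK_K = 256 and one fp16 scale per block (field names and layout here are illustrative; the actual definitions live in ggml/src/ggml-common.h and may differ in detail):

```c
#include <stdint.h>

typedef uint16_t ggml_half;   // fp16 storage type, as in ggml-common.h

typedef struct {
    ggml_half d;        //   2 bytes: per-block scale
    uint8_t   qs[96];   //  96 bytes: 256 x 3-bit codebook indices, packed
} block_tbq3_0;         //  98 bytes total = 3.0625 bits / element

typedef struct {
    ggml_half d;        //   2 bytes: per-block scale
    uint8_t   qs[128];  // 128 bytes: 256 x 4-bit codebook indices, packed
} block_tbq4_0;         // 130 bytes total = 4.0625 bits / element
```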

CPU results

Model: Qwen3.5-4B-Q4_K_M.gguf

Settings:

  • CPU only
  • 4 threads
  • flash_attn=on
  • llama-bench with pp32/tg8
  • llama-perplexity on wikitext-2-raw/wiki.test.raw
  • ctx=256, chunks=4
| Cache type | Prompt t/s | Gen t/s | KV MiB | Compression vs f16 | PPL | KLD | RMS Δp | Same top p |
|---|---|---|---|---|---|---|---|---|
| f16 | 50.67 | 15.72 | 64.00 | 1.00x | 13.8387 | 0.00000 | 0.000% | 100.000% |
| q8_0 | 50.63 | 15.67 | 34.00 | 1.88x | 13.8348 | 0.00320 | 1.510% | 97.835% |
| q4_0 | 50.46 | 15.64 | 18.00 | 3.56x | 13.8400 | 0.00912 | 2.179% | 93.898% |
| tbq3_0 | 46.19 | 8.29 | 12.25 | 5.22x | 14.3198 | 0.02647 | 4.471% | 91.732% |
| tbq4_0 | 45.84 | 8.31 | 16.25 | 3.94x | 13.8323 | 0.00960 | 2.892% | 94.094% |

Key takeaways:

  • tbq4_0 is the best-balanced TurboQuant point in this CPU-only sweep.
  • tbq4_0 reduces KV cache below stock q4_0 while keeping similar KLD and slightly better perplexity in this run.
  • tbq3_0 pushes KV memory lower again, with the expected quality tradeoff.
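
For reference, the KV MiB column follows directly from the bits per element: f16 stores 16 bits/element (64 MiB here), stock q4_0 stores 18 bytes per 32-element block, i.e. 4.5 bits/element (64 × 4.5/16 = 18 MiB), while tbq4_0 and tbq3_0 store 4.0625 and 3.0625 bits/element (16.25 MiB and 12.25 MiB), which is where the 3.94x and 5.22x ratios come from.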

Plots

  • KV cache memory usage
  • Throughput
  • Compression vs speed
  • Ablation: KV size vs KLD

Validation

Built locally:

  • cmake -S . -B build-cpu-pr -DCMAKE_BUILD_TYPE=Release
  • cmake --build build-cpu-pr --target test-quantize-fns test-backend-ops llama-bench llama-cli llama-perplexity -j4

Checks run:

  • ./build-cpu-pr/bin/test-quantize-fns
  • ./build-cpu-pr/bin/test-backend-ops test -b CPU -o GET_ROWS,SET_ROWS,CPY,FLASH_ATTN_EXT -p 'tbq'
  • llama-bench CPU comparison vs f16, q8_0, q4_0
  • llama-perplexity + KL divergence comparison vs f16
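
For reference, the new types are selected the same way as the existing quantized KV cache types; an illustrative invocation (placeholder model path; check each tool's --help for the exact flag spellings):

  • ./build-cpu-pr/bin/llama-bench -m <model>.gguf -t 4 -p 32 -n 8 -ctk tbq4_0 -ctv tbq4_0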

Follow-up work

Planned follow-ups after this CPU base:

  • TBQP / Q-prod variants
  • split outlier path
  • ROCm backend support
  • CUDA backend support

Acknowledgements

This work was informed by:

AI usage disclosure

AI tools were used in an assistive capacity for exploration, mechanical refactoring, test/benchmark scripting, and draft review text. The code and measurements in this PR were manually reviewed locally, the relevant checks were run manually, and I can explain the submitted changes and benchmark setup in detail.

github-actions bot added the testing, examples, server, and ggml labels Mar 27, 2026
ggml-gh-bot (bot) commented Mar 28, 2026

Hi @elusznik, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@elusznik (Author)

Issue #20977

elusznik marked this pull request as ready for review March 28, 2026 00:06
elusznik requested review from a team, CISC, ggerganov and ngxson as code owners March 28, 2026 00:06
Copilot AI review requested due to automatic review settings March 28, 2026 00:06
Copilot AI left a comment

Pull request overview

This PR introduces two new CPU-only TurboQuant KV-cache ggml types (tbq3_0, tbq4_0) and wires them through ggml’s type system, CPU quantize/dequantize + vec_dot, llama KV/graph handling, tooling, and tests so they can be selected as KV cache formats and consumed by CPU flash-attention.

Changes:

  • Add GGML_TYPE_TBQ3_0 / GGML_TYPE_TBQ4_0 (plus ftype plumbing) with block layouts, quantize/dequantize, and CPU vec_dot support.
  • Update llama KV-cache views + attention graph to handle TBQ tensors (cast + reshape for attention).
  • Expose types in CLI/tools docs and add regression/tests coverage (test-quantize-fns, test-backend-ops).

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| tools/server/README.md | Document tbq3_0/tbq4_0 as allowed KV cache types. |
| tools/quantize/quantize.cpp | Add quantize tool options for TBQ ftypes. |
| tools/llama-bench/llama-bench.cpp | Allow parsing tbq3_0/tbq4_0 type names. |
| tools/completion/README.md | Document TBQ cache types for completion tool args. |
| tools/cli/README.md | Document TBQ cache types for CLI args. |
| tests/test-quantize-fns.cpp | Add TBQ dispatch + table/codebook checks and error thresholds. |
| tests/test-backend-ops.cpp | Add backend-op coverage for TBQ in GET_ROWS/SET_ROWS/CPY/FLASH_ATTN_EXT. |
| src/llama-quant.cpp | Add TBQ ftype/type mapping + fallback behavior. |
| src/llama-kv-cache.cpp | Add TBQ-specific KV views (3D) for K/V retrieval. |
| src/llama-graph.cpp | Cast+reshape TBQ KV tensors to feed flash/non-flash attention. |
| include/llama.h | Add llama ftype enum entries for TBQ. |
| ggml/src/ggml.c | Register TBQ type traits, ftype mapping, and quantize chunk dispatch. |
| ggml/src/ggml-turboq.h | New TurboQuant helper API header. |
| ggml/src/ggml-turboq.c | New TurboQuant helpers + TBQ3/TBQ4 quantize/dequantize implementations. |
| ggml/src/ggml-turboq-tables.h | New TurboQuant Lloyd-Max codebooks/boundaries. |
| ggml/src/ggml-quants.h | Declare TBQ quantize/dequantize entry points. |
| ggml/src/ggml-quants.c | Add row-data validation for TBQ blocks. |
| ggml/src/ggml-cpu/quants.h | Add CPU quantize + vec_dot declarations for TBQ. |
| ggml/src/ggml-cpu/quants.c | Add CPU quantize wrappers and TBQ vec_dot (dequantize-then-dot) fallback. |
| ggml/src/ggml-cpu/ops.cpp | Extend DUP handling for quantized->F16/BF16 and adjust quantized dup flow. |
| ggml/src/ggml-cpu/ggml-cpu.c | Register TBQ CPU type traits (from_float, vec_dot, vec_dot_type). |
| ggml/src/ggml-cpu/arch-fallback.h | Add tbq vec_dot fallback renames for some architectures. |
| ggml/src/ggml-common.h | Define block_tbq3_0 / block_tbq4_0 layouts. |
| ggml/src/CMakeLists.txt | Build and install TurboQuant sources/headers into ggml-base. |
| ggml/include/ggml.h | Add new ggml type + ftype enum values. |
| common/arg.cpp | Allow TBQ types in --cache-type-k/--cache-type-v parsing and help text. |


Review comment threads: ggml/src/ggml-cpu/arch-fallback.h; ggml/src/ggml-cpu/ops.cpp (outdated); tests/test-backend-ops.cpp; common/arg.cpp
animehacker commented Mar 28, 2026

I've been working on extending unixsysdev's tq3_0 implementation with V cache support and flash attention. Repo here: https://github.com/animehacker/llama-turboquant

What this adds on top of unixsysdev's work:

  • Normalization fix (1/32 → 1/√32 for the asymmetric K-side WHT)
  • V cache compression (non-transposed storage + graph-side dequant to work around GGML's element-scatter path)
  • Flash attention with tq3_0 (dequant tq3_0 → F32 → F16 in the attention graph, then use existing FA kernel)
  • CPU backend F32 dequant path for pipeline parallelism

Tested on Llama-3.3-70B-Instruct-Q4_K_M, 2x RTX 3090:

  • 72K context with tq3_0 K+V (4.57x compression)
  • WikiText-2 PPL: 4.40 vs 4.09 baseline (+7.6%)

To be clear: this implements PolarQuant (Stage 1) only — WHT rotation + 3-bit Lloyd-Max. QJL residual correction is not included.

Paper with implementation details: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html

@animehacker

I've been working on a TurboQuant implementation in llama.cpp's GGML framework (CUDA backend, tested on Llama-3.3-70B with 2x RTX 3090s). A few findings that might be useful for the vLLM implementation:

  1. The normalization factor for the WHT needs to be asymmetric: 1/√32 on the K-side during quantization, unnormalized on the Q-side (see the sketch after this list). Using 1/32 (symmetric) produces plausible-looking but semantically broken output that's hard to catch without perplexity benchmarks.
  2. V cache compression is essentially free in terms of quality. In my WikiText-2 benchmarks, K-only compression cost +6.6% PPL while adding V compression on top only added another +1% for 4.57x total compression.
  3. For long context, dequanting to F16 and feeding into flash attention works well. Memory goes from O(n²) to O(n), which is what got us from 16K to 72K context.
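
To make point 1 concrete, here is a minimal sketch of a 32-point fast Walsh-Hadamard transform with the scaling applied only on the K-side path, as described above. This is illustrative C, not code from either branch, and where the 1/√32 factor ultimately belongs depends on how the rest of the pipeline is normalized.

```c
#include <math.h>

// In-place 32-point fast Walsh-Hadamard transform (radix-2 butterflies).
// scale_k_side selects the asymmetric normalization described above:
// 1/sqrt(32) applied when transforming K rows at quantization time,
// no scaling when transforming Q at attention time.
static void fwht32(float v[32], int scale_k_side) {
    for (int len = 1; len < 32; len <<= 1) {
        for (int i = 0; i < 32; i += 2 * len) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j];
                const float b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
    if (scale_k_side) {
        const float s = 1.0f / sqrtf(32.0f);   // 1/sqrt(32), not 1/32
        for (int j = 0; j < 32; ++j) {
            v[j] *= s;
        }
    }
}
```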

Paper with implementation details: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html
Repo: https://github.com/animehacker/llama-turboquant

Happy to compare notes.

@elusznik (Author)

Addressed the actionable points from the Copilot review in 0aae7d78c:

  • fixed TBQ fallback symbol remaps for ARM/RISC-V in ggml-cpu/arch-fallback.h
  • fixed TBQ flash-attn AUTO validation to use merged GQA row widths in llama-context.cpp
  • removed the extra temp-buffer path for quantized -> F32 dup in ggml-cpu/ops.cpp
  • trimmed the redundant CPY test expansion while keeping the new quantized coverage in test-backend-ops.cpp

Re-ran:

  • cmake --build build-cpu-pr --target test-quantize-fns test-backend-ops llama-cli -j4
  • ./build-cpu-pr/bin/test-quantize-fns
  • ./build-cpu-pr/bin/test-backend-ops test -b CPU -o GET_ROWS,SET_ROWS,CPY,FLASH_ATTN_EXT -p 'tbq'
  • llama-bench smoke run with tbq4_0/tbq4_0

@CuriosityQuantified

Hi @elusznik — great work on this PR. I've been running the turboquant-cpu-tbq-pr branch on an Apple Silicon M4 Mac mini (16GB unified memory) and implemented the NEON optimization listed in your follow-up roadmap. Sharing results and code here — I've also pushed a working branch at CuriosityQuantified/llama.cpp:neon-arm-optimization if you want to diff directly.


Build fix: arch-fallback.h missing ARM64 defines (blocks build on aarch64)

Before anything else — the branch does not build on ARM64 without this fix. The __aarch64__ section of ggml/src/ggml-cpu/arch-fallback.h is missing the two TBQ _generic symbol aliases, causing a linker failure:

// ggml/src/ggml-cpu/arch-fallback.h  (around line 81, inside #elif defined(__aarch64__))
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K

NEON kernels

ggml/src/ggml-cpu/quants.c, in ggml_vec_dot_tbq3_0_q8_K_generic and ggml_vec_dot_tbq4_0_q8_K_generic (identical pattern for both):

int j = 0;
#if defined(__ARM_NEON)
float32x4_t acc0 = vdupq_n_f32(0.0f);
float32x4_t acc1 = vdupq_n_f32(0.0f);
for (; j + 7 < QK_K; j += 8) {
    const float32x4_t tv0 = vld1q_f32(tmp + idx + j);
    const float32x4_t tv1 = vld1q_f32(tmp + idx + j + 4);
    const int8x8_t qi = vld1_s8(y[i].qs + j);
    const int16x8_t qi16 = vmovl_s8(qi);
    const float32x4_t qf0 = vcvtq_f32_s32(vmovl_s16(vget_low_s16(qi16)));
    const float32x4_t qf1 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(qi16)));
    acc0 = vfmaq_f32(acc0, tv0, qf0);
    acc1 = vfmaq_f32(acc1, tv1, qf1);
}
sumf += d * vaddvq_f32(vaddq_f32(acc0, acc1));
#endif
// scalar tail
for (; j < QK_K; j++) {
    sumf += tmp[idx + j] * (d * y[i].qs[j]);
}
idx += QK_K;

ggml/src/ggml-turboq.c, in matvec_row and matvec_t (add after existing AVX2 block):

#elif defined(__ARM_NEON)
float32x4_t acc0 = vdupq_n_f32(0.0f);
float32x4_t acc1 = vdupq_n_f32(0.0f);
for (; j + 7 < d; j += 8) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(row + j),     vld1q_f32(x + j));
    acc1 = vfmaq_f32(acc1, vld1q_f32(row + j + 4), vld1q_f32(x + j + 4));
}
for (; j + 3 < d; j += 4) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(row + j), vld1q_f32(x + j));
}
sum += vaddvq_f32(vaddq_f32(acc0, acc1));

Benchmark results — 8K context, Qwen3.5-4B-Q4_K_M, M4 Mac mini, 4 threads

Build: -DCMAKE_BUILD_TYPE=Release, Metal enabled, flash attention on, -nkvo 1 (KV on CPU, model weights on Metal GPU)

| KV type | pp t/s (generic C) | pp t/s (NEON) | delta | compression |
|---|---|---|---|---|
| f16 | 312 | 306 | | 1.0× |
| q4_0 | 307 | 291 | | 3.6× |
| tbq4_0 | 258 | 276 | +18 t/s (+7%) | 3.9× |
| tbq3_0 | 253 | 274 | +21 t/s (+8%) | 5.2× |

The gap to q4_0 narrows from ~50 t/s → ~16 t/s after NEON. The residual cost is the 128×128 Hadamard rotation matmuls (two dense matmuls per 256-element TBQ block); closing that gap fully would require a structured butterfly/WHT transform at the quantization-algorithm level, not a kernel change.
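
For scale: a dense 128×128 rotation is 128² = 16,384 multiply-adds per 128-element sub-block, while a radix-2 fast WHT of the same size would take about 128·log₂(128) = 896 additions, roughly an 18× reduction in arithmetic (back-of-envelope only, assuming the rotation is indeed applied as a full dense matvec as described above).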


Apple Silicon note

Running with -ngl 99 -ctk tbq4_0 (Metal KV offload) crashes — Metal backend does not support SET_ROWS for TBQ types. -nkvo 1 is the workaround (KV stays on CPU, model layers on Metal). Not a blocker for this PR scope, just worth noting for anyone testing on Apple Silicon.


Happy to submit a follow-up PR with these changes once this lands — or fold them in here if you prefer. Let me know what works best.


@elusznik (Author)

Hello @CuriosityQuantified, thanks for your input. Unfortunately I do not have ARM64 experience so I couldn't really do it myself. When it comes to the PR, I think a separate request would be more in line with the contribution guidelines of this project.

PedroRossi added a commit to PedroRossi/llama.cpp that referenced this pull request Mar 30, 2026
Based on PR ggml-org#21089 (CPU TurboQuant by elusznik), this adds CUDA kernel
support for the TBQ3_0 and TBQ4_0 KV cache quantization types.

New files:
- turboq.cu: GPU rotation matrix init, CUDA dequantize/quantize kernels
  - 128 threads/block, shared memory for codebook decode
  - O(d²) rotation matvec per block via global memory
- turboq.cuh: Kernel declarations

Modified files:
- set-rows.cu: Custom TBQ quantize dispatch
- convert.cu: TBQ→F32/F16 row dequantize
- cpy.cu: TBQ→F32/F16 copy (enables GPU-side ggml_cast in attention)
- ggml-cuda.cu: TBQ in SET_ROWS + CPY capability checks
- arch-fallback.h: ARM build fix (missing TBQ vec_dot macros)
- CMakeLists.txt: turboq.cu added to build

Key fix: Adding TBQ types to GGML_OP_CPY capability check enables the
existing ggml_cast() dequantize path in llama-graph.cpp to run on GPU,
improving generation from 2 → 9.5 tok/s (Llama 3B, GTX 1660 Super).

Benchmark (Llama 3.2 3B Q4_K_M, GTX 1660 Super 6GB):
- Prefill: 308 tok/s (4x baseline 75 tok/s)
- Generation: 9.5 tok/s (22% of baseline 42 tok/s)
- Max context: ~98K tokens (2x baseline ~49K)

The O(d²) rotation in dequantize remains the generation bottleneck.
Fused flash attention kernels would eliminate this overhead.
@CuriosityQuantified

@elusznik — understood, separate PR it is. I've opened it against your fork: elusznik#1

It's scoped to just the two changes from my comment above — the arch-fallback.h ARM64 fix and the NEON kernels. No other modifications. Should be easy to review and fold in whenever this lands upstream.

@mihai-chiorean

I have a working CUDA flash attention implementation for TBQ4_0 and TBQ3_0 on DGX Spark (GB10, SM121) and would be interested in contributing it as a follow-on once the CPU types land.

Quick summary of what I have:

  • CUDA FA vec kernels for both TBQ4_0 and TBQ3_0 with native codebook lookups (no dequant-to-fp16)
  • Q rotation via shared memory (once per kernel invocation, O(1) per-token overhead)
  • SET_ROWS kernels for KV cache insertion
  • Tested on Llama-3.1-8B (WikiText-2): tbq4_0/tbq4_0 at +1.07% PPL, tbq4_0/tbq3_0 at +1.80% — both within the 2% quality bar
  • Also validated on MiniMax M2.5 (95GB MoE model)

My current implementation uses QK=128 blocks (different from the QK_K=256 here), so I'd need to adapt for 256-element blocks with 2 rotation sub-groups. Happy to coordinate on that.

Branch for reference: https://github.com/mihai-chiorean/turbo3-cuda/tree/feat/tbq4-cuda-fa-sm121

pwilkin (Member) commented Apr 3, 2026

@ggerganov I think this one is worth a look - good PR, keeps core changes minimal, CPU only, specialized code in separate files. There will be pressure to adopt the TQ3 quants now that Gemma 4 is relatively context hungry 😃

@ggerganov (Member)

@pwilkin I don't know - it looks like pure slop to me. What makes you think it is worth the look from the presented results?

elusznik (Author) commented Apr 3, 2026

> @pwilkin I don't know - it looks like pure slop to me. What makes you think it is worth the look from the presented results?

@ggerganov I don't want to step out of line, but how is it pure slop? I put in some serious work, and the result, a 3.94x memory reduction with a closer-matching top p than q4_0, is something to be discarded?

| Cache type | Prompt t/s | Gen t/s | KV MiB | Compression vs f16 | PPL | KLD | RMS Δp | Same top p |
|---|---|---|---|---|---|---|---|---|
| f16 | 50.67 | 15.72 | 64.00 | 1.00x | 13.8387 | 0.00000 | 0.000% | 100.000% |
| q8_0 | 50.63 | 15.67 | 34.00 | 1.88x | 13.8348 | 0.00320 | 1.510% | 97.835% |
| q4_0 | 50.46 | 15.64 | 18.00 | 3.56x | 13.8400 | 0.00912 | 2.179% | 93.898% |
| tbq4_0 | 45.84 | 8.31 | 16.25 | 3.94x | 13.8323 | 0.00960 | 2.892% | 94.094% |

CISC (Member) commented Apr 3, 2026

> | Cache type | Prompt t/s | Gen t/s | KV MiB | Compression vs f16 | PPL | KLD | RMS Δp | Same top p |
> |---|---|---|---|---|---|---|---|---|
> | q4_0 | 50.46 | 15.64 | 18.00 | 3.56x | 13.8400 | 0.00912 | 2.179% | 93.898% |
> | tbq4_0 | 45.84 | 8.31 | 16.25 | 3.94x | 13.8323 | 0.00960 | 2.892% | 94.094% |

So, half the speed, but negligible quality difference? Is it even against latest master?

elusznik (Author) commented Apr 3, 2026

> So, half the speed, but negligible quality difference? Is it even against latest master?

The speed is obviously reduced due to the matrix multiplication operations being done on CPU. As per the contribution rules, initial commits of a given feature are to be implemented CPU-only. It was against master at the time of submitting the PR (last Sunday).

CISC (Member) commented Apr 3, 2026

> > So, half the speed, but negligible quality difference? Is it even against latest master?
>
> The speed is obviously reduced due to the matrix multiplication operations being done on CPU. As per the contribution rules, initial commits of a given feature are to be implemented CPU-only. It was against master at the time of submitting the PR (last Sunday).

Try latest master.

elusznik (Author) commented Apr 3, 2026

> > > So, half the speed, but negligible quality difference? Is it even against latest master?
> >
> > The speed is obviously reduced due to the matrix multiplication operations being done on CPU. As per the contribution rules, initial commits of a given feature are to be implemented CPU-only. It was against master at the time of submitting the PR (last Sunday).
>
> Try latest master.

so resubmit the PR today and close this one?

CISC (Member) commented Apr 3, 2026

> > Try latest master.
>
> so resubmit the PR today and close this one?

No, I mean do the tests again on latest master, q4_0 kv-cache has improved.

Mushoz commented Apr 3, 2026

Master now already applies rotations on the KV cache before quantization, improving the quality of all the regular quantization types. And since TurboQuant is barely any better than the regular quants without rotation, latest master might actually show better performance with the regular quants. Worth testing.

I don't think you will have to close this PR. If TurboQuant still shows improvements you could simply rebase this PR on latest master and solve merge conflicts (if any).

pwilkin (Member) commented Apr 3, 2026

@ggerganov I think it's worth looking at contributions that adhere to the contribution standards, something that to my knowledge only this PR out of all the TurboQuant PRs has done. While the results right now might be a bit underwhelming, they're (a) close to mainline (so possibly can be improved) and (b) even if the TQ4 quant is not worth it, the TQ3 quant might be worth considering. But I'm far from an expert on quants, that's why I asked :)

ericcurtin (Collaborator) commented Apr 4, 2026

My 2 cents on this PR, not that it matters: it's a self-contained change, doesn't negatively affect the rest of the codebase, gets a thumbs up from me.

I did start another project, not sure how long it will last:

https://github.com/ericcurtin/inferrs

It has a different vibe. The goal is to get llama.cpp-like functionality via:

inferrs run --quantize google/gemma-4-E2B-it

Get vllm-like functionality via:

inferrs run --paged-attention google/gemma-4-E2B-it

And the default is closest to candle:

inferrs run google/gemma-4-E2B-it

would love PRs like this in that project. But I think this has a half-decent chance of getting merged here (and I understand the hesitancy, there's a lot of code to maintain here in the project as a whole).

@ericcurtin (Collaborator)

> | Cache type | Prompt t/s | Gen t/s | KV MiB | Compression vs f16 | PPL | KLD | RMS Δp | Same top p |
> |---|---|---|---|---|---|---|---|---|
> | q4_0 | 50.46 | 15.64 | 18.00 | 3.56x | 13.8400 | 0.00912 | 2.179% | 93.898% |
> | tbq4_0 | 45.84 | 8.31 | 16.25 | 3.94x | 13.8323 | 0.00960 | 2.892% | 94.094% |
>
> So, half the speed, but negligible quality difference? Is it even against latest master?

This is fair feedback.

elusznik (Author) commented Apr 4, 2026

@CISC Did a new benchmark run on today's master 650bf14eb

Qwen3.5-4B-Q4_K_M, CPU-only, 4 threads, flash attention enabled

Perplexity & KL Divergence (wikitext-2 test, 5 chunks, ctx=512)

| Cache type | PPL | Δ vs f16 | Mean KLD | RMS Δp | Same top p |
|---|---|---|---|---|---|
| f16 | 9.027 | | | | |
| q4_0 | 9.047 | +0.22% | 0.9165 | 16.065% | 85.49% |
| tbq4_0 | 9.046 | +0.21% | 0.9089 | 16.137% | 85.88% |
| tbq3_0 | 9.178 | +1.67% | 0.9166 | 16.257% | 83.06% |

Decode throughput (p=256, n=64, 3 repetitions)

| Cache type | Gen t/s | KV compression |
|---|---|---|
| f16 | 14.22 | 1.00x |
| q8_0 | 14.12 | 1.88x |
| q4_0 | 14.09 | 3.56x |
| tbq4_0 | 6.68 | 3.94x |
| tbq3_0 | 6.74 | 5.19x |

  • tbq4_0 matches q4_0 quality — PPL is essentially identical (9.046 vs 9.047) and same top p is slightly better (85.88% vs 85.49%), while providing ~10% more compression (3.94x vs 3.56x). The gap is far smaller than before @ggerganov's rotation PR, but it is still there.
  • tbq3_0 is a viable high-compression option — 5.19x compression with only a +1.67% PPL delta and 83% same top p.

Not a huge breakthrough by any means, but still some improvement. Since this is a CPU-only implementation per the contribution rules, throughput is still something to be addressed by CUDA/ROCm kernels, so I hope the potential is visible.

@JohannesGaessler (Contributor)

> I think it's worth looking at contributions that adhere to the contribution standards, something that to my knowledge only this PR out of all the TurboQuant PRs has done.

If a PR blatantly violates the contributing guidelines it should be closed immediately but that does not automatically mean that a PR following them is worth reviewing in terms of opportunity cost.

In any case, if you just look at the numbers from the OP, the tbq4_0 type isn't even really better than the existing naive q4_0 type. That alone is, I think, enough of an argument as to why we should not invest maintainer time here.

@ServeurpersoCom (Contributor)

Look at 744c0c7 this is art

@JohnAlcatraz

The numbers for tbq3_0 look great, so I hope this will be merged.

ekryski commented Apr 5, 2026

fwiw there are those of us (@TheTom, myself, and many others) who have been, and continue to be, working on GPU-optimized implementations. The CPU one is by far going to be the most underwhelming if you're comparing benchmarks.

Yes, upstream now does the rotation, which is a lot of the work of TurboQuant (actually PolarQuant, if you don't include QJL, which kills speed, but whatever). This was a good call-out by @Mushoz! However, as @TheTom mentions above, there are many other improvements that make getting this started worthwhile. Asymmetric KV compression is actually massive and works with or without TurboQuant. In mine, Tom's, and other people's extensive testing it is holding up very well - a massive improvement in quantization, improvement in prefill and decode speed, with very little degradation in (or even improved) PPL and KLD. Truly a thing of beauty that @TheTom discovered there last week.

I very much respect @elusznik's intent to stick to guidelines and make this a minimal change in order to get the ball rolling. It's intimidating coming into such a popular repo with so much churn so respect and appreciation where it's due. With all due respect @ggerganov (and I have an immense amount), shitting on it and just calling it slop without any actual critique is just rude and makes you look like a jerk. I understand there's a lot of AI shit PRs around but this ain't one of them.

My two cents: I think it is in everyone's best interest (the OS AI community's) to help ensure that we don't end up with a ton of forks with disparate but material performance improvements littered about.

Hoping this lands (or something similar) and some of us can layer on discrete upstream PRs for some of the further decode, prefill and memory improvements we've identified.

@ZhaoanTan

It looks good, waiting hungrily for the code to be merged.

@emircanerkul

No news so far? Maybe I expect too much, but seeing only a 10% improvement made me sad, although it looks like this is CPU-only for now. I want to use it on a 6800 XT GPU. I'm currently using gemma-4-26B-A4B-it-UD-IQ4_XS.gguf and just want to get the best results from it. And indeed, local LLMs are getting better each day, which is great news.

elusznik (Author) commented Apr 6, 2026

@ekryski thanks for the endorsement and the kind words; the slop comment made me feel kinda bad after putting a couple of days' serious work into this.

@Green-Sky (Collaborator)

Instead of looking at new quants, you can take a look at existing quants for kv cache here: #21551

Green-Sky (Collaborator) commented Apr 7, 2026

> Asymmetric K/V is critical for some models
>
> This is the biggest practical finding from my testing. Symmetric turbo (same type for K and V) is catastrophic on certain model families with Q4_K_M weights:

While looking at existing quants, I found an outlier kv quant pair that seems to perform better than its quant neighbors: q3_K for K and q2_K for V.

w00jay added a commit to w00jay/llama.cpp-turoquant-research that referenced this pull request Apr 14, 2026
Research of TurboQuant paper, QJL reference code, and community implementations
reveals critical insights:
- QJL should be dropped entirely (MSE-only beats MSE+QJL in practice)
- Nobody uses TurboQuant for V (all use group quant or fp16)
- Without QJL, TBQ4 gets 16 centroids (matching q4_0 level count)
- PR ggml-org#21089 got PPL=9.046 (matching our 9.53) — gap vs q4_0 is expected
- The paper reports LongBench/NIAH, not perplexity

Also adds build time tracking log documenting CUDA template compilation
issues (ptxas uses 36GB+ RAM, 2+ hours for fattn.cu with VEC templates).
MrLordCat pushed a commit to MrLordCat/llama.cpp-with-GUI that referenced this pull request Apr 14, 2026