4-bit KV-cache compression using 64 block-diagonal 2D Givens rotations. Built on top of the TurboQuant llama.cpp fork; targets models with large head dimensions (D=256, D=512) such as Gemma 4 26B A4B.
On a 16 GB RTX 5080 with gemma4 26B A4B at ctx = 32K:
| K cache | V cache | tg128 @ 32K | VRAM regime |
|---|---|---|---|
| f16 | f16 | 59.45 ± 2.31 | partial CPU offload |
| q8_0 | turbo4 | 54.57 ± 0.08 | partial CPU offload |
| turbo4 | turbo4 | 40.61 ± 0.39 | partial CPU offload (worst) |
| q8_0 | pq4 | 136.86 ± 1.07 | fully GPU-resident ✓ |
q8_0/pq4 is +130.2% over f16/f16, +150.8% over q8_0/turbo4. All three alternatives partial-offload V to host at ctx=32K and collapse to 41–59 t/s. pq4's KV is the only one that fits in the remaining ~1.85 GiB headroom.
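The headline deltas are plain ratios of the tg128 numbers in the table above; a quick sanity check:

```python
# tg128 (tokens/s) at ctx=32K, from the table above
tg = {"f16/f16": 59.45, "q8_0/turbo4": 54.57, "turbo4/turbo4": 40.61, "q8_0/pq4": 136.86}

def gain(base):
    """Percent speedup of q8_0/pq4 over a baseline config."""
    return round((tg["q8_0/pq4"] / tg[base] - 1) * 100, 1)

print(gain("f16/f16"))      # 130.2
print(gain("q8_0/turbo4"))  # 150.8
```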
Quality (Phi-3.5-MoE Q2_K, wikitext-2): Δ PPL = −0.06% at ctx=512, −0.22% at ctx=2048. Noise-equivalent to f16. V-cache footprint −60.5%.
Note: The 32K win is VRAM-fit-driven, not kernel speed. On a 24 GB card where all configs stay resident, turbo4 beats pq4 at 2K–8K by ~5–6%. See BENCHMARKS-PQ4.md for the full 3-way ctx sweep.
Block layout: 128 elements / 66 bytes = 4.125 bits per value (~3.88× smaller than f16, ~2.06× smaller than q8_0).
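The 66-byte figure is consistent with 128 packed 4-bit codes (64 bytes) plus a 2-byte per-block scale; the exact field layout lives in `pq4.cuh`, so treat the split below as an illustrative assumption rather than the shipped struct. The arithmetic checks out either way:

```python
BLOCK_ELEMS = 128                 # values per pq4 block
CODE_BITS = 4                     # 4-bit codes
SCALE_BYTES = 2                   # assumed: one 2-byte (f16) scale per block

code_bytes = BLOCK_ELEMS * CODE_BITS // 8   # 64
block_bytes = code_bytes + SCALE_BYTES      # 66
bits_per_value = block_bytes * 8 / BLOCK_ELEMS

f16_bytes = BLOCK_ELEMS * 2                 # f16: 2 bytes/value
q8_0_bytes = (BLOCK_ELEMS // 32) * 34       # q8_0: 32 values -> 34 bytes (f16 scale + 32 int8)

print(bits_per_value)                       # 4.125
print(round(f16_bytes / block_bytes, 2))    # 3.88  (vs f16)
print(round(q8_0_bytes / block_bytes, 2))   # 2.06  (vs q8_0)
```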
Each 128-element block is pre-processed with 64 2D Givens rotations that decorrelate adjacent pairs before quantization. At inference, instead of inverting all 64 rotations per V load (O(D × n_kv) per token), PairQuant uses lazy inverse Givens (Option b): the centroid-weighted accumulation happens in the rotated frame, then a single O(D) Givens pass converts the final VKQ vector back to original space. This works because the rotation is linear and the accumulated sum distributes through it.
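The correctness of the lazy pass rests only on linearity: a weighted sum of rotated-back vectors equals the rotated-back weighted sum. A minimal numeric sketch (toy dimensions and made-up angles, not the kernel's actual layout):

```python
import math, random

random.seed(0)
D = 8        # toy head dim; the real kernels use D = 128/256/512
N = 16       # toy number of cached V rows
thetas = [random.uniform(0, math.pi) for _ in range(D // 2)]  # one angle per adjacent pair

def inv_givens(v):
    """Undo the block-diagonal pairwise Givens rotation (one 2x2 rotation per pair)."""
    out = list(v)
    for p, t in enumerate(thetas):
        a, b = out[2 * p], out[2 * p + 1]
        c, s = math.cos(t), math.sin(t)
        out[2 * p]     =  c * a + s * b   # inverse of [[c,-s],[s,c]] is its transpose
        out[2 * p + 1] = -s * a + c * b
    return out

rows = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]  # V rows, rotated frame
w = [random.random() for _ in range(N)]                            # attention weights

# eager: invert the rotation on every V load, then accumulate -> O(D * n_kv) rotations
inv_rows = [inv_givens(r) for r in rows]
eager = [sum(w[i] * inv_rows[i][d] for i in range(N)) for d in range(D)]

# lazy: accumulate in the rotated frame, one O(D) inverse pass on the final VKQ vector
acc = [sum(w[i] * rows[i][d] for i in range(N)) for d in range(D)]
lazy = inv_givens(acc)

assert all(abs(e - l) < 1e-9 for e, l in zip(eager, lazy))
```

The same identity is what lets the FA kernel keep the per-load path strictly thread-local: no rotation work happens inside the softmax/accumulate loop at all.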
| Property | TurboQuant | PairQuant |
|---|---|---|
| Transform | Walsh-Hadamard (global) | Block-diagonal 2D Givens (local pairs) |
| Bits/value | ~4 bpv | 4.125 bpv |
| Post-process | warp-cooperative WHT butterfly | strictly thread-local Givens pairs |
| Prefill throughput | ✓ good | ✗ slower (rotate per load) |
| Decode @ 32K / 16 GB | overflows VRAM | fits — only config that does |
They are complementary: turbo4 wins at ctx ≤ 8K on cards with VRAM headroom; pq4 is the only viable choice at ctx ≥ 16K on 16 GB-class cards with large models.
git clone https://github.com/dknos/pairquant
cd pairquant
cmake -B build -S . -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89 # or 120 for Blackwell
cmake --build build --config Release -j$(nproc)

build/bin/llama-server \
-m gemma4-Q4_0_imatrix.gguf \
--cache-type-k q8_0 --cache-type-v pq4 \
--flash-attn on --ctx-size 32768 \
    --n-gpu-layers 99 --host 0.0.0.0 --port 8080

Or with llama-bench to reproduce headline numbers:
build/bin/llama-bench -m gemma4-Q4_0_imatrix.gguf \
-ctk q8_0 -ctv pq4 -fa 1 -ngl 99 \
    -d 0 -d 1536 -d 7680 -d 31744 -r 3

| File | Change |
|---|---|
| `ggml/src/ggml-cuda/pq4.cuh` | 2D Givens quant/dequant, block layout |
| `ggml/src/ggml-cuda/fattn-vec-pq4.cu` | FA vec kernel instantiations (D=128/256/512) |
| `ggml/src/ggml-cuda/set-rows.cu` | `k_set_rows` write path for `GGML_TYPE_PQ4_0` |
| `ggml/src/ggml-cuda/fattn.cu` | `FATTN_VEC_CASE` dispatch for pq4 |
| `ggml/src/ggml-cuda/fattn-vec.cuh` | `pq4_post_process_VKQ` hook |
| `ggml/src/ggml-utils.cpp` | `dequantize_row_pq4_0`, CPU path |
| `ggml/include/ggml.h` | `GGML_TYPE_PQ4_0` enum entry |
| `tests/test_pq4.c` | 14 unit tests (round-trip, cosine sim, CPU perf) |
| `BENCHMARKS-PQ4.md` | Full ctx-sweep data, quality results |
- Cosine similarity (round-trip synthetic): 0.9952
- L2 norm ratio: 0.9998
- Wikitext-2 PPL vs f16/f16 (Phi-3.5-MoE Q2_K): −0.06% at ctx=512, −0.22% at ctx=2048
- 14/14 unit tests pass (`tests/test_pq4.c`)
For gemma4 IT model PPL behavior see BENCHMARKS-PQ4.md § Quality and llama.cpp#14437.
- CPU dequant + unit tests
- CUDA FA vec kernel (D=128, D=256, D=512)
- Lazy inverse Givens post-process (Option b)
- Write kernel (`k_set_rows`)
- `FATTN_VEC_CASE` dispatch (both `FA_ALL_QUANTS` and default build)
- wikitext-2 perplexity validation
- pq3_0 (3-bit, 3.125 bpv) — CPU scaffold committed, CUDA not yet ported
- Option a re-verification on a clean GPU (measure Option b vs Option a headroom)
CUDA kernel development used Claude Code (claude-opus-4-6 / claude-sonnet-4-6). All builds, benchmarks, and validation runs were done by the repository author.