PairQuant (pq4_0) — 4-bit KV-cache compression for llama.cpp

4-bit KV-cache compression using 64 block-diagonal 2D Givens rotations. Built on top of the TurboQuant llama.cpp fork; targets models with large head dimensions (D=256, D=512) such as Gemma 4 26B A4B.

Headline result

On a 16 GB RTX 5080 with gemma4 26B A4B at ctx = 32K:

K cache | V cache | tg128 @ 32K (t/s) | VRAM regime
--------|---------|-------------------|------------
f16     | f16     | 59.45 ± 2.31      | partial CPU offload
q8_0    | turbo4  | 54.57 ± 0.08      | partial CPU offload
turbo4  | turbo4  | 40.61 ± 0.39      | partial CPU offload (worst)
q8_0    | pq4     | 136.86 ± 1.07     | fully GPU-resident

q8_0/pq4 is +130.2% over f16/f16, +150.8% over q8_0/turbo4. All three alternatives partial-offload V to host at ctx=32K and collapse to 41–59 t/s. pq4's KV is the only one that fits in the remaining ~1.85 GiB headroom.

Quality (Phi-3.5-MoE Q2_K, wikitext-2): Δ PPL = −0.06% at ctx=512, −0.22% at ctx=2048. Noise-equivalent to f16. V-cache footprint −60.5%.

Note: The 32K win is VRAM-fit-driven, not kernel speed. On a 24 GB card where all configs stay resident, turbo4 beats pq4 at 2K–8K by ~5–6%. See BENCHMARKS-PQ4.md for the full 3-way ctx sweep.

How it works

Block layout: 128 elements packed into 66 bytes = 4.125 bits per value (~3.88× smaller than f16, ~2.06× smaller than q8_0).
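The size ratios can be checked directly from the block dimensions. A sketch, assuming (this split is not stated in the repo) that the 66 bytes are 64 bytes of packed nibbles plus 2 bytes of per-block metadata such as an f16 scale:

```python
# Bits-per-value arithmetic for the pq4_0 block layout described above.
# Assumption: 66 bytes = 64 bytes of packed 4-bit values (128 nibbles)
# + 2 bytes of per-block metadata; the exact split is an illustration.
BLOCK_ELEMS = 128
BLOCK_BYTES = 66

bpv = BLOCK_BYTES * 8 / BLOCK_ELEMS
print(f"{bpv} bits/value")                  # 4.125

vs_f16 = 16 / bpv                           # f16 stores 16 bits/value
print(f"{vs_f16:.2f}x smaller than f16")    # 3.88x

q8_0_bpv = 34 * 8 / 32                      # q8_0: 32 elements in 34 bytes = 8.5 bpv
vs_q8_0 = q8_0_bpv / bpv
print(f"{vs_q8_0:.2f}x smaller than q8_0")  # 2.06x
```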

Each 128-element block is pre-processed with 64 2D Givens rotations that decorrelate adjacent pairs before quantization. At inference, instead of inverting all 64 rotations per V load (O(D × n_kv) per token), PairQuant uses lazy inverse Givens (Option b): the centroid-weighted accumulation happens in the rotated frame, then a single O(D) Givens pass converts the final VKQ vector back to original space. This works because the rotation is linear and the accumulated sum distributes through it.

Compared to TurboQuant (Walsh-Hadamard)

Property             | TurboQuant                     | PairQuant
---------------------|--------------------------------|----------
Transform            | Walsh-Hadamard (global)        | Block-diagonal 2D Givens (local pairs)
Bits/value           | ~4 bpv                         | 4.125 bpv
Post-process         | warp-cooperative WHT butterfly | strictly thread-local Givens pairs
Prefill throughput   | ✓ good                         | ✗ slower (rotate per load)
Decode @ 32K / 16 GB | overflows VRAM                 | fits — only config that does

They are complementary: turbo4 wins at ctx ≤ 8K on cards with VRAM headroom; pq4 is the only viable choice at ctx ≥ 16K on 16 GB-class cards with large models.

Build

git clone https://github.com/dknos/pairquant
cd pairquant
cmake -B build -S . -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89  # or 120 for Blackwell
cmake --build build --config Release -j$(nproc)

Run with pq4 V cache

build/bin/llama-server \
  -m gemma4-Q4_0_imatrix.gguf \
  --cache-type-k q8_0 --cache-type-v pq4 \
  --flash-attn on --ctx-size 32768 \
  --n-gpu-layers 99 --host 0.0.0.0 --port 8080

Or with llama-bench to reproduce headline numbers:

build/bin/llama-bench -m gemma4-Q4_0_imatrix.gguf \
  -ctk q8_0 -ctv pq4 -fa 1 -ngl 99 \
  -d 0 -d 1536 -d 7680 -d 31744 -r 3

Files added / modified

File                                | Change
------------------------------------|-------
ggml/src/ggml-cuda/pq4.cuh          | 2D Givens quant/dequant, block layout
ggml/src/ggml-cuda/fattn-vec-pq4.cu | FA vec kernel instantiations (D=128/256/512)
ggml/src/ggml-cuda/set-rows.cu      | k_set_rows write path for GGML_TYPE_PQ4_0
ggml/src/ggml-cuda/fattn.cu         | FATTN_VEC_CASE dispatch for pq4
ggml/src/ggml-cuda/fattn-vec.cuh    | pq4_post_process_VKQ hook
ggml/src/ggml-utils.cpp             | dequantize_row_pq4_0, CPU path
ggml/include/ggml.h                 | GGML_TYPE_PQ4_0 enum entry
tests/test_pq4.c                    | 14 unit tests (round-trip, cosine sim, CPU perf)
BENCHMARKS-PQ4.md                   | Full ctx-sweep data, quality results

Quality

  • Cosine similarity (round-trip synthetic): 0.9952
  • L2 norm ratio: 0.9998
  • Wikitext-2 PPL vs f16/f16 (Phi-3.5-MoE Q2_K): −0.06% at ctx=512, −0.22% at ctx=2048
  • 14/14 unit tests pass (tests/test_pq4.c)
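The two round-trip metrics above are standard definitions; a sketch of how they are computed (this is not the actual harness in tests/test_pq4.c, and the noise model below is a stand-in for real quantization error):

```python
import numpy as np

def roundtrip_metrics(x, x_hat):
    """Cosine similarity and L2-norm ratio between an original vector
    and its quantize -> dequantize reconstruction."""
    cos = np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))
    l2_ratio = np.linalg.norm(x_hat) / np.linalg.norm(x)
    return cos, l2_ratio

# Toy stand-in: the reported numbers (0.9952 / 0.9998) come from the real
# pq4_0 quantize/dequantize round-trip, not from this synthetic noise.
rng = np.random.default_rng(1)
x = rng.standard_normal(128)
x_hat = x + 0.05 * rng.standard_normal(128)   # simulated reconstruction error
cos, ratio = roundtrip_metrics(x, x_hat)
print(f"cosine={cos:.4f}  l2_ratio={ratio:.4f}")
```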

For gemma4 IT model PPL behavior see BENCHMARKS-PQ4.md § Quality and llama.cpp#14437.

Status

  • CPU dequant + unit tests
  • CUDA FA vec kernel (D=128, D=256, D=512)
  • Lazy inverse Givens post-process (Option b)
  • Write kernel (k_set_rows)
  • FATTN_VEC_CASE dispatch (both FA_ALL_QUANTS and default build)
  • wikitext-2 perplexity validation
  • pq3_0 (3-bit, 3.125 bpv) — CPU scaffold committed, CUDA not yet ported
  • Option a re-verification on clean GPU (Option b vs Option a headroom)

Transparency

CUDA kernel development used Claude Code (claude-opus-4-6 / claude-sonnet-4-6). All builds, benchmarks, and validation runs were done by the repository author.
