4-bit KV-cache compression using 64 block-diagonal 2D Givens rotations. Built on top of the TurboQuant llama.cpp fork; targets models with large head dimensions (D=256, D=512) such as Gemma 4 26B A4B.
On a 16 GB RTX 5080 with gemma4 26B A4B at ctx = 32K:
| K cache | V cache | tg128 @ 32K | VRAM regime |
|---|---|---|---|
| f16 | f16 | 59.45 ± 2.31 | partial CPU offload |
| q8_0 | turbo4 | 54.57 ± 0.08 | partial CPU offload |
| turbo4 | turbo4 | 40.61 ± 0.39 | partial CPU offload (worst) |
| q8_0 | pq4 | 136.86 ± 1.07 | fully GPU-resident ✓ |
q8_0/pq4 is +130.2% over f16/f16, +150.8% over q8_0/turbo4. All three alternatives partial-offload V to host at ctx=32K and collapse to 41–59 t/s. pq4's KV is the only one that fits in the remaining ~1.85 GiB headroom.
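The headline deltas are plain ratios of the tg128 numbers in the table above; a quick sanity check:

```python
# tg128 (tokens/s) at ctx=32K, from the table above
tg = {"f16/f16": 59.45, "q8_0/turbo4": 54.57, "turbo4/turbo4": 40.61, "q8_0/pq4": 136.86}

def gain(base):
    """Percent speedup of q8_0/pq4 over a baseline config."""
    return round((tg["q8_0/pq4"] / tg[base] - 1) * 100, 1)

print(gain("f16/f16"))      # 130.2
print(gain("q8_0/turbo4"))  # 150.8
```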
Quality (Phi-3.5-MoE Q2_K, wikitext-2): Δ PPL = −0.06% at ctx=512, −0.22% at ctx=2048. Noise-equivalent to f16. V-cache footprint −60.5%.
Note: The 32K win is VRAM-fit-driven, not kernel speed. On a 24 GB card where all configs stay resident, turbo4 beats pq4 at 2K–8K by ~5–6%. See BENCHMARKS-PQ4.md for the full 3-way ctx sweep.
Block layout: 128 elements / 66 bytes = 4.125 bits per value (~3.88× smaller than f16, ~2.06× smaller than q8_0).
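The 66-byte figure is consistent with 128 packed 4-bit codes (64 bytes) plus a 2-byte per-block scale; the exact field layout lives in `pq4.cuh`, so treat the split below as an illustrative assumption rather than the shipped struct. The arithmetic checks out either way:

```python
BLOCK_ELEMS = 128                 # values per pq4 block
CODE_BITS = 4                     # 4-bit codes
SCALE_BYTES = 2                   # assumed: one 2-byte (f16) scale per block

code_bytes = BLOCK_ELEMS * CODE_BITS // 8   # 64
block_bytes = code_bytes + SCALE_BYTES      # 66
bits_per_value = block_bytes * 8 / BLOCK_ELEMS

f16_bytes = BLOCK_ELEMS * 2                 # f16: 2 bytes/value
q8_0_bytes = (BLOCK_ELEMS // 32) * 34       # q8_0: 32 values -> 34 bytes (f16 scale + 32 int8)

print(bits_per_value)                       # 4.125
print(round(f16_bytes / block_bytes, 2))    # 3.88  (vs f16)
print(round(q8_0_bytes / block_bytes, 2))   # 2.06  (vs q8_0)
```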
Each 128-element block is pre-processed with 64 2D Givens rotations that decorrelate adjacent pairs before quantization. At inference, instead of inverting all 64 rotations per V load (O(D × n_kv) per token), PairQuant uses lazy inverse Givens (Option b): the centroid-weighted accumulation happens in the rotated frame, then a single O(D) Givens pass converts the final VKQ vector back to original space. This works because the rotation is linear and the accumulated sum distributes through it.
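The correctness of the lazy pass rests only on linearity: a weighted sum of rotated-back vectors equals the rotated-back weighted sum. A minimal numeric sketch (toy dimensions and made-up angles, not the kernel's actual layout):

```python
import math, random

random.seed(0)
D = 8        # toy head dim; the real kernels use D = 128/256/512
N = 16       # toy number of cached V rows
thetas = [random.uniform(0, math.pi) for _ in range(D // 2)]  # one angle per adjacent pair

def inv_givens(v):
    """Undo the block-diagonal pairwise Givens rotation (one 2x2 rotation per pair)."""
    out = list(v)
    for p, t in enumerate(thetas):
        a, b = out[2 * p], out[2 * p + 1]
        c, s = math.cos(t), math.sin(t)
        out[2 * p]     =  c * a + s * b   # inverse of [[c,-s],[s,c]] is its transpose
        out[2 * p + 1] = -s * a + c * b
    return out

rows = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]  # V rows, rotated frame
w = [random.random() for _ in range(N)]                            # attention weights

# eager: invert the rotation on every V load, then accumulate -> O(D * n_kv) rotations
inv_rows = [inv_givens(r) for r in rows]
eager = [sum(w[i] * inv_rows[i][d] for i in range(N)) for d in range(D)]

# lazy: accumulate in the rotated frame, one O(D) inverse pass on the final VKQ vector
acc = [sum(w[i] * rows[i][d] for i in range(N)) for d in range(D)]
lazy = inv_givens(acc)

assert all(abs(e - l) < 1e-9 for e, l in zip(eager, lazy))
```

The same identity is what lets the FA kernel keep the per-load path strictly thread-local: no rotation work happens inside the softmax/accumulate loop at all.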
| Property | TurboQuant | PairQuant |
|---|---|---|
| Transform | Walsh-Hadamard (global) | Block-diagonal 2D Givens (local pairs) |
| Bits/value | ~4 bpv | 4.125 bpv |
| Post-process | warp-cooperative WHT butterfly | strictly thread-local Givens pairs |
| Prefill throughput | ✓ good | ✗ slower (rotate per load) |
| Decode @ 32K / 16 GB | overflows VRAM | fits — only config that does |
They are complementary: turbo4 wins at ctx ≤ 8K on cards with VRAM headroom; pq4 is the only viable choice at ctx ≥ 16K on 16 GB-class cards with large models.
git clone https://github.com/dknos/pairquant
cd pairquant
cmake -B build -S . -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89 # or 120 for Blackwell
cmake --build build --config Release -j$(nproc)

build/bin/llama-server \
-m gemma4-Q4_0_imatrix.gguf \
--cache-type-k q8_0 --cache-type-v pq4 \
--flash-attn on --ctx-size 32768 \
    --n-gpu-layers 99 --host 0.0.0.0 --port 8080

Or with llama-bench to reproduce headline numbers:
build/bin/llama-bench -m gemma4-Q4_0_imatrix.gguf \
-ctk q8_0 -ctv pq4 -fa 1 -ngl 99 \
    -d 0 -d 1536 -d 7680 -d 31744 -r 3

| File | Change |
|---|---|
| `ggml/src/ggml-cuda/pq4.cuh` | 2D Givens quant/dequant, block layout |
| `ggml/src/ggml-cuda/fattn-vec-pq4.cu` | FA vec kernel instantiations (D=128/256/512) |
| `ggml/src/ggml-cuda/set-rows.cu` | `k_set_rows` write path for `GGML_TYPE_PQ4_0` |
| `ggml/src/ggml-cuda/fattn.cu` | `FATTN_VEC_CASE` dispatch for pq4 |
| `ggml/src/ggml-cuda/fattn-vec.cuh` | `pq4_post_process_VKQ` hook |
| `ggml/src/ggml-utils.cpp` | `dequantize_row_pq4_0`, CPU path |
| `ggml/include/ggml.h` | `GGML_TYPE_PQ4_0` enum entry |
| `tests/test_pq4.c` | 14 unit tests (round-trip, cosine sim, CPU perf) |
| `BENCHMARKS-PQ4.md` | Full ctx-sweep data, quality results |
- Cosine similarity (round-trip synthetic): 0.9952
- L2 norm ratio: 0.9998
- Wikitext-2 PPL vs f16/f16 (Phi-3.5-MoE Q2_K): −0.06% at ctx=512, −0.22% at ctx=2048
- 14/14 unit tests pass (`tests/test_pq4.c`)
For gemma4 IT model PPL behavior see BENCHMARKS-PQ4.md § Quality and llama.cpp#14437.
- CPU dequant + unit tests
- CUDA FA vec kernel (D=128, D=256, D=512)
- Lazy inverse Givens post-process (Option b)
- Write kernel (`k_set_rows`)
- `FATTN_VEC_CASE` dispatch (both `FA_ALL_QUANTS` and default build)
- wikitext-2 perplexity validation
- pq3_0 (3-bit, 3.125 bpv) — CPU scaffold committed, CUDA not yet ported
- Option a re-verification on a clean GPU (measure Option b vs Option a headroom)
CUDA kernel development used Claude Code (claude-opus-4-6 / claude-sonnet-4-6). All builds, benchmarks, and validation runs were done by the repository author.