Add PolarQuant backend to QuantizedCache (Hadamard-rotated Lloyd-Max) #45364
caiovicentino wants to merge 3 commits into huggingface:main
Conversation
Adds a new `polarquant` backend to QuantizedCache, joining the existing
`quanto` and `hqq` options. PolarQuant compresses KV cache vectors via
Walsh-Hadamard rotation followed by Lloyd-Max optimal scalar quantization
for the standard normal distribution.
The implementation follows the existing layer-based pattern:
`PolarQuantizedLayer(QuantizedLayer)` mirrors `QuantoQuantizedLayer` and
`HQQQuantizedLayer`, exposing only `_quantize` / `_dequantize`. Users
access it through the standard `QuantizedCache(backend="polarquant", ...)`
or `cache_implementation="quantized"` + `cache_config={"backend": "polarquant"}`
API.
PolarQuant is fully self-contained: pure PyTorch, zero new dependencies.
Lloyd-Max centroids for N(0, 1) at 2/3/4/5 bits are precomputed and
hardcoded in the module. Non-power-of-two head dimensions (e.g. Phi-3
mini's 96) are handled by zero-padding to the next power of two before
the Hadamard rotation and slicing back during dequantization.
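As a rough sketch of the rotation and padding machinery (the helper names mirror the `build_hadamard` / `next_power_of_two` helpers described later in this PR, but the bodies here are illustrative, not the shipped code):

```python
import torch

def next_power_of_two(n: int) -> int:
    # 96 -> 128, 128 -> 128
    return 1 << (n - 1).bit_length()

def build_hadamard(n: int, dtype=torch.float32) -> torch.Tensor:
    """Sylvester construction; n must be a power of two. Scaled to be orthonormal."""
    h = torch.ones(1, 1, dtype=dtype)
    while h.shape[0] < n:
        h = torch.cat([torch.cat([h, h], dim=1), torch.cat([h, -h], dim=1)], dim=0)
    return h / n**0.5  # symmetric and orthogonal, so H @ H == I

# head_dim=96 (Phi-3-mini): zero-pad to 128, rotate, rotate back, slice the padding off
x = torch.randn(4, 96)
pad = next_power_of_two(96)
x_padded = torch.nn.functional.pad(x, (0, pad - 96))
rotated = x_padded @ build_hadamard(pad)
recovered = (rotated @ build_hadamard(pad))[:, :96]
assert torch.allclose(recovered, x, atol=1e-5)
```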
Round-trip cosine similarity on random bf16 KV tensors at head_dim=128:
- 2-bit: 0.94 (7.5x compression)
- 3-bit: 0.98 (5.1x compression, recommended default)
- 4-bit: 0.995 (3.9x compression)
- 5-bit: 0.999 (3.1x compression)
Closes huggingface#45203.
AI assistance: code drafted with Claude Code (Anthropic) assistance; every
line was reviewed and tested. Math ported from the existing vLLM KV cache
module at github.com/caiovicentino/polarengine-vllm (Apache-2.0).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous hardcoded centroid table had a small asymmetry residue on the order of 5e-5 from the offline Lloyd-Max iteration not fully converging on the 5-bit case. For N(0, 1), the optimal centroids ARE exactly symmetric around zero by the symmetry of the source distribution, so any asymmetry is a numerical artifact rather than a real feature of the optimum.

The tables were recomputed with 500 iterations of a symmetry-preserving Lloyd-Max variant: only the positive half is iterated, the negative half is mirrored, and the central decision boundary is fixed at zero. The resulting tables now have max asymmetry = 0.0 exactly. Cosine similarity is unchanged (within rounding) at all bit-widths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
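For reference, a symmetry-preserving Lloyd-Max iteration of the kind described above can be written in a few lines. This is an illustrative sketch, not the code in the PR; the shipped tables were produced offline and hardcoded:

```python
import math

def _pdf(x):   # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def symmetric_lloyd_max(nbits: int, iters: int = 500):
    """Lloyd-Max centroids for N(0, 1), exactly symmetric around zero.

    Only the positive half is iterated; the negative half is mirrored and the
    central decision boundary is pinned at zero.
    """
    half = 2 ** (nbits - 1)
    # initial guess: positive centroids spread over (0, 3]
    c = [3.0 * (i + 0.5) / half for i in range(half)]
    for _ in range(iters):
        # decision boundaries: 0, midpoints between adjacent centroids, +inf
        b = [0.0] + [(c[i] + c[i + 1]) / 2.0 for i in range(half - 1)] + [math.inf]
        # each centroid moves to the conditional mean of N(0, 1) on its cell:
        #   E[x | a <= x < b] = (pdf(a) - pdf(b)) / (cdf(b) - cdf(a))
        c = [(_pdf(b[i]) - _pdf(b[i + 1])) / (_cdf(b[i + 1]) - _cdf(b[i])) for i in range(half)]
    return [-x for x in reversed(c)] + c   # mirror: max asymmetry is exactly 0.0

print(symmetric_lloyd_max(2))   # 4 centroids, roughly [-1.51, -0.45, 0.45, 1.51]
```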
The previous quantizer normalized each KV vector by its L2 norm, then ran a Walsh-Hadamard rotation, then mapped to Lloyd-Max centroids for N(0, 1). This works perfectly on synthetic Gaussian inputs and produces high cosine similarity (0.98 at 3-bit on random data). But cosine similarity is direction-only. Real attention K/V tensors from production transformers exhibit two pathologies that defeat the per-vector L2 approach:

1. Heavy outliers: Qwen2.5-0.5B K-cache values span [-130, +121] with std ~32, far from the unit Gaussian assumption.
2. Per-channel scale variance: e.g. channel 5 has std 0.19 while channel 6 has std 6.5 - a 30x scale gap that the per-vector L2 norm cannot correct because Hadamard rotates each vector independently and never normalizes across the batch.

The result was that on a chunked-forward PPL test against the cached first chunk, baseline 8.18 PPL ballooned to 308 (3-bit), 79 (4-bit), and 14 (5-bit). Cosine similarity stayed high but the relative L2 error per vector was 17% at 3-bit, which is enough to corrupt the softmax distribution downstream.

The fix is the same per-channel handling that SmoothQuant, AWQ, and KIVI all use: subtract a per-channel mean and divide by a per-channel standard deviation before the rotation. After this step every channel is approximately unit-Gaussian, the rotation preserves the distribution, and the Lloyd-Max codebook prior matches what it was designed for.

Per-channel mean and std are stored as bfloat16 alongside the packed codes, contributing a constant `2 * head_dim * 2` byte overhead per quantize call - independent of how many vectors are being compressed - rather than the linear `2 * N` overhead of the previous per-vector norms. For typical chunk sizes (>= 128 vectors) the new approach is at least as compact as the old one and pulls ahead from there.

Cosine similarity on random Gaussian inputs is unchanged (within rounding) at every supported bit-width. Edge cases (empty tensor, non-power-of-two head dim) are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
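As a back-of-the-envelope check of the overhead claim, using the benchmark model's `head_dim=64` and assuming the old scheme stored one bf16 L2 norm per vector:

```python
head_dim = 64                               # Qwen2.5-0.5B
bf16_bytes = 2

per_channel = 2 * head_dim * bf16_bytes     # mean + std, constant per quantize call -> 256 bytes
per_vector = lambda n: n * bf16_bytes       # one bf16 norm per vector in the old scheme

# parity when n == 2 * head_dim; the per-channel scheme wins for larger chunks
assert per_channel == per_vector(128)       # 256 == 256 at the 128-vector chunk size
```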
Closing as draft after deeper review of prior art. This PR's PolarQuant approach (Walsh-Hadamard rotation + Lloyd-Max scalar quantization) overlaps substantially with Google's TurboQuant (Zandieh et al., 2025), which @SunMarc rightly flagged in #45203. TurboQuant pioneered the rotation-then-per-coordinate-quantization approach for KV cache and achieves better quality on the same task (3.5-bit lossless, 2.5-bit marginal degradation; PolarQuant 5-bit at +0.31 PPL on Qwen2.5-0.5B in our worst-case chunked test). Submitting this PR without proper attribution and a clear differentiating contribution would not be the right move. Will revisit with one of:
Sorry for the noise. cc @SunMarc @jagmarques — appreciate the patience, will return with a tighter scope.
Thanks for the honest close. On the multi-needle question our own data lines up with the concern about single-position tests. Qwen2.5-7B-Instruct K3V3 quant-only, needle positions 25 / 50 / 75 % of 4K context:
Gemma-2-2b-it K3V2 quant-only, same positions:
Neither quantizer has a stable NIAH advantage at that sample size. On 8+ KV head models (Mistral-7B, Llama-3-8B, Qwen3-8B) all three quantizers pass at every position we tested, so the failure mode seems tied to 4-KV-head models specifically. If path (a), the TurboQuant port, lands, we will run it on the same harness and share numbers. As for path (c), the hybrid: the E8 lattice VQ is a separate axis from scalar Lloyd-Max (structured 8-D packing rather than per-coordinate lookup), so Hadamard + E8 would be a fourth option rather than a Polar-plus-Turbo mix.
Thank you for taking the time to run the multi-needle and for the honest framing. The 4-vs-8 KV-head split in your data is the most useful piece of context I've seen on this — it suggests the quantizer-choice question is largely settled for 8+ head modern architectures and only differentiates in the 4-head regime where the per-head representational budget is tighter. A few thoughts on what that implies for any next attempt:
I'll pause on the KV-cache track for now and focus on the WEIGHT side (HLWQ) where the prior-art situation is cleaner. If/when I come back with a port worth running, I'll ping you here. Thanks again for the rigor — this kind of community benchmarking is exactly what makes the format worth contributing to.
Thanks for the careful write-up here. I still think a clean TurboQuant follow-up for QuantizedCache would be very valuable. @caiovicentino if you are already planning to work on that, I would be happy to collaborate instead of duplicating effort. @SunMarc if that still sounds like a good direction, I would be glad to help with a small PR focused on that.
Quick follow-up on the saliency-aware rotation point. I tested K-side sharpening (multiply K-side amax by 1.05 before rotation) on Qwen2.5-7B-Instruct K3V3 quant-only multi-needle. This is the symmetric-config sharpening direction that @domvox flagged on vllm#38479. Qwen2.5-7B-Instruct, K3V3 quant-only, K-side alpha=1.05, positions 25 / 50 / 75 %:
Compared to baseline without sharpening:
K-side sharpening fixed the E8 failure at position 25% without touching the rotation, with zero quality cost on the positions where E8 was already passing. Lloyd-Max at 75% is still failing, so the same fix does not generalize across quantizers. This is still single-needle, n=1 per cell; I will run multi-needle and add value-side asymmetric configs (K3V2) before reading too much into it. But the directional answer to your saliency point looks like 'yes, a per-side scaling factor on the rotated representation matters, and the right side depends on whether the config is symmetric'.
@jagmarques — fast turnaround, and the directional answer is exactly the empirical signal I was hoping for. A few thoughts on the result:

The factorization is clean. Separating "rotation" from "per-side scaling on the rotated representation" is a useful split — it lets you ablate them independently without conflating "the rotation isn't right" with "the post-rotation distribution needs adjustment". Your data shows it's the latter.

The pre-rotation placement is structurally important. A separate result we hit in the weight quantization track (not KV cache) showed that applying a Walsh-Hadamard rotation BEFORE saliency-aware adjustments destroys the outlier signal: the rotation spreads outlier energy uniformly across coordinates, and the detection gap grows roughly Θ(O²) where O is the outlier magnitude. Your fix sidesteps this by sharpening BEFORE the rotation, which preserves the outlier structure that the rotation will then mix. Doing the same sharpening AFTER the rotation would likely not have helped — worth confirming as an ablation if it's cheap.

Lloyd-Max not responding is plausibly a codebook-geometry difference. E8 has a fixed 8-D lattice; widening the dynamic range (×1.05 amax) pushes ambiguous points off cell boundaries, which is exactly where the wrong-cell errors live. Lloyd-Max has per-coordinate scalar centroids — a global amax multiplier mostly just rescales the codebook proportionally without re-binning the marginal distribution. To get Lloyd-Max to respond, the analog would probably be a non-uniform per-channel scaling rather than a single multiplier.

Hypothesis for the K3V2 asymmetric case. If softmax-amplification of K errors is the underlying mechanism, asymmetric configs should see less benefit from K-side sharpening (K is already protected by the richer codebook), and might instead benefit from V-side adjustment to keep value reads stable under aggressive V quantization. Open question whether the same per-side scaling factorization applies, or if V-side asymmetry needs a different transform entirely.

Looking forward to the multi-needle + K3V2 numbers when they're ready. This is the most actionable progression on the KV cache thread I've seen so far.
Sharpening is post-rotation in our code, so the pre-rotation outlier argument does not apply. A monotonicity sweep on Qwen K3V2 at pos 25% shows the fix present at alpha 1.01; a finer 1%-depth sweep finds a narrow 3-depth cliff where K-sharpening rescues 2/3 depths but not depth 0.23 (alpha-resistant up to 1.20). Lloyd-Max centroids do not respond to the same rescaling, consistent with a cell-boundary mechanism.

Gemma-2-2b K3V2 is a different failure: 4/8 positions pass, no alpha helps, disabling softcapping does not help, and a 6K sliding-window-safe test rules out absolute position. A cross-model control on Qwen2.5-7B at the same 8 indices gives 7/8, so the scatter is Gemma-2-2b-specific.
Measured entropy of E8 lattice indices on Mistral-7B. At K2V2 (2-bit keys + values) the coordinates have Shannon entropy 1.32 bits/symbol with only 5 unique values across 114M coordinates. zstd on the raw indices gives 13.7x vs FP16, no calibration, and K2V2 passes 8/8 NIAH positions (FP16 baseline is 7/8).
Summary
Adds a third backend to `QuantizedCache`: `polarquant`. Joins the existing `quanto` and `hqq` options and implements a Walsh-Hadamard rotation plus Lloyd-Max scalar quantization scheme tuned for KV cache compression. Pure PyTorch, zero new dependencies.

Closes #45203.
Coordination
Scope and direction approved by @SunMarc in #45203. The six-point scope agreed in the issue thread is fully implemented:

- `PolarQuantizedLayer` subclass of `QuantizedLayer`, mirroring the layer-based pattern of `QuantoQuantizedLayer` / `HQQQuantizedLayer`
- zero-padding of non-power-of-two `head_dim` before quantization
- Lloyd-Max centroid tables for `N(0, 1)`, hardcoded with exact symmetry — no scipy dependency

cc @jagmarques per the cross-check commitment in the #45203 thread — the independent E8 lattice VQ implementation at nexusquant, the first-and-last-2-layer observation on Qwen2.5-1.5B, and the Phi-3 `head_dim=96` padding path are all referenced in the design.

Not duplicating any existing PR
Searched open PRs against transformers `main` for `polarquant`, `hadamard quantization`, and `KV cache backend`. No overlapping work found. The most recent KV cache quantization change is the layer-refactor that introduced `QuantoQuantizedLayer` / `HQQQuantizedLayer`; this PR plugs into that architecture as a new sibling.

AI assistance disclosure
Code drafted with Claude Code (Anthropic) assistance. Every line was reviewed, tested, and is defensible by the submitter. The math primitives (Hadamard construction, bit packing) were ported from our existing vLLM KV cache module at polarengine-vllm (Apache-2.0, same author). The per-channel z-score handling and the hardcoded symmetric Lloyd-Max table were redesigned during this PR after a chunked-forward PPL benchmark on Qwen2.5-0.5B revealed that a per-vector L2-norm scheme produced unacceptable PPL drift on real attention K/V.
What changed
New file: `src/transformers/integrations/polarquant.py` (~470 lines, pure PyTorch, zero new dependencies). Contents:
- Lloyd-Max centroid tables for `N(0, 1)` at 2/3/4/5 bits, computed offline with a symmetry-preserving Lloyd-Max iteration so the table is exactly symmetric around zero
- `build_hadamard(n)` — cached Sylvester construction (powers of two only)
- `next_power_of_two(n)` — used to zero-pad non-power-of-two head dims (e.g. Phi-3-mini's `head_dim=96`)
- `BitPacker` — dense pack/unpack for 2/3/4/5-bit codes, byte-aligned, with explicit empty-tensor handling
- `PolarQTensor` — dataclass carrying packed codes + per-channel `mean` + per-channel `std` + the original tensor shape
- `polarquant_quantize()` / `polarquant_dequantize()` — stateless primitives

Modified files
- `src/transformers/cache_utils.py` (+109 lines)
  - `PolarQuantizedLayer(QuantizedLayer)` with `_quantize` / `_dequantize` implementing the per-channel-z-score + Hadamard + Lloyd-Max pipeline. The centroid table and Hadamard matrix are lazily initialized on first use, on the same device and dtype as the incoming tensor
  - `"polarquant"` branch in the `QuantizedCache.__init__` backend dispatch (`"quanto"`, `"hqq"`, `"polarquant"`)
- `src/transformers/__init__.py` (+2 lines)
  - `PolarQuantizedLayer` alongside the existing `QuantoQuantizedLayer` / `HQQQuantizedLayer` exports, both in `_import_structure` and in the `TYPE_CHECKING` block
- `tests/utils/test_cache_utils.py` (+186 lines)
  - `PolarQuantizedCacheUnitTest` class with 10 tests: round-trip at `n ∈ {4, 8, 16, 32, 64, 128, 256}`, `head_dim=96` roundtrip via zero-padding (the Phi-3 case), unsupported `nbits` raises `ValueError`, unsupported `axis_key` / `axis_value` raises `ValueError`, `axis=-1` accepted as alias for `axis=0`, and `QuantizedCache(backend="polarquant")` correctly dispatches to `PolarQuantizedLayer` for every transformer layer
  - `test_polarquant_cache_generation` in `CacheIntegrationTest` mirroring the existing quanto/HQQ patterns: drives `model.generate(..., cache_implementation="quantized", cache_config={"backend": "polarquant", ...})` end-to-end and asserts the generation completes and starts with the prompt
- `docs/source/en/kv_cache.md` (+16 lines)

Usage
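A minimal usage sketch through `generate` (the checkpoint and the config values are illustrative; the config keys mirror the fields exercised in the tests above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Walsh-Hadamard transform", return_tensors="pt").to(model.device)

# cache_implementation="quantized" selects QuantizedCache; the backend key picks PolarQuant.
out = model.generate(
    **inputs,
    max_new_tokens=32,
    cache_implementation="quantized",
    cache_config={"backend": "polarquant", "nbits": 4, "residual_length": 64},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```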
Algorithm
For each chunk of `head_dim`-sized vectors that the cache decides to compress:

1. Reshape to `(N, head_dim)` and zero-pad to the next power of two when `head_dim` is not already a power of two.
2. Per-channel z-score: subtract the per-channel mean and divide by the per-channel standard deviation computed over the `N` vectors. After this step every channel is approximately unit-Gaussian. This is the same per-channel handling that SmoothQuant, AWQ, and KIVI all rely on, and is essential because real attention K/V tensors exhibit heavy outliers and large per-channel scale variance that a single per-vector L2 norm cannot correct.
3. Apply the Walsh-Hadamard rotation, after which each coordinate is approximately `N(0, 1)`. Because the Hadamard entries are `±1/sqrt(padded_dim)`, the variance is preserved at 1 by construction, with no extra rescaling step.
4. Map each coordinate to its nearest Lloyd-Max centroid for `N(0, 1)`. Lloyd-Max centroids are provably MSE-optimal scalar quantizers for a given distribution (Max, 1960), so this step is optimal under the Gaussian prior produced by step 3.
5. Pack the codes densely at `nbits` bits per code.

The per-channel `mean` and `std` are stored as `bfloat16` alongside the packed codes. They contribute a constant `2 * head_dim * 2` byte overhead per quantize call, independent of how many vectors are being compressed - so for typical chunks (`>= 128` vectors) the overhead is at parity with or smaller than a per-vector L2-norm scheme.

Dequantization inverts each step: unpack codes, apply Hadamard again (the matrix is symmetric and orthogonal so the inverse equals itself), invert the z-score, slice off any padding, reshape.
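A condensed, self-contained sketch of this round trip, given a 1-D tensor of Lloyd-Max centroids. The function bodies are illustrative; the shipped code caches the Hadamard matrix, carries the stats in `PolarQTensor`, and bit-packs the codes via `BitPacker`:

```python
import torch
import torch.nn.functional as F

def _hadamard(n: int, device, dtype) -> torch.Tensor:
    h = torch.ones(1, 1, device=device, dtype=dtype)
    while h.shape[0] < n:
        h = torch.cat([torch.cat([h, h], dim=1), torch.cat([h, -h], dim=1)], dim=0)
    return h / n**0.5                                       # symmetric + orthonormal: its own inverse

def quantize(x: torch.Tensor, centroids: torch.Tensor):
    """x: (N, head_dim) -> integer codes plus per-channel stats."""
    n, head_dim = x.shape
    mean, std = x.mean(0), x.std(0).clamp_min(1e-5)
    z = (x - mean) / std                                    # per-channel z-score
    pad = 1 << (head_dim - 1).bit_length()                  # next power of two
    z = F.pad(z, (0, pad - head_dim)).float()               # zero-pad (commutes with the z-score: padded channels are all-zero)
    r = z @ _hadamard(pad, z.device, z.dtype)               # Walsh-Hadamard rotation
    codes = (r.unsqueeze(-1) - centroids).abs().argmin(-1)  # nearest Lloyd-Max centroid per coordinate
    return codes, mean.to(torch.bfloat16), std.to(torch.bfloat16), head_dim

def dequantize(codes, mean, std, head_dim, centroids):
    r = centroids[codes]                                    # codes -> centroid values
    z = r @ _hadamard(r.shape[-1], r.device, r.dtype)       # Hadamard is its own inverse
    return z[:, :head_dim] * std.float() + mean.float()     # invert z-score, slice padding off
```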
Non-power-of-two `head_dim` (e.g. Phi-3-mini's 96) is handled transparently by zero-padding to the next power of two before the rotation and slicing the padding off after. This path is unit-tested.

Benchmarks
Model: `Qwen/Qwen2.5-0.5B` (494M parameters, head_dim=64, 24 layers, 2 KV heads). This is a small dense model deliberately picked as a stress test - small models are far more sensitive to KV cache quantization noise than 7B+ models.
Harness: chunked forward PPL on 20 long English passages. Each text is split into two contiguous 32-token chunks. The first chunk is consumed by `model(..., past_key_values=cache)` to populate the cache (which triggers `PolarQuant._quantize` because `residual_length` is set lower than the chunk size). The second chunk is then forwarded against the cached, dequantized first chunk and the loss on its tokens is averaged. This isolates the cross-attention loss to the (de)quantized history and is the worst-case test for the cache: 100% of the first chunk is quantized, with no full-precision residual buffer covering the prefix.

Headline: PolarQuant 5-bit is essentially lossless on this stress test (+4% PPL relative). 4-bit is acceptable for memory-constrained scenarios. 3-bit is too aggressive on a 0.5B model with no residual buffer; on larger models the same 3-bit configuration would degrade much less, but quantifying that requires gated-model access (Llama 3) that I'll add as a follow-up benchmark when the access request clears.
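A condensed sketch of the harness described above, assuming the `QuantizedCache(backend=..., ...)` constructor form referenced earlier; the exact kwargs, the passage handling, and the small `residual_length` value are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantizedCache

model_id = "Qwen/Qwen2.5-0.5B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

def chunked_ppl(text: str, backend: str = "polarquant", nbits: int = 5, chunk: int = 32) -> float:
    ids = tok(text, return_tensors="pt").input_ids[:, : 2 * chunk]
    first, second = ids[:, :chunk], ids[:, chunk:]
    # residual_length below the chunk size forces the whole first chunk to be quantized
    cache = QuantizedCache(backend=backend, nbits=nbits, residual_length=16)
    with torch.no_grad():
        model(first, past_key_values=cache, use_cache=True)            # populate + quantize history
        logits = model(second, past_key_values=cache, use_cache=True).logits
    # average loss of the second chunk's tokens against the (de)quantized first chunk
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)).float(), second[:, 1:].reshape(-1)
    )
    return torch.exp(loss).item()
```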
Round-trip cosine similarity on random `bfloat16` KV tensors at `head_dim=128`, from the unit tests:

- 2-bit: 0.94 (7.5x compression)
- 3-bit: 0.98 (5.1x compression)
- 4-bit: 0.995 (3.9x compression)
- 5-bit: 0.999 (3.1x compression)

The per-channel `mean` and `std` overhead is constant per quantize call (independent of `N`), so for batched/long-context workloads the effective compression matches the headline ratios above.

Testing
Tested locally and on Colab RTX PRO 6000 Blackwell (96 GB). All 10 unit tests pass; the integration test passes; the chunked PPL benchmark gives the numbers reported above.
`ruff check` and `ruff format --check` are clean on the four modified files plus the new `polarquant.py`. The remaining `make style` failures all live in `utils/get_test_reports.py` and `utils/create_dummy_models.py` and are pre-existing on `main` - none of them touch files modified by this PR.

TurboQuant note
@SunMarc flagged Google's TurboQuant as potentially complementary in the issue thread. TurboQuant uses random rotations followed by uniform quantization; PolarQuant uses a deterministic Walsh-Hadamard rotation followed by Lloyd-Max scalar quantization. The two approaches share the core insight that "rotation before quantization decorrelates outliers" but land on different choices for the rotation generator and the codebook. A unified "rotation-based cache quantization" path could subsume both in a future PR - happy to explore that as a follow-up once this lands.
Honest limitations
- `nbits=5` is the recommended production default, with `nbits=3` as a memory-constrained option that the user explicitly opts into.
- `_quantize` recomputes a fresh `mean` and `std` from whatever vectors it's compressing. For a residual-overflow re-quantization that includes both old quantized history and new tokens, this means the stats shift over time as more context accumulates. KIVI handles this by keeping per-channel stats stable across the lifetime of the cache; doing so would require a slightly larger surface change to `QuantizedLayer` and is left as follow-up.
- Triton kernels exist in the `polarengine-vllm` repo but depend on Triton's version matrix, which adds CI complexity. I dropped them to keep this PR pure-PyTorch. A follow-up can add an optional Triton fast path behind `is_triton_available()`.
- A direct comparison against `quanto`/HQQ on the same chunked PPL test would be ideal, but I hit a `huggingface_hub`/`diffusers` dependency conflict in the Colab environment when installing `optimum-quanto`. Happy to run this comparison in CI if a reviewer can confirm the right environment setup.