Add PolarQuant backend to QuantizedCache (Hadamard-rotated Lloyd-Max) #45364

Closed

caiovicentino wants to merge 3 commits into huggingface:main from caiovicentino:polarquant-kv-cache

Conversation

@caiovicentino

Summary

Adds a third backend to QuantizedCache: polarquant. It joins the existing quanto and hqq options, implementing a Walsh-Hadamard rotation plus Lloyd-Max scalar quantization scheme tuned for KV cache compression. Pure PyTorch, zero new dependencies.

Closes #45203.

Coordination

Scope and direction approved by @SunMarc in #45203:

  • 2026-04-09: weight quantization rejected ("not worth adding unless supported by vLLM etc"), KV cache approved
  • 2026-04-10: "Happy to have a PR for PolarQuantizedCache"

The six-point scope agreed in the issue thread is fully implemented:

  1. PolarQuantizedLayer subclass of QuantizedLayer, mirroring the layer-based pattern of QuantoQuantizedLayer / HQQQuantizedLayer
  2. Hadamard rotation on head_dim before quantization
  3. Lloyd-Max optimal centroids for N(0, 1), hardcoded with exact symmetry — no scipy dependency
  4. Bit-widths 2, 3, 4, 5 (default 3)
  5. Test suite: 10 unit tests + 1 end-to-end integration test
  6. WikiText-style PPL benchmark vs an unquantized baseline (results below)

cc @jagmarques per the cross-check commitment in the #45203 thread — the independent E8 lattice VQ implementation at nexusquant, the first-and-last-2-layer observation on Qwen2.5-1.5B, and the Phi-3 head_dim=96 padding path are all referenced in the design.

Not duplicating any existing PR

Searched open PRs against transformers main for polarquant, hadamard quantization, and KV cache backend. No overlapping work found. The most recent KV cache quantization change is the layer-refactor that introduced QuantoQuantizedLayer / HQQQuantizedLayer; this PR plugs into that architecture as a new sibling.

AI assistance disclosure

Code drafted with Claude Code (Anthropic) assistance. Every line was reviewed, tested, and is defensible by the submitter. The math primitives (Hadamard construction, bit packing) were ported from our existing vLLM KV cache module at polarengine-vllm (Apache-2.0, same author). The per-channel z-score handling and the hardcoded symmetric Lloyd-Max table were redesigned during this PR after a chunked-forward PPL benchmark on Qwen2.5-0.5B revealed that a per-vector L2-norm scheme produced unacceptable PPL drift on real attention K/V.


What changed

New file

src/transformers/integrations/polarquant.py (~470 lines, pure PyTorch, zero new dependencies)

Contents:

  • Hardcoded Lloyd-Max centroids for N(0, 1) at 2/3/4/5 bits, computed offline with a symmetry-preserving Lloyd-Max iteration so the table is exactly symmetric around zero
  • build_hadamard(n) — cached Sylvester construction (powers of two only; sketched after this list)
  • next_power_of_two(n) — used to zero-pad non-power-of-two head dims (e.g. Phi-3-mini's head_dim=96)
  • BitPacker — dense pack/unpack for 2/3/4/5-bit codes, byte-aligned, with explicit empty-tensor handling
  • PolarQTensor — dataclass carrying packed codes + per-channel mean + per-channel std + the original tensor shape
  • polarquant_quantize() / polarquant_dequantize() — stateless primitives
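
A minimal sketch of the two geometry helpers, assuming the normalized Sylvester construction described above; the actual module may differ in caching and validation details:

import torch
from functools import lru_cache

@lru_cache(maxsize=None)
def build_hadamard(n: int) -> torch.Tensor:
    """Normalized Sylvester-Hadamard matrix (n must be a power of two)."""
    if n == 0 or n & (n - 1):
        raise ValueError(f"n must be a power of two, got {n}")
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        # Sylvester doubling: H_{2k} = [[H_k, H_k], [H_k, -H_k]]
        h = torch.cat([torch.cat([h, h], dim=1), torch.cat([h, -h], dim=1)], dim=0)
    return h / n**0.5  # entries are ±1/sqrt(n), so H @ H.T == I and H == H^-1

def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n, e.g. 96 -> 128 for Phi-3-mini."""
    return 1 << (n - 1).bit_length()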

Modified files

src/transformers/cache_utils.py (+109 lines)

  • New class PolarQuantizedLayer(QuantizedLayer) with _quantize / _dequantize implementing the per-channel-z-score + Hadamard + Lloyd-Max pipeline. The centroid table and Hadamard matrix are lazily initialized on first use, on the same device and dtype as the incoming tensor (a hedged skeleton follows this list)
  • New "polarquant" branch in the QuantizedCache.__init__ backend dispatch
  • Docstring update: backend list now ("quanto", "hqq", "polarquant")
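
A skeleton of how the new layer slots into the existing pattern. The class and method names come from this PR's description; the import paths and exact signatures below are assumptions, not the merged diff:

# Illustrative only: QuantizedLayer and the polarquant_* primitives are the
# names described in this PR; signatures are guesses at the surrounding API.
from transformers.cache_utils import QuantizedLayer                 # assumed path
from transformers.integrations.polarquant import (                  # new module
    polarquant_quantize, polarquant_dequantize,
)

class PolarQuantizedLayer(QuantizedLayer):
    def _quantize(self, tensor, axis):
        # Centroid table / Hadamard matrix are built lazily on first use,
        # on the incoming tensor's device and dtype (see bullet above).
        return polarquant_quantize(tensor, nbits=self.nbits)

    def _dequantize(self, qtensor):
        # Unpack codes -> apply H again (its own inverse) -> invert z-score.
        return polarquant_dequantize(qtensor)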

src/transformers/__init__.py (+2 lines)

  • Export PolarQuantizedLayer alongside the existing QuantoQuantizedLayer / HQQQuantizedLayer exports, both in _import_structure and in the TYPE_CHECKING block

tests/utils/test_cache_utils.py (+186 lines)

  • New PolarQuantizedCacheUnitTest class with 10 tests:
    • Centroid table is sorted, the right length, and exactly symmetric around zero
    • Hadamard matrix is orthogonal at n ∈ {4, 8, 16, 32, 64, 128, 256} (sketched after this list)
    • BitPacker roundtrip at every supported bit-width and several head dimensions
    • Quantize/dequantize shape preservation
    • Quantize/dequantize cosine similarity above bit-width-specific thresholds
    • Non-power-of-two head_dim=96 roundtrip via zero-padding (the Phi-3 case)
    • Invalid nbits raises ValueError
    • Invalid axis_key / axis_value raises ValueError
    • axis=-1 accepted as alias for axis=0
    • QuantizedCache(backend="polarquant") correctly dispatches to PolarQuantizedLayer for every transformer layer
  • New test_polarquant_cache_generation in CacheIntegrationTest mirroring the existing quanto / HQQ patterns: drives model.generate(..., cache_implementation="quantized", cache_config={"backend": "polarquant", ...}) end-to-end and asserts the generation completes and starts with the prompt
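
For flavor, the orthogonality check could look roughly like this (a sketch; the PR's actual test bodies may differ):

import pytest
import torch
from transformers.integrations.polarquant import build_hadamard  # new module in this PR

@pytest.mark.parametrize("n", [4, 8, 16, 32, 64, 128, 256])
def test_hadamard_is_orthogonal(n):
    h = build_hadamard(n)
    # Normalized Sylvester matrices satisfy H @ H.T == I up to fp error.
    torch.testing.assert_close(h @ h.T, torch.eye(n, dtype=h.dtype), atol=1e-5, rtol=0)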

docs/source/en/kv_cache.md (+16 lines)

  • Documentation for the third backend with a working code sample

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", dtype=torch.bfloat16, device_map="cuda"
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")

out = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="quantized",
    cache_config={
        "backend": "polarquant",
        "nbits": 3,                # one of {2, 3, 4, 5}, default 3
        "residual_length": 128,    # recent tokens kept in full precision
        "q_group_size": 64,        # unused by polarquant, kept for parity
        "axis_key": 0,             # 0 or -1 (both mean "last dim")
        "axis_value": 0,
    },
)

Algorithm

For each chunk of head_dim-sized vectors that the cache decides to compress (a condensed code sketch of the full round trip appears at the end of this section):

  1. Reshape to (N, head_dim) and zero-pad to the next power of two when head_dim is not already a power of two.
  2. Per-channel z-score: subtract a per-channel mean and divide by a per-channel standard deviation, both computed across the batch of N vectors. After this step every channel is approximately unit-Gaussian. This is the same per-channel handling that SmoothQuant, AWQ, and KIVI all rely on, and is essential because real attention K/V tensors exhibit heavy outliers and large per-channel scale variance that a single per-vector L2 norm cannot correct.
  3. Walsh-Hadamard rotation: an orthogonal linear transform that mixes the per-channel components. Each rotated coordinate is a linear combination of the per-channel-Gaussian inputs, so its marginal is also approximately N(0, 1). Because the Hadamard entries are ±1/sqrt(padded_dim), the variance is preserved at 1 by construction, with no extra rescaling step.
  4. Lloyd-Max scalar quantization: each rotated coordinate is mapped to its nearest centroid in a hardcoded Lloyd-Max codebook for N(0, 1). Lloyd-Max centroids are provably MSE-optimal scalar quantizers for a given distribution (Max, 1960), so this step is optimal under the Gaussian prior produced by step 3.
  5. Bit-pack the integer codes into dense uint8 tensors at exactly nbits bits per code.

The per-channel mean and std are stored as bfloat16 alongside the packed codes. They contribute a constant 2 * head_dim * 2 byte overhead per quantize call, independent of how many vectors are being compressed. At the benchmark model's head_dim=64 that is 2 * 64 * 2 = 256 bytes, which a per-vector L2-norm scheme (one bf16 norm, 2 bytes, per vector) only matches at N = 128, so for typical chunks (>= 128 vectors) the overhead is at parity with or smaller than a per-vector scheme.

Dequantization inverts each step: unpack codes, apply Hadamard again (the matrix is symmetric and orthogonal so the inverse equals itself), invert the z-score, slice off any padding, reshape.

Non-power-of-two head_dim (e.g. Phi-3-mini's 96) is handled transparently by zero-padding to the next power of two before the rotation and slicing the padding off after. This path is unit-tested.
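
Putting the five steps and their inverses together, a condensed sketch of the round trip. It reuses build_hadamard from the earlier sketch and a 1-D centroids tensor; bit packing is elided, and everything besides the names listed in the new-file contents is hypothetical:

import torch

def quantize_sketch(x: torch.Tensor, centroids: torch.Tensor):
    """x: (N, head_dim) chunk of K or V vectors; centroids: 1-D Lloyd-Max table."""
    n, d = x.shape
    p = 1 << (d - 1).bit_length()                     # next power of two
    if p != d:
        x = torch.nn.functional.pad(x, (0, p - d))    # step 1: zero-pad channels
    mean = x.mean(dim=0)                              # step 2: per-channel stats
    std = x.std(dim=0).clamp_min(1e-6)                #         across the N vectors
    z = (x - mean) / std                              #         ~N(0, 1) per channel
    h = build_hadamard(p).to(x.dtype)                 # step 3: orthogonal rotation,
    r = z @ h                                         #         variance preserved
    codes = (r.unsqueeze(-1) - centroids).abs().argmin(dim=-1)  # step 4: nearest centroid
    # step 5 (elided): BitPacker packs `codes` at exactly nbits bits per code
    return codes, mean.to(torch.bfloat16), std.to(torch.bfloat16), d

def dequantize_sketch(codes, mean, std, d, centroids):
    r = centroids[codes]                              # codebook lookup
    h = build_hadamard(codes.shape[-1]).to(r.dtype)
    z = r @ h                                         # H is symmetric orthogonal: H^-1 == H
    x = z * std.to(z.dtype) + mean.to(z.dtype)        # invert the z-score
    return x[:, :d]                                   # slice off any padding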


Benchmarks

Model: Qwen/Qwen2.5-0.5B (494M parameters, head_dim=64, 24 layers, 2 KV heads).
This is a small dense model deliberately picked as a stress test - small models are far more sensitive to KV cache quantization noise than 7B+ models.

Harness: chunked forward PPL on 20 long English passages. Each text is split into two contiguous 32-token chunks. The first chunk is consumed by model(..., past_key_values=cache) to populate the cache (which triggers PolarQuantizedLayer._quantize because residual_length is set lower than the chunk size). The second chunk is then forwarded against the cached, dequantized first chunk and the loss on its tokens is averaged. This isolates the cross-attention loss to the (de)quantized history and is the worst-case test for the cache: 100% of the first chunk is quantized, with no full-precision residual buffer covering the prefix.
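
A stripped-down sketch of that harness (single passage; model loading, cache construction, and averaging over passages are elided, so treat the signature as illustrative):

import torch

@torch.no_grad()
def chunked_ppl_once(model, input_ids, cache, chunk=32):
    first = input_ids[:, :chunk]                 # chunk 1: populates (and quantizes) the cache
    second = input_ids[:, chunk : 2 * chunk]     # chunk 2: scored against the cached history
    out = model(first, past_key_values=cache, use_cache=True)
    # The second chunk attends to the dequantized first chunk; loss is
    # averaged over the second chunk's tokens only.
    out = model(second, past_key_values=out.past_key_values,
                labels=second, use_cache=True)
    return out.loss.exp().item()                 # per-passage PPL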

| Backend | nbits | PPL | Δ vs FP16 | Relative |
|---|---|---|---|---|
| FP16 baseline (DynamicCache) | | 7.62 | | |
| polarquant | 5 | 7.94 | +0.31 | +4% |
| polarquant | 4 | 13.44 | +5.82 | +76% |
| polarquant | 3 | 63.81 | +56.19 | +737% |

Headline: PolarQuant 5-bit is essentially lossless on this stress test (+4% PPL relative). 4-bit is acceptable for memory-constrained scenarios. 3-bit is too aggressive on a 0.5B model with no residual buffer; on larger models the same 3-bit configuration would degrade much less, but quantifying that requires gated-model access (Llama 3) that I'll add as a follow-up benchmark when the access request clears.

Round-trip cosine similarity on random bfloat16 KV tensors at head_dim=128, from the unit tests:

| nbits | min threshold | measured | compression ratio (head_dim=128) |
|---|---|---|---|
| 2 | 0.80 | 0.94 | 7.5x |
| 3 | 0.95 | 0.98 | 5.1x |
| 4 | 0.98 | 0.995 | 3.9x |
| 5 | 0.99 | 0.999 | 3.1x |

The per-channel mean and std overhead is constant per quantize call (independent of N), so for batched/long-context workloads the effective compression matches the headline ratios above.
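
As a sanity check on the last column: the ideal packed-codes ratio against fp16 is 16/nbits, and the measured ratios sit slightly below that ideal because of the constant per-call mean/std overhead:

for nbits in (2, 3, 4, 5):
    print(f"{nbits}-bit: ideal {16 / nbits:.2f}x")  # 8.00x, 5.33x, 4.00x, 3.20x
                                                    # vs measured 7.5 / 5.1 / 3.9 / 3.1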


Testing

# Unit tests (CPU, fast)
pytest -xvs tests/utils/test_cache_utils.py::PolarQuantizedCacheUnitTest

# Integration test (requires a GPU and the SmolLM2-135M test model already
# used by CacheIntegrationTest, plus a Qwen2.5-0.5B for the PPL benchmark)
pytest -xvs tests/utils/test_cache_utils.py::CacheIntegrationTest::test_polarquant_cache_generation

# Style + typing (clean on the modified files)
make style

Tested locally and on Colab RTX PRO 6000 Blackwell (96 GB). All 10 unit tests pass; the integration test passes; the chunked PPL benchmark gives the numbers reported above. ruff check and ruff format --check are clean on the four modified files plus the new polarquant.py. The remaining make style failures all live in utils/get_test_reports.py and utils/create_dummy_models.py and are pre-existing on main - none of them touch files modified by this PR.


TurboQuant note

@SunMarc flagged Google's TurboQuant as potentially complementary in the issue thread. TurboQuant uses random rotations followed by uniform quantization; PolarQuant uses a deterministic Walsh-Hadamard rotation followed by Lloyd-Max scalar quantization. The two approaches share the core insight that "rotation before quantization decorrelates outliers" but land on different choices for the rotation generator and the codebook. A unified "rotation-based cache quantization" path could subsume both in a future PR - happy to explore that as a follow-up once this lands.


Honest limitations

  • Small-model sensitivity at low bit-widths. PolarQuant 3-bit shows large PPL drift on a 0.5B model under a worst-case (no residual buffer) test. This is the regime where every existing cache quantizer also struggles; on 7B+ models the same 3-bit configuration is much better behaved. Until I can re-run the benchmark on a larger model (gated access pending), I'd recommend nbits=5 as the production default and nbits=3 as a memory-constrained option that the user explicitly opts into.
  • Per-channel statistics are computed per quantize call, not per layer. Each call to _quantize recomputes a fresh mean and std from whatever vectors it's compressing. For a residual-overflow re-quantization that includes both old quantized history and new tokens, this means the stats shift over time as more context accumulates. KIVI handles this by keeping per-channel stats stable across the lifetime of the cache; doing so would require a slightly larger surface change to QuantizedLayer and is left as follow-up.
  • No Triton kernels yet. The existing Triton kernels for nearest-centroid search live in the upstream polarengine-vllm repo but depend on Triton's version matrix, which adds CI complexity. I dropped them to keep this PR pure-PyTorch. A follow-up can add an optional Triton fast path behind is_triton_available().
  • First-and-last-layer carve-out not exposed as config. @jagmarques noted in the issue thread that the first and last two decoder layers sometimes need to stay at full precision on small Qwen variants. I did not add a skip-layers config to keep the first PR focused; this is a natural follow-up if needed.
  • Benchmarked only against an unquantized baseline. A direct head-to-head against quanto / HQQ on the same chunked PPL test would be ideal but I hit a huggingface_hub / diffusers dependency conflict in the Colab environment when installing optimum-quanto. Happy to run this comparison in CI if a reviewer can confirm the right environment setup.

caiovicentino and others added 3 commits April 10, 2026 14:12
Adds a new `polarquant` backend to QuantizedCache, joining the existing
`quanto` and `hqq` options. PolarQuant compresses KV cache vectors via
Walsh-Hadamard rotation followed by Lloyd-Max optimal scalar quantization
for the standard normal distribution.

The implementation follows the existing layer-based pattern:
`PolarQuantizedLayer(QuantizedLayer)` mirrors `QuantoQuantizedLayer` and
`HQQQuantizedLayer`, exposing only `_quantize` / `_dequantize`. Users
access it through the standard `QuantizedCache(backend="polarquant", ...)`
or `cache_implementation="quantized"` + `cache_config={"backend": "polarquant"}`
API.

PolarQuant is fully self-contained: pure PyTorch, zero new dependencies.
Lloyd-Max centroids for N(0, 1) at 2/3/4/5 bits are precomputed and
hardcoded in the module. Non-power-of-two head dimensions (e.g. Phi-3
mini's 96) are handled by zero-padding to the next power of two before
the Hadamard rotation and slicing back during dequantization.

Round-trip cosine similarity on random bf16 KV tensors at head_dim=128:
- 2-bit: 0.94 (7.5x compression)
- 3-bit: 0.98 (5.1x compression, recommended default)
- 4-bit: 0.995 (3.9x compression)
- 5-bit: 0.999 (3.1x compression)

Closes huggingface#45203.

AI assistance: code drafted with Claude Code (Anthropic) assistance; every
line was reviewed and tested. Math ported from the existing vLLM KV cache
module at github.com/caiovicentino/polarengine-vllm (Apache-2.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous hardcoded centroid table had a small asymmetry residue on
the order of 5e-5 from the offline Lloyd-Max iteration not fully
converging on the 5-bit case. For N(0, 1), the optimal centroids ARE
exactly symmetric around zero by the symmetry of the source
distribution, so any asymmetry is a numerical artifact rather than a
real feature of the optimum.

Recompute with 500 iterations of a symmetry-preserving Lloyd-Max
variant: only the positive half is iterated, the negative half is
mirrored, and the central decision boundary is fixed at zero. The
resulting tables now have max asymmetry = 0.0 exactly.

Cosine similarity is unchanged (within rounding) at all bit-widths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
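
For reference, a minimal offline sketch of the symmetry-preserving variant this commit describes, using dense sampling instead of closed-form integrals; the real table-generation script may differ:

import torch

def symmetric_lloyd_max(nbits: int, iters: int = 500, n_samples: int = 2_000_000):
    torch.manual_seed(0)
    x = torch.randn(n_samples).abs()          # positive half of N(0, 1); the
    k = 2 ** (nbits - 1)                      # negative half is mirrored below
    pos = torch.quantile(x, (torch.arange(k) + 0.5) / k)   # quantile init
    for _ in range(iters):
        # Decision boundaries: midpoints between adjacent centroids; the
        # central boundary is pinned at zero by construction.
        edges = (pos[1:] + pos[:-1]) / 2
        bins = torch.bucketize(x, edges)      # assign each sample to a cell
        pos = torch.stack([x[bins == i].mean() for i in range(k)])  # re-center
    return torch.cat([-pos.flip(0), pos])     # exactly symmetric around zero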
The previous quantizer normalized each KV vector by its L2 norm, then
ran a Walsh-Hadamard rotation, then mapped to Lloyd-Max centroids for
N(0, 1). This works perfectly on synthetic Gaussian inputs and produces
high cosine similarity (0.98 at 3-bit on random data).

But cosine similarity is direction-only. Real attention K/V tensors
from production transformers exhibit two pathologies that defeat the
per-vector L2 approach:

1. Heavy outliers: Qwen2.5-0.5B K-cache values span [-130, +121] with
   std ~32, far from the unit Gaussian assumption.
2. Per-channel scale variance: e.g. channel 5 has std 0.19 while
   channel 6 has std 6.5 - a 30x scale gap that the per-vector L2
   norm cannot correct because Hadamard rotates each vector
   independently and never normalizes across the batch.

The result was that on a chunked-forward PPL test against the cached
first chunk, baseline 8.18 PPL ballooned to 308 (3-bit), 79 (4-bit),
and 14 (5-bit). Cosine similarity stayed high but the relative L2
error per vector was 17% at 3-bit, which is enough to corrupt the
softmax distribution downstream.

The fix is the same per-channel handling that SmoothQuant, AWQ,
and KIVI all use: subtract a per-channel mean and divide by a
per-channel standard deviation before the rotation. After this
step every channel is approximately unit-Gaussian, the rotation
preserves the distribution, and the Lloyd-Max codebook prior
matches what it was designed for. Per-channel mean and std are
stored as bfloat16 alongside the packed codes, contributing a
constant `2 * head_dim * 2` byte overhead per quantize call -
independent of how many vectors are being compressed - rather
than the linear `2 * N` overhead of the previous per-vector
norms. For typical chunk sizes (>= 128 vectors) the new approach
is at least as compact as the old one and pulls ahead from there.

Cosine similarity on random Gaussian inputs is unchanged (within
rounding) at every supported bit-width. Edge cases (empty tensor,
non-power-of-two head dim) are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@caiovicentino
Author

Closing this PR after a deeper review of prior art.

This PR's PolarQuant approach (Walsh-Hadamard rotation + Lloyd-Max scalar quantization) overlaps substantially with Google's TurboQuant (Zandieh et al., 2025), which @SunMarc rightly flagged in #45203. TurboQuant pioneered the rotation-then-per-coordinate-quantization approach for KV cache and achieves better quality on the same task (3.5-bit lossless, 2.5-bit marginal degradation; PolarQuant 5-bit at +0.31 PPL on Qwen2.5-0.5B in our worst-case chunked test). Submitting this PR without proper attribution and a clear differentiating contribution would not be the right move.

Will revisit with one of:

  1. A direct TurboQuant implementation for QuantizedCache (gives transformers users the strongest known approach, full credit to Google).
  2. A reframed PolarQuant focused on a measured deterministic / no-seed-state / throughput advantage from the Hadamard rotation, with explicit TurboQuant attribution.
  3. A hybrid combining Walsh-Hadamard rotation with TurboQuant's optimal-for-Beta codebooks and 1-bit QJL residual.

Sorry for the noise. cc @SunMarc @jagmarques — appreciate the patience, will return with a tighter scope.

@jagmarques

Thanks for the honest close. On the multi-needle question, our own data lines up with the concern about single-position tests.

Qwen2.5-7B-Instruct K3V3 quant-only, needle positions 25 / 50 / 75 % of 4K context:

  • E8: NO / YES / YES
  • Lloyd-Max: YES / YES / NO

Gemma-2-2b-it K3V2 quant-only, same positions:

  • E8: YES / YES / NO
  • Lloyd-Max: NO / YES / (timeout, rerunning)

Neither quantizer has a stable NIAH advantage at that sample size. On 8+ KV head models (Mistral-7B, Llama-3-8B, Qwen3-8B) all three quantizers pass at every position we tested, so the failure mode seems tied to 4-KV-head models specifically.

If path (a), the TurboQuant port, lands, we will run it on the same harness and share numbers. If path (c), the hybrid, is pursued, the E8 lattice VQ is a separate axis from scalar Lloyd-Max (structured 8-D packing rather than per-coordinate lookup), so Hadamard + E8 would be a fourth option rather than a Polar-plus-Turbo mix.

@caiovicentino
Author

Thank you for taking the time to run the multi-needle and for the honest framing. The 4-vs-8 KV-head split in your data is the most useful piece of context I've seen on this — it suggests the quantizer-choice question is largely settled for 8+ head modern architectures and only differentiates in the 4-head regime where the per-head representational budget is tighter.

A few thoughts on what that implies for any next attempt:

  1. TurboQuant port (path a): I'd do this honestly as a port, not as my own contribution, with Han et al. and Zandieh et al. cited prominently. If I get to it, you'll be the first ping for harness numbers.

  2. E8 as a separate axis: you're right that E8 lattice VQ is structurally orthogonal to scalar Lloyd-Max — 8-D codebook lookup, not a per-coordinate operation. The clean design space looks like:

    • Scalar Lloyd-Max + deterministic Hadamard rotation (the approach in this closed PR)
    • SRHT + Max-Lloyd codebook + 1-bit QJL residual (TurboQuant, Zandieh et al., arXiv:2504.19874)
    • E8 lattice VQ + Hadamard pre-rotation

    All of these are conceptually adjacent to Han et al.'s random polar rotation in PolarQuant (arXiv:2502.02617), which should be cited as prior art for any of them.

    The fact that the three quantizers in your test suite all pass NIAH on 8+ head models suggests the meaningful comparison is on bits-per-token at fixed quality, not functional correctness.

  3. The 4-KV-head failure mode: it would be interesting to know whether the failures correlate with attention entropy at the failing position, or with the specific channels you hit after rotation. If there's a pattern there, the fix might be saliency-aware rotation rather than a different quantizer family.

I'll pause on the KV-cache track for now and focus on the WEIGHT side (HLWQ) where the prior-art situation is cleaner. If/when I come back with a port worth running, I'll ping you here. Thanks again for the rigor — this kind of community benchmarking is exactly what makes the format worth contributing to.

@AlankritVerma01

Thanks for the careful write-up here.

I still think a clean TurboQuant follow-up for QuantizedCache would be very valuable.

@caiovicentino if you are already planning to work on that, I would be happy to collaborate instead of duplicating effort.

@SunMarc if that still sounds like a good direction, I would be glad to help with a small PR focused on that.

@jagmarques

Quick follow-up on the saliency-aware rotation point. I tested K-side sharpening (multiply K-side amax by 1.05 before quantization) on Qwen2.5-7B-Instruct K3V3 quant-only multi-needle. This is the symmetric-config sharpening direction that @domvox flagged on vllm#38479.

Qwen2.5-7B-Instruct, K3V3 quant-only, K-side alpha=1.05, positions 25 / 50 / 75 %:

  • E8: YES / YES / YES
  • Lloyd-Max: YES / YES / NO

Compared to baseline without sharpening:

  • E8: NO / YES / YES (now passes 3/3)
  • Lloyd-Max: YES / YES / NO (unchanged)

K-side sharpening fixed the E8 failure at position 25% without touching the rotation, with zero quality cost on the positions where E8 was already passing. Lloyd-Max at 75% is still failing, so the same fix does not generalize across quantizers.

This is single-needle still, n=1 per cell. Will run multi-needle and add value-side asymmetric configs (K3V2) before reading too much into it. But the directional answer to your saliency point looks like 'yes, a per-side scaling factor on the rotated representation matters and the right side depends on whether the config is symmetric'.

@caiovicentino
Author

@jagmarques — fast turnaround, and the directional answer is exactly the empirical signal I was hoping for. A few thoughts on the result:

The factorization is clean. Separating "rotation" from "per-side scaling on the rotated representation" is a useful split — it lets you ablate them independently without conflating "the rotation isn't right" with "the post-rotation distribution needs adjustment". Your data shows it's the latter.

The pre-rotation placement is structurally important. A separate result we hit in the weight quantization track (not KV cache) showed that applying a Walsh-Hadamard rotation BEFORE saliency-aware adjustments destroys the outlier signal: the rotation spreads outlier energy uniformly across coordinates, and the detection gap grows roughly Θ(O²) where O is the outlier magnitude. Your fix sidesteps this by sharpening BEFORE the rotation, which preserves the outlier structure that the rotation will then mix. Doing the same sharpening AFTER the rotation would likely not have helped — worth confirming as an ablation if it's cheap.

Lloyd-Max not responding is plausibly a codebook-geometry difference. E8 has a fixed 8-D lattice; widening the dynamic range (×1.05 amax) pushes ambiguous points off cell boundaries, which is exactly where the wrong-cell errors live. Lloyd-Max has per-coordinate scalar centroids — a global amax multiplier mostly just rescales the codebook proportionally without re-binning the marginal distribution. To get Lloyd-Max to respond, the analog would probably be a non-uniform per-channel scaling rather than a single multiplier.

Hypothesis for the K3V2 asymmetric case. If softmax-amplification of K errors is the underlying mechanism, asymmetric configs should see less benefit from K-side sharpening (K is already protected by the richer codebook), and might instead benefit from V-side adjustment to keep value reads stable under aggressive V quantization. Open question whether the same per-side scaling factorization applies, or if V-side asymmetry needs a different transform entirely.

Looking forward to the multi-needle + K3V2 numbers when they're ready. This is the most actionable progression on the KV cache thread I've seen so far.

@jagmarques

Sharpening is post-rotation in our code, so the pre-rotation outlier argument does not apply. A monotonicity sweep on Qwen K3V2 at position 25% shows the fix present from alpha 1.01; a finer 1%-depth sweep finds a narrow 3-depth cliff where K-side sharpening rescues 2/3 depths but not depth 0.23 (alpha-resistant up to 1.20). Lloyd-Max centroids do not respond to the same rescaling, consistent with a cell-boundary mechanism.

Gemma-2-2b K3V2 is a different failure: 4/8 positions pass, no alpha helps, disabling softcapping does not help, and a 6K sliding-window-safe test rules out absolute position. A cross-model control on Qwen2.5-7B at the same 8 indices gives 7/8, so the scatter is Gemma-2-2b-specific.

@jagmarques

Measured entropy of E8 lattice indices on Mistral-7B. At K2V2 (2-bit keys + values) the coordinates have Shannon entropy 1.32 bits/symbol with only 5 unique values across 114M coordinates. zstd on the raw indices gives 13.7x vs FP16, no calibration, and K2V2 passes 8/8 NIAH positions (FP16 baseline is 7/8).
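
For anyone reproducing that measurement, the per-coordinate Shannon entropy of an index tensor is a short computation (generic sketch, not @jagmarques' harness):

import torch

def shannon_entropy_bits(codes: torch.Tensor) -> float:
    # Empirical entropy of the flattened code/index distribution, in bits/symbol.
    _, counts = codes.flatten().unique(return_counts=True)
    p = counts.double() / counts.sum()
    return float(-(p * p.log2()).sum())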
