TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% of FP16 speed.
Updated Apr 2, 2026 · Python
Based on Google's TurboQuant (ICLR 2026), Quansloth brings state-of-the-art KV cache compression to local LLM inference. Quansloth is a fully private, air-gapped AI server that runs large-context models natively on consumer hardware.
TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement
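To get a rough sense of what scalar quantization of embeddings buys, here is a minimal NumPy sketch: per-vector int8 codes give a 4x size reduction over fp32 while usually preserving top-1 retrieval. This is a generic baseline for illustration only, not the TurboQuant codec implemented by the repos above; all shapes and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)     # database vectors
queries = rng.standard_normal((10, 64)).astype(np.float32)

# Symmetric per-vector int8 quantization: one fp32 scale per vector,
# so storage drops from 4 bytes to ~1 byte per dimension.
scales = np.abs(db).max(axis=1, keepdims=True) / 127.0
codes = np.round(db / scales).astype(np.int8)

# Top-1 retrieval agreement between exact and dequantized inner products.
exact = np.argmax(db @ queries.T, axis=0)
approx = np.argmax((codes.astype(np.float32) * scales) @ queries.T, axis=0)
recall = float((exact == approx).mean())
print(f"top-1 agreement: {recall:.2f}")
```

Real codecs (product quantization, the TurboQuant schemes) trade more machinery for much higher compression at similar recall; the int8 baseline just shows the storage/accuracy trade-off in its simplest form.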
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
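The rotate-then-quantize idea behind RHT-based schemes can be sketched in a few lines of NumPy: random sign flips plus a Hadamard transform spread energy evenly across coordinates, which makes uniform low-bit rounding much less lossy. This is an illustrative sketch, not the fused kernels from any repo above; the `hadamard` helper and all parameter choices are assumptions for the demo.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal rows

def rht_quantize(x, bits=4, seed=0):
    # Random sign flips + Hadamard rotation, then uniform low-bit rounding.
    n = x.shape[-1]
    d = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    y = hadamard(n) @ (d * x)
    scale = np.abs(y).max() / (2 ** (bits - 1) - 1)
    return np.round(y / scale).astype(np.int8), scale, d

def rht_dequantize(q, scale, d):
    # Invert the rotation: H is orthogonal, the sign vector is its own inverse.
    return d * (hadamard(q.shape[-1]).T @ (q * scale))

x = np.random.default_rng(1).standard_normal(64)
q, scale, d = rht_quantize(x, bits=4)
err = np.linalg.norm(x - rht_dequantize(q, scale, d)) / np.linalg.norm(x)
print(f"4-bit relative error: {err:.3f}")
```

Production kernels fuse the rotation and rounding into one pass and pack codes below byte granularity; the NumPy version only shows the math.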
Native Windows build of vLLM v0.17.1 with Triton support and TurboQuant KV cache compression — Qwen 3.5, Llama 4, and more. No WSL, no Docker. Pre-built wheel + patchset for MSVC 2022 + CUDA 12.6.
TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU
ROCm/HIP fork of SGLang with TurboQuant tq2/tq3/tq4 KV cache, Triton and radix-cache serving, EAGLE3 speculative decoding, P-EAGLE checkpoint support, and PrismML Bonsai 1-bit GGUF compatibility on gfx1030/RDNA2.
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
Interactive benchmarking tool for TurboQuant KV cache compression. Supports 2-4 bit quantization with real-time metrics.
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.
KV cache with PagedAttention vs. PagedAttention + TurboQuant: experiments across sequence lengths comparing memory, latency, and accuracy.
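To see why KV cache memory dominates at long context, a back-of-the-envelope calculation helps. The shapes below are illustrative (roughly an 8B-class model with grouped-query attention), not taken from any repo above, and the 3-bit figure ignores per-group scale overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Keys and values each store layers * kv_heads * head_dim * seq_len elements.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shapes: 32 layers, 8 KV heads, head_dim 128, 128k-token context.
fp16 = kv_cache_bytes(32, 8, 128, 128_000, 2)
tq3 = kv_cache_bytes(32, 8, 128, 128_000, 3 / 8)  # 3-bit codes, scales ignored
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {tq3 / 2**30:.1f} GiB, "
      f"ratio {fp16 / tq3:.1f}x")
```

At these shapes the fp16 cache alone is over 15 GiB, which is why low-bit KV quantization is what makes 128k contexts fit on consumer GPUs.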
Turbo Index
AI Code Review Memory - learns from your team's bug history and warns when similar patterns appear
Unofficial Python library implementing TurboQuant (Q_mse and Q_prod) KV cache compression for HuggingFace Transformers. One-line activate() API plus a CLI. Tested on Llama 3.1 and Qwen 2.5.
TurboQuant-style embedding compression for RAG: an SDK using fixed rotations, PolarQuant, and QJL residual sketches for compact storage and fast similarity search.