TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% of FP16 speed.
Updated Apr 2, 2026 · Python
Based on Google's TurboQuant (ICLR 2026), Quansloth brings state-of-the-art KV cache compression to local LLM inference. Quansloth is a fully private, air-gapped AI server that runs large-context models natively on consumer hardware.
TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement
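To get a rough sense of what scalar quantization of embeddings buys, here is a minimal NumPy sketch: per-vector int8 codes give a 4x size reduction over fp32 while usually preserving top-1 retrieval. This is a generic baseline for illustration only, not the TurboQuant codec implemented by the repos above; all shapes and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)     # database vectors
queries = rng.standard_normal((10, 64)).astype(np.float32)

# Symmetric per-vector int8 quantization: one fp32 scale per vector,
# so storage drops from 4 bytes to ~1 byte per dimension.
scales = np.abs(db).max(axis=1, keepdims=True) / 127.0
codes = np.round(db / scales).astype(np.int8)

# Top-1 retrieval agreement between exact and dequantized inner products.
exact = np.argmax(db @ queries.T, axis=0)
approx = np.argmax((codes.astype(np.float32) * scales) @ queries.T, axis=0)
recall = float((exact == approx).mean())
print(f"top-1 agreement: {recall:.2f}")
```

Real codecs (product quantization, the TurboQuant schemes) trade more machinery for much higher compression at similar recall; the int8 baseline just shows the storage/accuracy trade-off in its simplest form.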
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
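The rotate-then-quantize idea behind RHT-based schemes can be sketched in a few lines of NumPy: random sign flips plus a Hadamard transform spread energy evenly across coordinates, which makes uniform low-bit rounding much less lossy. This is an illustrative sketch, not the fused kernels from any repo above; the `hadamard` helper and all parameter choices are assumptions for the demo.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal rows

def rht_quantize(x, bits=4, seed=0):
    # Random sign flips + Hadamard rotation, then uniform low-bit rounding.
    n = x.shape[-1]
    d = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    y = hadamard(n) @ (d * x)
    scale = np.abs(y).max() / (2 ** (bits - 1) - 1)
    return np.round(y / scale).astype(np.int8), scale, d

def rht_dequantize(q, scale, d):
    # Invert the rotation: H is orthogonal, the sign vector is its own inverse.
    return d * (hadamard(q.shape[-1]).T @ (q * scale))

x = np.random.default_rng(1).standard_normal(64)
q, scale, d = rht_quantize(x, bits=4)
err = np.linalg.norm(x - rht_dequantize(q, scale, d)) / np.linalg.norm(x)
print(f"4-bit relative error: {err:.3f}")
```

Production kernels fuse the rotation and rounding into one pass and pack codes below byte granularity; the NumPy version only shows the math.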
Native Windows build of vLLM v0.17.1 with Triton support and TurboQuant KV cache compression — Qwen 3.5, Llama 4, and more. No WSL, no Docker. Pre-built wheel + patchset for MSVC 2022 + CUDA 12.6.
TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU
ROCm/HIP fork of SGLang with TurboQuant tq2/tq3/tq4 KV cache, Triton and radix-cache serving, EAGLE3 speculative decoding, P-EAGLE checkpoint support, and PrismML Bonsai 1-bit GGUF compatibility on gfx1030/RDNA2.
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
Interactive benchmarking tool for TurboQuant KV cache compression. Supports 2-4 bit quantization with real-time metrics.
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.
KV cache with PagedAttention vs. PagedAttention + TurboQuant: experiments across sequence lengths comparing memory, latency, and accuracy.
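To see why KV cache memory dominates at long context, a back-of-the-envelope calculation helps. The shapes below are illustrative (roughly an 8B-class model with grouped-query attention), not taken from any repo above, and the 3-bit figure ignores per-group scale overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Keys and values each store layers * kv_heads * head_dim * seq_len elements.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shapes: 32 layers, 8 KV heads, head_dim 128, 128k-token context.
fp16 = kv_cache_bytes(32, 8, 128, 128_000, 2)
tq3 = kv_cache_bytes(32, 8, 128, 128_000, 3 / 8)  # 3-bit codes, scales ignored
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {tq3 / 2**30:.1f} GiB, "
      f"ratio {fp16 / tq3:.1f}x")
```

At these shapes the fp16 cache alone is over 15 GiB, which is why low-bit KV quantization is what makes 128k contexts fit on consumer GPUs.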
Turbo Index
AI Code Review Memory - learns from your team's bug history and warns when similar patterns appear
Unofficial Python library implementing TurboQuant (Q_mse and Q_prod) KV cache compression for HuggingFace Transformers. One-line activate() API plus a CLI. Tested on Llama 3.1 and Qwen 2.5.
TurboQuant-style embedding compression for RAG: an SDK using fixed rotations, PolarQuant, and QJL residual sketches for compact storage and fast similarity search.