A downstream fork of Madreag/turbo3-cuda and TheTom/llama-cpp-turboquant, focused on adaptive KV layout selection, MoE partial offload, and long-context local coding workloads on consumer Blackwell GPUs.
Lineage: TurboQuant paper (Google Research) → TheTom → signalnine → @Madreag → this fork. What this layer adds:
- sm_120 (consumer Blackwell) ptxas-crash workarounds for Windows nvcc 12.9
- TCQ (Trellis Coded Quantization) integrated as a `turbo3_tcq` KV type
- A VRAM-fit auto-selector that probes free GPU memory and picks the most aggressive layer-adaptive K/V promotion mode that fits (mode 1 → 7 → 13 → off)
- MoE offload tuning — `--n-cpu-moe` sweep methodology and validated 16 GB configs for Qwen3.6-35B-A3B
- Long-context depth-sweep validation at d=0/16K/32K/65K/128K rather than d=0 only
I tuned this for one specific stack (RTX 5080 16 GB / Ryzen 9700X / DDR5 / Windows 11), but the code paths apply to any sm_120 setup, and the build / run instructions below cover other system configurations.
The original TurboQuant CUDA work — what makes the turbo* cache types fast at all —
isn't mine. See Acknowledgments.
CUDA toolchain: sm_120 builds require CUDA 12.9.x. In my testing, CUDA 13.x produced garbage output and 13.1 segfaulted in MMQ kernels. A future CUDA 13 release may fix this — until then, pin 12.9.
CUDA implementation of TurboQuant (ICLR 2026) KV cache compression for llama.cpp, targeting NVIDIA GPUs (SM86+).
The KV cache is the memory bottleneck for long-context LLM inference. At 32K+ tokens, the KV cache can exceed the model weights in size, consuming VRAM and bandwidth. TurboQuant compresses KV values from 8.5 bits (q8_0) down to 2-4 bits — slashing memory 4-8x while maintaining quality. The result: longer context, more concurrent users, and on bandwidth-limited GPUs, faster decode.
| Type | Bits/Value | Compression | Best For | Trade-off |
|---|---|---|---|---|
| turbo4 | 4.25 | 3.76x | Best quality | +0.97% PPL, lowest KL divergence |
| turbo3 | 3.125 | 5.12x | Best balance | +1.38% PPL at ctx=512, equals q8_0 at ctx=2048 |
| turbo2 | 2.125 | 7.53x | Long context / speed | +5.35% PPL, but fastest at 32K+ on all GPUs |
| turbo1.5 | 2.00 | 8x | Maximum compression | +8.18% PPL, most memory savings |
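To make the compression column concrete, here's a back-of-envelope sizing sketch (plain C++). The layer/head dimensions are illustrative GQA values, not taken from any model in this README:

```cpp
// KV cache size = 2 (K and V) x layers x KV heads x head_dim x context x bits/8.
// All dimensions below are hypothetical, chosen only to show the scaling.
#include <cstdio>

int main() {
    const double n_layer = 48, n_kv_head = 8, head_dim = 128, n_ctx = 32768;
    const double values = 2.0 * n_layer * n_kv_head * head_dim * n_ctx;

    const double bpv[]  = { 16.0,  8.5,    3.125,    2.125    };
    const char  *name[] = { "f16", "q8_0", "turbo3", "turbo2" };

    for (int i = 0; i < 4; ++i) {
        printf("%-7s %8.2f GiB\n", name[i],
               values * bpv[i] / 8.0 / (1024.0 * 1024.0 * 1024.0));
    }
    // Prints roughly: f16 6.00, q8_0 3.19, turbo3 1.17, turbo2 0.80.
    // The 5.12x / 7.53x ratios in the table above are measured against f16.
}
```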
What This Fork Adds (over TheTom's base implementation)
This fork by @Madreag adds aggressive CUDA kernel optimizations that improve turbo decode by 13-69% at 32K context over the base implementation (verified on 4 GPUs: 5090, 3090 Ti, 3090, 4090M):
| Optimization | Impact |
|---|---|
| 8-wide LUT scoring (turbo3/turbo2) | +4.7% at 32K |
| `nthreads_KQ=8` for all types | up to +17.7% at 32K |
| Sparse V skip (type-adaptive thresholds) | +4.6% at 32K, zero PPL cost |
| `__launch_bounds__(128, 3)` occupancy | +7-13% at 32K |
| Half-precision LUT, `__expf` softmax, L2 prefetch | cumulative ~9% |
At short context, both builds are identical or near-identical. The advantage shows at 32K+ where KV bandwidth dominates — the bigger the context, the larger the gain.
Built on signalnine's pre-rotate-queries architecture with parallel SET_ROWS, native Flash Attention vec_dot, and MMA prefill. All 4 turbo types with 36 asymmetric K/V combinations. Validated across 5 models, 4 GPUs, 1,351+ stability iterations with zero failures.
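For a feel of what two of the table's entries look like in kernel code, here's a hedged toy CUDA fragment, not the fork's actual VEC kernel, showing the `__launch_bounds__(128, 3)` hint and the `__expf` softmax pattern in isolation:

```cuda
// Toy kernel, not the fork's VEC kernel. __launch_bounds__(128, 3) caps the
// block at 128 threads and asks the compiler to budget registers so three
// blocks fit per SM (the 2->3 occupancy change credited with +7-13% at 32K).
__global__ void __launch_bounds__(128, 3)
toy_softmax_weights(const float *scores, float *weights, int n, float max_score) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // __expf is the fast-math exponential used at the softmax sites;
    // the fork's PPL checks reported bit-exact quality with it.
    weights[i] = __expf(scores[i] - max_score);
}
```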
All numbers below are from my own measurements on a single RTX 5080 16 GB / Ryzen 9700X / 96 GB DDR5 / Windows 11 / CUDA 12.9.1 box. Measured with `llama-bench -d <depth>` (decode tg128 at the listed prompt depth), 3 reps each.
A coding-tuned IQ3_M quant of Qwen3.6-27B that fits comfortably at 128K context on 16 GB. KV layout is turbo3_tcq for K and V; the auto-selector picks TURBO_LAYER_ADAPTIVE per depth. Strong decode rate at low-to-mid context (where most editor sessions run) and graceful degrade past 96K — the second daily-driver alongside the MoE path; launch with qwen-turbo.ps1.
| Context depth | Old path (mode 13), t/s | Auto-selector (this fork), t/s | Auto picks |
|---|---|---|---|
| 0 | 40.5 | ~40 | mode 1 |
| 16K | 17.4 | ~26 | mode 1 |
| 32K | 10.6 | ~19 | mode 1 |
| 65K | 6.0 | 17.15 (+186%) | mode 1 |
| 90K | — | 13.56 | mode 1 |
| 131K | 3.2 | 7.30 (+128%) | mode 13 (auto falls back; mode 1 would PCIe-spill) |
The big depth-jump win is the VRAM-fit auto-selector: at d=65K it correctly picks mode 1 (K&V first-4 + last-4 promoted to q8_0), which is +34.7% over the prior SHIP mode-13 baseline at the same depth. At d=131K mode 1 would PCIe-spill, so the auto-selector falls back to mode 13 — graceful degrade rather than a cliff.
Estimate-vs-actual at d=65K: 1510 MiB predicted, 1509.88 MiB allocated (the auto-selector uses the same ggml_row_size formula the allocator uses, so the budgeted size matches reality to ~0.01 MiB).
VRAM at d=128K decode: ~14.4 / 16.0 GB (model 12.0 + KV 1.5 + compute peak ~1.0).
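A simplified sketch of how such a selector can work is below. The real implementation lives in llama-kv-cache.cpp and sizes tensors with ggml_row_size(); the dims, byte math, and per-mode promotion counts here are illustrative stand-ins:

```cpp
// Hedged sketch of the VRAM-fit auto-selector. Every constant below is
// illustrative, not the fork's actual values.
#include <cstdio>

int main() {
    // Illustrative GQA dims (not a specific model card).
    const double n_layer = 48, n_kv_head = 8, head_dim = 128, n_ctx = 131072;
    const double vals_per_layer = 2.0 * n_kv_head * head_dim * n_ctx;   // K + V

    auto layer_bytes = [&](double bpv) { return vals_per_layer * bpv / 8.0; };

    // Cascade, most aggressive first. "promoted" approximates how many
    // K+V layer-equivalents each mode holds at q8_0 instead of turbo3_tcq.
    struct cand { int mode; double promoted; const char *desc; };
    const cand cascade[] = {
        {  1, 8.0, "K&V first-4 + last-4 q8_0" },
        {  7, 4.0, "K-only last-8 q8_0" },
        { 13, 2.0, "V-only first-2 + last-2 q8_0" },
        {  0, 0.0, "promotion off" },
    };

    const double free_vram = 8.0e9;   // probed via ggml_backend_dev_memory()
    const double margin    = 1.0e9;   // reserved 1 GiB compute-peak headroom

    for (const cand &c : cascade) {
        const double kv = (n_layer - c.promoted) * layer_bytes(3.125)   // turbo3_tcq bpv
                        + c.promoted             * layer_bytes(8.5);    // q8_0 bpv
        if (kv + margin <= free_vram) {   // first mode that fits wins
            printf("mode %d (%s): KV %.0f MiB\n", c.mode, c.desc, kv / (1 << 20));
            break;                        // toy numbers pick mode 1, ~6176 MiB
        }
    }
}
```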
35B total / 3B active, with --n-cpu-moe 8 keeping the upper layers' experts on GPU and offloading the first 8 expert layers to CPU. KV layout is turbo3_tcq with the auto-selector enabled.
| Context depth | Decode (t/s) | vs dense 27B at same depth |
|---|---|---|
| 0 | 92.3 | +128% |
| 16K | 75.9 | +193% |
| 32K | 64.2 | +238% |
| 65K | 48.0 | +180% |
| 128K | 31.3 | +329% |
This is the daily-driver config. ~30 t/s sustained at 128K context is what makes a long-context coding-agent workflow actually usable on a 16 GB card.
VRAM at d=128K decode: ~13.3 / 16.0 GB (model on-GPU portion + KV + compute peak). PCIe Gen 5 x16 sits at ~89% saturation during decode (56–61 GB/s of 63 GB/s theoretical).
--n-cpu-moe has sharp phase cliffs. Sweep on UD-Q4_K_XL (20.81 GiB) at d=16K:
| ncmoe | tg32 (t/s) | Notes |
|---|---|---|
| 40 (all CPU) | 36.4 | baseline |
| 20 | 53.2 | safe |
| 16 | 58.9 | sweet spot for the 21 GB file |
| 12 | 36.1 | hit VRAM cliff |
| 8 | 5.9 | catastrophic spill |
APEX-I-Compact's smaller file (16.10 GiB vs 20.81 GiB) lets ncmoe=8 fit, which reduces PCIe traffic enough to hit the higher decode rate above. APEX-I-Quality (Q6_K, 21.25 GiB) needed ncmoe=20 and showed no quality win on a shared 11-test coding harness — dropped from rotation.
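The sweep itself is easy to script. A minimal sketch follows (plain C++ shelling out to llama-bench; the model path is a placeholder, and it assumes your llama-bench build accepts `--n-cpu-moe`; if it doesn't, time llama-server runs instead):

```cpp
// Walk the ncmoe ladder from the table above and bench each value.
// Watch for the cliff: decode collapses one step past the matched value.
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
    const int ladder[] = { 40, 20, 16, 12, 8 };   // same steps as the table
    for (int ncmoe : ladder) {
        std::printf("=== --n-cpu-moe %d ===\n", ncmoe);
        // tg32 at d=16K mirrors the sweep; read the t/s column from the output.
        const std::string cmd =
            "./build/bin/llama-bench -m UD-Q4_K_XL.gguf"      // placeholder path
            " -ngl 999 --n-cpu-moe " + std::to_string(ncmoe) +
            " -d 16384 -n 32";
        if (std::system(cmd.c_str()) != 0) {
            std::printf("run failed (likely a VRAM spill); stop descending\n");
            break;
        }
    }
}
```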
| Quant | File size | Fits at 131K? |
|---|---|---|
| NEO-CODE IQ3_M | 12.0 GiB | ✅ comfortably (~14.4 GB total with KV) |
| UD-Q3_K_XL | 13.5 GiB | ✅ tight |
| IQ4_XS | 14.3 GiB | ❌ ~1.6 GiB over |
| Q4_K_S | 14.8 GiB | ❌ |
| IQ4_NL | 15.0 GiB | ❌ |
| Q4_K_M | 15.7 GiB | ❌ |
| Q5 / Q6 | 19+ GiB | ❌ (5090 territory) |
Every Q4-class quant and above is out of reach on dense 27B at usable 128K context on 16GB. IQ4_XS would need ~7 layers offloaded to CPU which kills decode to ~5 t/s. NEO-CODE IQ3_M was the dense-path ship pick; the 35B-A3B MoE path was added alongside it for cases where I wanted higher t/s at deep context.
PCIe Gen 5 x16 hits ~89% saturation during MoE decode (56–61 GB/s burst against ~63 GB/s theoretical). SM utilization sits at 93–97%. Decode is bound by PCIe traffic from CPU-resident expert weights, not GPU compute. Getting past ~50 t/s sustained at long context on this stack would need more VRAM (fewer experts on CPU = less PCIe traffic), not more clever kernels.
I also profiled the dense 27B SHIP path with ncu: mul_mat_q<IQ3_S> is the hot kernel and is register-bound (254 regs/thread, ~12.5% theoretical occupancy, DRAM throughput <7%). Validated that cp.async / prefetch tricks don't help in this regime — they address memory latency that doesn't exist here.
The fork is tuned on 16 GB but the auto-selector and MoE-offload paths scale cleanly upward. I haven't measured these myself — the numbers below are expected behavior based on how the auto-selector and --n-cpu-moe cliff work. PRs with measured runs from bigger cards are welcome.
Dense 27B / 32B with bigger quants. On 16 GB, IQ3_M was the largest dense quant that fits at 128K. With more VRAM the quant ladder opens up:
| Card | Dense quant headroom at 128K |
|---|---|
| RTX 5080 16 GB | IQ3_M (12.0 GiB), UD-Q3_K_XL (13.5 GiB) — current ship |
| RTX 5090 32 GB / RTX 4090 24 GB | Q4_K_M (15.7 GiB) and IQ4_XS (14.3 GiB) become comfortable; Q5_K_S / Q5_K_M plausible at 128K |
| RTX 6000 Ada 48 GB / A6000 48 GB | Q6_K (~22 GiB) at 128K with full headroom; Q8_0 plausible at 64K |
| A100 80 GB / H100 80 GB | Q8_0 dense at 128K with room for compute peak |
MoE 35B-A3B with less expert offload. --n-cpu-moe is the bandwidth lever — every layer you keep on GPU eliminates that layer's PCIe traffic. APEX-I-Compact (16.10 GiB) on 16 GB needed ncmoe=8. With more VRAM:
| Card | APEX-I-Compact ncmoe | Expected decode gain |
|---|---|---|
| 16 GB | 8 (current ship) | baseline |
| 24 GB | 0–4 | ~30–50% higher decode at deep context (less PCIe traffic) |
| 32 GB+ | 0 (fully on GPU) | PCIe stops being the bottleneck; you're back in pure compute regime |
UD-Q4_K_XL (20.81 GiB) at ncmoe=0 likewise becomes viable on a 24 GB card with room for KV at 128K, and on 32 GB with margin.
Auto-selector goes more aggressive automatically. The TURBO_LAYER_ADAPTIVE auto-selector probes free VRAM at startup and picks the most aggressive K/V promotion mode that fits. On a 16 GB card it falls back to mode 13 past 96K; on 24/32/80 GB cards it should pick mode 1 (K&V first-4 + last-4 q8_0) at every depth — meaning the +35% TG win it delivers at d=65K on 16 GB carries through the entire depth sweep instead of degrading past 96K. No config change needed; the log line confirms which mode was picked:
```
llama_kv_cache: TCQ auto-selected mode 1 (KV 1510 MiB, free 28432 MiB, margin 1024 MiB)
```
If you run this fork on a bigger card, please open a PR or issue with llama-bench -d 0,16384,32768,65536,98304,131072 for whichever model you tested and I'll roll the numbers into this README.
RTX 5090, Qwen 27B:

| Type | Bits/Value | Compression | Short Decode | 32K Decode | PPL ctx=512 | PPL ctx=2048 |
|---|---|---|---|---|---|---|
| q8_0 | 8.5 | 1.88x | 63.40 tok/s | 55.60 | 6.759 | 5.674 |
| turbo4 | 4.25 | 3.76x | 63.70 | 56.73 | 6.825 (+0.97%) | 5.694 |
| turbo3 | 3.125 | 5.12x | 63.55 | 55.84 | 6.852 (+1.38%) | 5.674 (=q8_0) |
| turbo2 | 2.125 | 7.53x | 65.50 | 58.61 | 7.121 (+5.35%) | 5.873 |
| turbo1.5 | 2.00 | 8.0x | 63.13 | 55.16 | 7.312 (+8.18%) | 6.103 |
Speed measured with llama-bench -d 32768 (tg128 @ depth), ±0.3% variance. PPL from wikitext-2, 8 chunks.
Key takeaways from this table:
- turbo2 at 32K beats q8_0 by 5.4% (58.61 vs 55.60) — the long-context champion at 7.5x compression
- turbo4 at 32K beats q8_0 by 2.0% (56.73 vs 55.60) at 3.76x compression, best quality
- turbo3 PPL at ctx=2048 equals q8_0 (5.674 = 5.674) — lossless quality at 5.1x compression
- All types match or beat q8_0 at short context — turbo2 +3.3%, others within 1%
More highlights across models and contexts:
| Result | Numbers |
|---|---|
| turbo2 32K decode | 58.61 tok/s — 5.4% faster than q8_0 at 7.5x compression |
| turbo2 at 256K tokens (Q4_K_M) | 42.57 tok/s — consumer GPU, 8x cheaper KV than f16 |
| Kernel optimization impact (4 GPUs) | +13-69% at 32K vs base implementation, confirmed on 5090/3090 Ti/3090/4090M |
| NIAH retrieval (4 GPUs) | q8_0/turbo3/turbo2 100% on 5090, all types 92% on 3090 Ti |
| Stability across 4 GPUs | 1,351+ iterations, 0 failures, PPL bit-exact |
| Type | bpv | PPL ctx=512 | vs q8_0 | PPL ctx=2048 | vs q8_0 |
|---|---|---|---|---|---|
| q8_0 | 8.5 | 6.759 | — | 5.674 | — |
| turbo4 | 4.25 | 6.825 | +0.97% | 5.694 | +0.34% |
| turbo3 | 3.125 | 6.852 | +1.38% | 5.674 | 0.00% |
| turbo2 | 2.125 | 7.121 | +5.35% | 5.873 | +3.50% |
| turbo1.5 | 2.0 | 7.312 | +8.18% | 6.103 | +7.55% |
| Your priority | Mode | Why | Command |
|---|---|---|---|
| Best balance | turbo3 | q8_0 quality at 5.1x compression | -ctk turbo3 -ctv turbo3 |
| Long context | turbo2 | 32K champion (+5.4% vs q8_0), 42 tok/s at 256K, 7.5x compression | -ctk turbo2 -ctv turbo2 |
| Best quality | turbo4 | +0.97% PPL at 3.76x compression | -ctk turbo4 -ctv turbo4 |
| Maximum compression | turbo1.5 | 8x compression, 212 tok/s MoE | -ctk turbo1.5 -ctv turbo1.5 |
Combining Q4_K_M weight quantization with turbo KV cache compression enables extreme context lengths. Decode speed measured with llama-bench -d [depth] (tg128 @ depth):
| KV Type | bpv | 32K | 65K | 131K | 256K |
|---|---|---|---|---|---|
| turbo4 | 4.25 | 66.33 | 60.41 | 49.06 | OOM |
| turbo3 | 3.125 | 66.88 | 58.37 | 47.36 | 35.38 |
| turbo2 | 2.125 | 70.65 | 63.94 | 51.23 | 42.57 |
| turbo1.5 | 2.00 | 64.77 | 57.99 | 46.38 | 33.40 |
turbo2 is the long-context champion at every depth. At 256K, turbo2 generates 42+ tok/s on a consumer 5090 — a context length where q8_0 would OOM.
PPL impact: Q4_K_M + turbo3 = 7.127 (+1.39% vs q8_0 = 7.030). Safe on 27B+ models.
Warning: Small Q4_K_M models (<10B) may have catastrophic PPL with symmetric turbo K. Use asymmetric (-ctk q8_0 -ctv turbo3) for safety. See TheTom's research.
| Goal | Config | Command |
|---|---|---|
| Maximum short-ctx speed | Q4_K_M weights + turbo3 KV | -m model-Q4_K_M.gguf -ctk turbo3 -ctv turbo3 -fa |
| Maximum long-ctx speed | Q4_K_M weights + turbo2 KV | -m model-Q4_K_M.gguf -ctk turbo2 -ctv turbo2 -fa |
| Best quality | Q6_K weights + turbo4 KV | -m model-Q6_K.gguf -ctk turbo4 -ctv turbo4 -fa |
| Quality-optimal asymmetric | Q6_K weights + K=turbo4/V=q8_0 | -m model-Q6_K.gguf -ctk turbo4 -ctv q8_0 -fa |
| Maximum compression | Q4_K_M weights + turbo1.5 KV | -m model-Q4_K_M.gguf -ctk turbo1.5 -ctv turbo1.5 -fa |
| Boundary V protection | turbo2 V (auto-enabled) | -m model.gguf -ctk turbo3 -ctv turbo2 -fa (Boundary V activates automatically) |
The fork ships three PowerShell scripts at the repo root (compile.ps1, qwen-turbo.ps1,
qwen-moe-turbo.ps1) that capture the exact configuration I run.
They are starting points — adapt paths and flags for your system.
```powershell
# Defaults: CUDA 12.9, sm_120, Ninja, parallel=4
.\compile.ps1

# Force a clean rebuild (e.g. after changing CUDA version)
.\compile.ps1 -Clean
```

You'll need Ninja, CMake ≥ 3.18, the MSVC build tools, and CUDA Toolkit 12.9.x. Edit the top of `compile.ps1` if your nvcc lives somewhere else.
```bash
cmake -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120-real" \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_CUDA_NO_MXFP4=ON \
    -DLLAMA_CURL=OFF \
    -DLLAMA_BUILD_SERVER=ON

cmake --build build --target llama-server llama-cli llama-bench -j$(nproc)
```

`GGML_CUDA_NO_MXFP4=ON` is required on sm_120 — the consumer Blackwell silicon does not implement the MXFP4 PTX instructions, so leaving these kernels enabled fails to build (or builds and crashes ptxas on Windows).
The TCQ KV path and the auto-selector work on any CUDA arch the upstream TurboQuant
fork supported. Build the same way as Path B, just point CMAKE_CUDA_ARCHITECTURES at
your card and drop the MXFP4 gate (it's only needed on sm_120):
```bash
# RTX 3090 / 3090 Ti
-DCMAKE_CUDA_ARCHITECTURES="86-real"

# RTX 4090 / 4090M
-DCMAKE_CUDA_ARCHITECTURES="89-real"
```

The sm_120 ptxas workarounds (the `--ptxas-options=-O0` fallback for some turbo3_* TUs in `ggml/src/ggml-cuda/CMakeLists.txt`) are gated to sm_120 builds and don't slow down older arches.
Tested: CUDA 13.x produces garbage output on sm_120 builds, and 13.1 segfaults inside MMQ kernels. Stick to CUDA 12.9.x until upstream nvcc fixes the codegen issues. If you have a working CUDA-13 build on a different arch, please open an issue.
The two launcher scripts at the repo root document the validated runtime configurations
on a 16 GB card. Both invoke llama-server on 127.0.0.1:8080 with a Claude-/OpenAI-
compatible chat endpoint.
```powershell
.\qwen-turbo.ps1 -Model path\to\Qwen3.6-27B.gguf -Context 131072
```

Defaults: `--cache-type-k turbo3_tcq --cache-type-v turbo3_tcq`, the VRAM-fit auto-selector picks the `TURBO_LAYER_ADAPTIVE` mode, attention sinks on, prompt cache enabled. Override with `-Fit` if you want llama.cpp's automatic CPU-offload-on-overflow behaviour; otherwise the script forces `-ngl 999` so OOM is the hard signal that you need a smaller context.
```powershell
.\qwen-moe-turbo.ps1 -Model path\to\Qwen3.6-35B-A3B-APEX-I-Compact.gguf -NCpuMoE 8
```

Pick `-NCpuMoE` to match your GGUF size on 16 GB:

| GGUF size | -NCpuMoE | Notes |
|---|---|---|
| ~16 GB Q4 (e.g. APEX-I-Compact) | 8 | validated SHIP, 30+ t/s @ d=128K |
| ~21 GB Q4_K (e.g. UD-Q4_K_XL) | 16 | sweet spot for the heavier file |
| ~21 GB Q6_K (e.g. APEX-I-Quality) | 20 | fits but no quality win on shared harness |
The cliff is sharp. Going one step lower than the matched value spills VRAM and decode
collapses (e.g. ncmoe=8 on a 21 GB file → ~6 t/s).
If you'd rather skip the launcher scripts and call llama.cpp directly:
```bash
# Dense, TCQ KV with auto-selector
./build/bin/llama-server -m model.gguf \
    -ngl 999 -c 131072 \
    --flash-attn on \
    --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
    --batch-size 2048 --ubatch-size 1024 \
    --no-mmap --jinja --port 8080

# MoE with expert offload
./build/bin/llama-server -m model.gguf \
    -ngl 999 --n-cpu-moe 8 -c 131072 \
    --flash-attn on \
    --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
    --no-mmap --port 8080
```

Optional knobs (set before launching the server):
| Var | Default | Purpose |
|---|---|---|
| `TURBO_LAYER_ADAPTIVE` | auto-selected | Force a specific layer-adaptive mode (override the auto-selector). 0=disable, 1=K&V first4+last4 q8_0, 7=K-only last8 q8_0, 13=V-only first2+last2 q8_0 |
| `TURBO_SINK_SIZE` | 0 | Number of leading tokens kept at fp16 as attention sinks (use 4 for chat templates with system tokens) |
| `TURBO_NORM_ALPHA_V` | 1.04 | TurboQuant V-cache norm scaling (KLD-optimal for Qwen3 27B) |
| `TURBO_TCQ_ALPHA_V` | 1.04 | TCQ-specific V-cache norm scaling |
| `TURBO_INNERQ` / `TURBO_INNERQ_STRENGTH` | 4096 / 1.0 | InnerQ per-channel calibration window and mix |
Look for `llama_kv_cache: TCQ auto-selected mode N (KV X MiB, free Y MiB, margin 1024 MiB)` in the server log to confirm the auto-selector picked a mode.
The server speaks Anthropic's /v1/messages endpoint. Point any client that accepts
ANTHROPIC_BASE_URL at it:
```bash
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_API_KEY=anything
claude   # or your Anthropic-SDK app
```

OpenAI-compatible (`/v1/chat/completions`) also works — see the existing llama.cpp server docs further down this README.
```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build -j$(nproc)

# turbo3 (best balance — matches q8_0 quality at 5.1x compression)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo3 -ctv turbo3 -fa -ngl 99

# turbo2 (long-context champion — beats q8_0 speed at 32K)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo2 -ctv turbo2 -fa -ngl 99

# turbo1.5 (8x compression, maximum memory savings)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo1.5 -ctv turbo1.5 -fa -ngl 99

# Server mode
./build/bin/llama-server -hf your-model-GGUF -ctk turbo3 -ctv turbo3 -fa -ngl 99 --port 8080

# Asymmetric (different K and V types)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo4 -ctv turbo3 -fa -ngl 99
```

Notes:
- `-fa` enables Flash Attention (required for native turbo decode)
- Use `--no-mmap` on WSL2 to disable mmap (avoids GPU stalls from page cache)
- Adjust `-DCMAKE_CUDA_ARCHITECTURES` for your GPU: `86` (3090 Ti), `89` (4090), `120` (5090)
Tested across 5 model architectures with head dimensions D=64, 96, 128, 256 on RTX 5090:
| Model | Params | D | GQA | Status | turbo3 tok/s | q8_0 tok/s | Prefill tok/s |
|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | 1.24B | 64 | 4:1 | PASS | 672 | 691 | 38,930 |
| Phi-3.5-mini | 3.82B | 96 | 1:1 | FALLBACK | 221* | 247 (f16) | N/A |
| Phi-4-mini | 3.84B | 128 | 3:1 | PASS | 274 | 275 | 18,433 |
| Llama-3.3-8B | 8.03B | 128 | 4:1 | PASS | 177 | 181 | 10,558 |
| Gemma-3-12B | 12.2B | 256 | 2:1 | PASS | 106 | 91 | 6,632 |
* D=96: graceful fallback to non-FA attention. Slower but correct — not a crash.
The VEC Flash Attention kernel supports D=64, D=128, D=256 (D % 64 == 0 required). Models with other head dimensions (e.g., D=96) fall back to standard mul_mat attention automatically — slower but fully functional.
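In sketch form, the dispatch described above amounts to a supported-set check (helper names here are illustrative, not the fork's actual symbols):

```cpp
// Illustrative dispatch guard: native VEC flash attention is only
// instantiated for the supported head dims; anything else takes the
// correct-but-slower mul_mat fallback.
#include <cstdio>

static bool fa_vec_supported(int D) {
    return D == 64 || D == 128 || D == 256;   // D % 64 == 0, capped at 256
}

void dispatch_attention(int D) {
    if (fa_vec_supported(D)) {
        std::printf("D=%d: native VEC flash-attention kernel\n", D);
    } else {
        // e.g. D=80/96/112 land here
        std::printf("D=%d: fallback to standard mul_mat attention\n", D);
    }
}

int main() {
    for (int D : { 64, 96, 128, 256 }) dispatch_attention(D);
}
```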
Validated on 4 NVIDIA GPUs across 3 architecture generations, 1,351+ total stability iterations, zero failures:
| GPU | SM | VRAM | Stability | PPL Drift | turbo2 > q8_0 at 32K? |
|---|---|---|---|---|---|
| RTX 5090 | SM120 | 32 GB | 340+ iterations | None | Yes (58.61 vs 55.60) |
| RTX 3090 Ti (OC) | SM86 | 24 GB | 486+ iterations, 48 PPL checks | Bit-exact | Yes (81.58 vs 77.44) |
| RTX 3090 | SM86 | 24 GB | 100+ iterations | PPL bit-exact | Yes (63.12 vs 61.0) |
| RTX 4090M | SM89 | 16 GB | 425+ iterations, 14+ PPL checks | Bit-exact | Yes (52.7 vs 52.0) |
RTX 3090 Ti (OC):

| Type | bpv | Short | 32K | 64K | PPL ctx=512 |
|---|---|---|---|---|---|
| q8_0 | 8.5 | 91.01 | 77.44 | OOM | 8.525 |
| turbo4 | 4.25 | 90.03 | 75.55 | OOM | 8.634 |
| turbo3 | 3.125 | 90.35 | 75.01 | 61.47 | 8.624 |
| turbo2 | 2.125 | 90.75 | 81.58 | 72.79 | 8.747 |
| turbo1.5 | 2.00 | 90.13 | 74.85 | 63.44 | 9.402 |
turbo2 at 32K = 81.58 tok/s — beats q8_0 (77.44) by 5.3% at 7.5x compression. turbo2 64K = 72.79 tok/s where q8_0 OOMs. K=turbo3/V=q8_0 PPL (8.515) beats pure q8_0 (8.525) — K compression is free. OC: +100 core, +2200 mem (golden sample), 516W. Speed measured with -d flag (tg128 @ depth), ±0.3% variance.
NIAH (25 tests, 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%. With sufficient token budget, all types converge — remaining failures at 32K/64K depth 10% are model-specific, not turbo degradation.
RTX 4090M (laptop, 16 GB):

| Type | bpv | Short | 32K | PPL ctx=512 |
|---|---|---|---|---|
| q8_0 | 8.5 | 55.5 | 52.0 | 9.374 |
| turbo4 | 4.25 | 55.9 | 52.4 | 9.535 |
| turbo3 | 3.125 | 55.7 | 49.0 | 9.683 |
| turbo2 | 2.125 | 55.9 | 52.7 | 9.584 |
| turbo1.5 | 2.00 | 55.7 | 48.3 | 10.394 |
All types ~55-56 tok/s at short context. turbo2 at 32K matches q8_0 (52.7 vs 52.0) on a 16GB laptop GPU. Max context capped at 32K (65K crashes WSL2 OOM). Speed measured with -d flag (tg128 @ depth). NIAH (max_tokens=4000): q8_0=turbo3=100%, turbo2=95%, turbo1.5=50%.
| Model | Params | D | turbo2 32K | q8_0 32K | Advantage |
|---|---|---|---|---|---|
| Phi-4-mini | 3.84B | 128 | 182.50 | 139.72 | +31% |
| Llama-3.3-8B | 8.03B | 128 | 131.64 | 117.73 | +12% |
| Gemma-3-12B | 12.2B | 256 | 104.50 | 95.76 | +9% |
| Qwen 27B | 26.9B | 256 | 58.61 | 55.60 | +5% |
turbo2 advantage scales with bandwidth-boundedness: smaller models benefit more.
| Type | KL Divergence | Top-1 Agreement | Delta-p RMS |
|---|---|---|---|
| q8_0 | 0.000408 | 100.0% | 0.0153 |
| turbo4 | 0.006485 | 99.0% | 0.0488 |
| turbo3 | 0.012495 | 93.0% | 0.0664 |
| turbo2 | 0.032700 | 91.0% | 0.1146 |
| turbo1.5 | 0.062681 | 88.0% | 0.1502 |
| Context | q8_0 | turbo4 | turbo3 | turbo2 | turbo1.5 |
|---|---|---|---|---|---|
| pp512 | 3,512 | 3,548 | 3,547 | 3,649 | 3,577 |
| pp4096 | 3,457 | 3,494 | 3,495 | 3,452 | 3,467 |
| pp8192 | 3,390 | 3,390 | 3,414 | 3,394 | 3,394 |
| pp16384 | 3,347 | 3,304 | 3,304 | 3,304 | 3,304 |
| pp32768 | 2,839 | 2,815 | 2,801 | 2,805 | 2,808 |
Prefill auto-dequants turbo→fp16 and uses MMA/TILE kernels. All types track q8_0 with negligible overhead.
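Conceptually, the prefill path is a one-shot dequant pass followed by the ordinary fp16 matmul. The sketch below uses a stand-in block layout (a q4_0-style scale plus 4-bit codes); the real turbo3 format packs a trellis/LUT codebook and is not reproduced here:

```cuda
// Stand-in sketch of the prefill idea above: dequantize quantized KV blocks
// to fp16 once, then let the ordinary MMA/TILE fp16 kernels do the matmul.
#include <cuda_fp16.h>
#include <cstdint>

struct fake_turbo_block {   // placeholder layout, NOT the real turbo3 format
    half    d;              // per-block scale
    uint8_t qs[16];         // 32 packed 4-bit codes
};

__global__ void dequant_to_f16(const fake_turbo_block *src, half *dst, int n_blocks) {
    const int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= n_blocks) return;
    const float d = __half2float(src[b].d);
    for (int i = 0; i < 16; ++i) {
        const int q = src[b].qs[i];
        dst[b * 32 + 2 * i + 0] = __float2half(d * float((q & 0x0F) - 8));
        dst[b * 32 + 2 * i + 1] = __float2half(d * float((q >> 4) - 8));
    }
}
// After this pass the prefill matmul sees plain fp16 K/V, so the optimized
// MMA kernels need no turbo-specific code, matching the table's ~q8_0 parity.
```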
| Metric | Sparse V ON | Sparse V OFF | Delta |
|---|---|---|---|
| turbo3 PPL ctx=512 | 6.7251 | 6.7251 | 0.000 |
| turbo3 32K speed | +4.6% | baseline | +4.6% |
Sparse V skips V dequantization for attention positions with negligible weight. Proven zero quality impact via controlled A/B test (PPL bit-identical). Type-adaptive thresholds: 5e-3 for turbo3/turbo4, 1e-2 for turbo2/turbo1.5.
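The mechanism is easiest to see as a guard at the top of the V accumulation (illustrative device fragment, not the fork's kernel; the half-typed row stands in for a real quantized V block):

```cuda
// Skip before V is ever touched. Thresholds per the A/B test above:
// 5e-3 for turbo3/turbo4, 1e-2 for turbo2/turbo1.5.
#include <cuda_fp16.h>

__device__ __forceinline__ float sparse_v_threshold(bool coarse_type /* turbo2 or 1.5 */) {
    return coarse_type ? 1e-2f : 5e-3f;
}

__device__ void accumulate_v(float kq_weight, float threshold,
                             const half *v_row,   // stands in for a quantized V block
                             float *acc, int D) {
    if (kq_weight < threshold) {
        return;   // negligible attention weight: never read or dequantize V
    }
    for (int i = 0; i < D; ++i) {
        acc[i] += kq_weight * __half2float(v_row[i]);
    }
}
```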
| K \ V | q8_0 | turbo4 | turbo3 | turbo2 |
|---|---|---|---|---|
| q8_0 | 6.6395 | 6.6935 | 6.6885 | 6.8630 |
| turbo4 | 6.6580 | 6.7102 | 6.7088 | 6.8821 |
| turbo3 | 6.6698 | 6.7259 | 6.7251 | 6.8849 |
| turbo2 | 6.8168 | 6.8687 | 6.8429 | 7.0396 |
V type dominates PPL (columns vary more than rows). K compression is nearly free — K=turbo3/V=q8_0 is almost identical to q8_0/q8_0.
- Best quality-per-bit: `K=turbo4/V=q8_0` asymmetric config actually beats pure q8_0 PPL (6.155 vs 6.162 at ctx=2048 on 9B) while using less memory.
- Layer-adaptive mode 2: `TURBO_LAYER_ADAPTIVE=2` closes 40% of the turbo3-to-q8_0 PPL gap at zero performance cost.
- Boundary V protection: Auto-enabled when using `-ctv turbo2` (mode 12). Protects first4+last4 layers with q8_0-V, recovers 37-91% of the turbo2-to-turbo3 quality gap. Opt-out: `TURBO_LAYER_ADAPTIVE=0`.
- Q4_K_M stacking: Safe on 27B+ models (PPL +1.39%). For small Q4_K_M models (<10B), use `-ctk q8_0 -ctv turbo3` to avoid catastrophic PPL from double quantization noise in K.
- Head dimension: Only D∈{64, 128, 256} use native Flash Attention. D=80, D=96, D=112, and others gracefully fall back to mul_mat attention (slower but correct).
- SM120 D=256 LUT: Due to a confirmed NVIDIA compiler bug (NVBUG 5218000, NVBUG 5288270), the LUT scoring optimization is automatically disabled for D=256 models on SM120 (RTX 5090). The VEC kernel uses vec_dot scoring instead — same speed, correct output, zero PPL impact. D=64 and D=128 models use LUT normally. Tested across CUDA 12.8 through 13.2 — all affected. Will re-enable when NVIDIA fixes SM120 codegen.
- Attention sinks: Implemented but provide 0% PPL improvement across all tested configurations. Warning: `TURBO_SINK_SIZE` values {1, 4, 16} crash on SM89 (RTX 4090). Sizes {0, 2, 8} work. SM86 and SM120 are unaffected.
- V sinks: Dead end — register pressure causes a 12.7% speed regression at 32K.
- FP4 tensor core acceleration: Not viable. Q values are too small for E2M1 (99.5% map to zero), and no mixed fp16×E2M1 MMA instruction exists on SM120.
- Known Gemma 3 issues: Gibberish after context shift and slow quantized KV cache are upstream llama.cpp bugs, not TurboQuant-specific.
Measured by comparing the base TurboQuant implementation against the optimized fork on the same GPU, same model, back-to-back. All speed with -d flag (tg128 @ depth).
RTX 5090 (Qwen 27B):

| Type | Before | After | Improvement |
|---|---|---|---|
| Short (all types) | 63-65 | 63-65 | ~tie |
| turbo4 32K | 38.88 | 56.73 | +45.9% |
| turbo3 32K | 46.62 | 55.84 | +19.8% |
| turbo2 32K | 51.69 | 58.61 | +13.4% |
RTX 3090:

| Type | Before | After | Improvement |
|---|---|---|---|
| q8_0 32K | 56.91 | 61.0 | +7.2% |
| turbo4 32K | 35.63 | 60.28 | +69% |
| turbo3 32K | 44.79 | 56.82 | +27% |
| turbo2 32K | 53.21 | 63.12 | +19% |
| turbo3 64K | 33.43 | 49.27 | +47% |
| turbo2 64K | 42.45 | 56.91 | +34% |
RTX 4090M:

| Type | Before | After | Improvement |
|---|---|---|---|
| Short (all types) | 55-56 | 55-56 | ~tie |
| q8_0 32K | 48.2 | 52.0 | +8% |
| turbo4 32K | 34.5 | 52.4 | +52% |
| turbo3 32K | 40.3 | 49.0 | +22% |
| turbo2 32K | 44.9 | 52.7 | +17% |
Pattern across 4 GPUs: Short context is identical or near-identical (weight-loading bound). Optimizations show at 32K+ where KV bandwidth dominates — LUT scoring, nthreads_KQ=8, and sparse V skip reduce per-token KV access cost. turbo4 benefits most (+46-68%) because its larger KV amplifies the unoptimized dequant cost. Advantage grows with context depth: 32K → 64K shows +34-47% on the 3090.
| Metric | Before | After | Delta |
|---|---|---|---|
| q8_0 PPL 512 | 6.7590 | 6.7590 | identical |
| turbo3 PPL 512 | 6.8380 | 6.8522 | +0.2% |
| turbo3 PPL 2048 | 5.6997 | 5.6744 (=q8_0) | -0.4% (better) |
q8_0 identical. Optimized turbo3 at ctx=2048 equals q8_0 exactly (5.6744 = 5.6744).
This Layer — Adaptive Blackwell (@craftogrammer)
Tuning + integration on top of Madreag's TurboQuant CUDA fork, focused on consumer Blackwell (sm_120, RTX 5080 16 GB) and long-context coding-agent workflow:
Blackwell silicon support:
- sm_120 + Windows nvcc 12.9 ptxas-crash workarounds (`__noinline__` on q4_0 / turbo3_tcq helpers; `--ptxas-options=-O0` fallback for `turbo3_0` and `turbo3_tcq` TUs; MXFP4 paths gated behind `GGML_CUDA_NO_MXFP4`)
- `wgmma` / `setmaxnreg` confirmed unavailable on consumer Blackwell; `cp.async`, `mbarrier`, TMA, and `prefetch.global.L2` (lowers to `CCTL.E.PF2` SASS) verified available
- Pinned to `120-real` (avoid silent 12X→12Xa coercion that targets datacenter-only ops)
TCQ KV path:
- `turbo3_tcq` cache type integrated as a same-type and mixed-pair (`turbo3_tcq` ↔ `q8_0`) attention path; D=128/256 dispatch; FWHT groups + attention-sink capture
- Inline V dequantization + byte-pair vectorization in the same-type FA TU (cumulative +5.1% / +9.9% / +13.0% TG at d=16K / 32K / 64K)
- `K_set_rows` backtrace in dynamic SMEM (drops a 128 MiB scratch alloc)
Auto-selection + adaptive layout:
- VRAM-fit auto-selector in `llama-kv-cache.cpp` — probes `ggml_backend_dev_memory`, estimates per-mode KV bytes with the same `ggml_row_size` formula the allocator uses, picks the most aggressive `TURBO_LAYER_ADAPTIVE` mode that fits under free VRAM minus a 1 GiB compute-peak margin; predicted-vs-actual 1510 / 1509.88 MiB at d=65K
- Mode 1 (K&V first-4 + last-4 q8_0) → mode 7 (K-only last-8 q8_0) → mode 13 (V-only first-2 + last-2) → off cascade
MoE offload tuning:
- `--n-cpu-moe` sweep methodology validated for Qwen3.6-35B-A3B on 16 GB; APEX-I-Compact (16 GB Q4) at `ncmoe=8` is the SHIP MoE config (~30 t/s @ d=128K)
Validation:
- Long-context depth-sweep harness at d=0/16K/32K/65K/128K (rather than the d=0-only numbers most posts report)
- ncu-profiled the SHIP decode path: `mul_mat_q<IQ3_S>` is register-bound (254 regs/thread, ~12.5% theoretical occupancy) — validated that cp.async / prefetch tricks don't help
- Dropped optimizations that didn't survive clean rebench (e.g. a `TURBO_SPARSE_V_THRESHOLD` runtime knob caused a 32% decode regression — reverted to `constexpr 1e-6f`)
CUDA kernel optimizations, cross-GPU validation, and quality testing by @Madreag:
Kernel Optimizations:
- 8-wide LUT scoring for turbo3/turbo2 — 2 qs bytes per iteration, +4.7% at 32K
- Half-precision shared memory LUT (float→half) — halves shmem bandwidth, +2.45% at 32K
- `__expf` fast-math softmax — all 5 sites in VEC kernel, +3.69% at 32K, PPL bit-exact
- `nthreads_KQ=8` for all turbo types — 4 interleaved dots/warp, up to +17.7% at 32K
- `static constexpr __device__` centroid arrays — register-allocated, 0 latency
- L2 prefetch hints in VEC decode loop — +2.9% at 32K
- `__launch_bounds__(128, 3)` occupancy fix — 2→3 blocks/SM, +7-13% at 32K
- Sparse V threshold escalation (1e-6→5e-3/1e-2) — type-adaptive, +5-28% at 32K, PPL bit-exact
- D=256 LUT disable for SM120 — workaround for NVIDIA codegen bug (NVBUG 5218000/5288270)
- Block-128 CUDA validation — turbo3 5.12x compression, turbo2 7.53x
Architecture & Features:
- All 4 turbo types ported to CUDA (turbo4, turbo3, turbo2, turbo1.5)
- 36 asymmetric K×V combinations with full VEC template instances
- 15 layer-adaptive modes (KV ordinal-based, hybrid architecture compatible)
- Graph-compatible attention sinks (`__device__` + `cudaMemcpyAsync`)
- D=64/128/256 FA dispatch with graceful D=96 fallback
Validation:
- 1,351+ stability iterations across 4 NVIDIA GPUs (SM86×2/SM89/SM120), zero failures
- 5-model architecture sweep (D=64/96/128/256, GQA 1:1 to 4:1)
- NIAH quality testing across 4 GPUs (4K-64K): q8_0/turbo3 100% on 5090, 3090, 4090M; all types 92% on 3090 Ti
- Extreme context: turbo2 at 256K = 42.57 tok/s on consumer RTX 5090
- TheTom — Metal implementation, turbo4 resurrection (7 bugs fixed), asymmetric K/V discovery, turbo3 norm correction, block-128 storage research, sparse V concept, quality validation methodology
- signalnine — Original CUDA port of TurboQuant for llama.cpp (PR #3 to TheTom's repo), InnerQ per-channel equalization
- spiritbuun — turbo4 norm correction (separate CUDA fork), inverse FWHT prefill optimization
- HyperionMS2040 — Block-128 SET_ROWS warp-to-block mapping fix (`7cb6edb`), validated PPL-identical on SM86
TurboQuant: Online Vector Quantization for KV Cache Compression — Google Research, ICLR 2026.
Below is the original llama.cpp README.
LLM inference in C/C++
- Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.
- guide : using the new WebUI of llama.cpp
- guide : running gpt-oss with llama.cpp
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for the `gpt-oss` model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
- Multimodal support arrived in `llama-server`: #12898 | documentation
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Hugging Face Inference Endpoints now support GGUF out of the box! ggml-org#9669
- Hugging Face GGUF editor: discussion | tool
Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:
- Install `llama.cpp` using brew, nix or winget
- Run with Docker - see our Docker documentation
- Download pre-built binaries from the releases page
- Build from source by cloning this repository - check out our build guide
Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
Example command:
```bash
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
The llama.cpp project is the main playground for developing new features for the ggml library.
Models
Typically finetunes of the base models below are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md
- LLaMA 🦙
- LLaMA 2 🦙🦙
- LLaMA 3 🦙🦙🦙
- Mistral 7B
- Mixtral MoE
- DBRX
- Jamba
- Falcon
- Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- Vigogne (French)
- BERT
- Koala
- Baichuan 1 & 2 + derivations
- Aquila 1 & 2
- Starcoder models
- Refact
- MPT
- Bloom
- Yi models
- StableLM models
- Deepseek models
- Qwen models
- PLaMo-13B
- Phi models
- PhiMoE
- GPT-2
- Orion 14B
- InternLM2
- CodeShell
- Gemma
- Mamba
- Grok-1
- Xverse
- Command-R models
- SEA-LION
- GritLM-7B + GritLM-8x7B
- OLMo
- OLMo 2
- OLMoE
- Granite models
- GPT-NeoX + Pythia
- Snowflake-Arctic MoE
- Smaug
- Poro 34B
- Bitnet b1.58 models
- Flan T5
- Open Elm models
- ChatGLM3-6b + ChatGLM4-9b + GLMEdge-1.5b + GLMEdge-4b
- GLM-4-0414
- SmolLM
- EXAONE-3.0-7.8B-Instruct
- FalconMamba Models
- Jais
- Bielik-11B-v2.3
- RWKV-7
- RWKV-6
- QRWKV-6
- GigaChat-20B-A3B
- Trillion-7B-preview
- Ling models
- LFM2 models
- Hunyuan models
- BailingMoeV2 (Ring/Ling 2.0) models
Bindings
- Python: ddh0/easy-llama
- Python: abetlen/llama-cpp-python
- Go: go-skynet/go-llama.cpp
- Node.js: withcatai/node-llama-cpp
- JS/TS (llama.cpp server client): lgrammel/modelfusion
- JS/TS (Programmable Prompt Engine CLI): offline-ai/cli
- JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
- Typescript/Wasm (nicer API, available on npm): ngxson/wllama
- Ruby: yoshoku/llama_cpp.rb
- Rust (more features): edgenai/llama_cpp-rs
- Rust (nicer API): mdrokz/rust-llama.cpp
- Rust (more direct bindings): utilityai/llama-cpp-rs
- Rust (automated build from crates.io): ShelbyJenkins/llm_client
- C#/.NET: SciSharp/LLamaSharp
- C#/VB.NET (more features - community license): LM-Kit.NET
- Scala 3: donderom/llm4s
- Clojure: phronmophobic/llama.clj
- React Native: mybigday/llama.rn
- Java: kherud/java-llama.cpp
- Java: QuasarByte/llama-cpp-jna
- Zig: deins/llama.cpp.zig
- Flutter/Dart: netdur/llama_cpp_dart
- Flutter: xuegao-tzx/Fllama
- PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)
- Guile Scheme: guile_llama_cpp
- Swift srgtuszy/llama-cpp-swift
- Swift ShenghaiWang/SwiftLlama
- Delphi Embarcadero/llama-cpp-delphi
- Go (no CGo needed): hybridgroup/yzma
- Android: llama.android
UIs
(to have a project listed here, it should clearly state that it depends on llama.cpp)
- AI Sublime Text plugin (MIT)
- BonzAI App (proprietary)
- cztomsik/ava (MIT)
- Dot (GPL)
- eva (MIT)
- iohub/collama (Apache-2.0)
- janhq/jan (AGPL)
- johnbean393/Sidekick (MIT)
- KanTV (Apache-2.0)
- KodiBot (GPL)
- llama.vim (MIT)
- LARS (AGPL)
- Llama Assistant (GPL)
- LlamaLib (Apache-2.0)
- LLMFarm (MIT)
- LLMUnity (MIT)
- LMStudio (proprietary)
- LocalAI (MIT)
- LostRuins/koboldcpp (AGPL)
- MindMac (proprietary)
- MindWorkAI/AI-Studio (FSL-1.1-MIT)
- Mobile-Artificial-Intelligence/maid (MIT)
- Mozilla-Ocho/llamafile (Apache-2.0)
- nat/openplayground (MIT)
- nomic-ai/gpt4all (MIT)
- ollama/ollama (MIT)
- oobabooga/text-generation-webui (AGPL)
- PocketPal AI (MIT)
- psugihara/FreeChat (MIT)
- ptsochantaris/emeltal (MIT)
- pythops/tenere (AGPL)
- ramalama (MIT)
- semperai/amica (MIT)
- withcatai/catai (MIT)
- Autopen (GPL)
Tools
- akx/ggify – download PyTorch models from Hugging Face Hub and convert them to GGML
- akx/ollama-dl – download models from the Ollama library to be used directly with llama.cpp
- crashr/gppm – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
- Styled Lines (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
- unslothai/unsloth – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
Infrastructure
- Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
- GPUStack - Manage GPU clusters for running LLMs
- llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- llama-swap - transparent proxy that adds automatic model switching with llama-server
- Kalavai - Crowdsource end to end LLM deployment at any scale
- llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
- LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support
Games
- Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.
| Backend | Target devices |
|---|---|
| Metal | Apple Silicon |
| BLAS | All |
| BLIS | All |
| SYCL | Intel and Nvidia GPU |
| OpenVINO [In Progress] | Intel CPUs, GPUs, and NPUs |
| MUSA | Moore Threads GPU |
| CUDA | Nvidia GPU |
| HIP | AMD GPU |
| ZenDNN | AMD CPU |
| Vulkan | GPU |
| CANN | Ascend NPU |
| OpenCL | Adreno GPU |
| IBM zDNN | IBM Z & LinuxONE |
| WebGPU [In Progress] | All |
| RPC | All |
| Hexagon [In Progress] | Snapdragon |
| VirtGPU | VirtGPU APIR |
The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:
You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:

```bash
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```

By default, the CLI downloads from Hugging Face; you can switch to other options with the environment variable MODEL_ENDPOINT. The MODEL_ENDPOINT must point to a Hugging Face compatible API endpoint.
After downloading a model, use the CLI tools to run it locally - see below.
llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:
- Use the GGUF-my-repo space to convert to GGUF format and quantize model weights to smaller sizes
- Use the GGUF-my-LoRA space to convert LoRA adapters to GGUF format (more info: ggml-org#10123)
- Use the GGUF-editor space to edit GGUF meta data in the browser (more info: ggml-org#9268)
- Use the Inference Endpoints to directly host `llama.cpp` in the cloud (more info: ggml-org#9669)
To learn more about model quantization, read this documentation
- Run in conversation mode

  Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`

  ```bash
  llama-cli -m model.gguf

  # > hi, who are you?
  # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
  #
  # > what is 1+1?
  # Easy peasy! The answer to 1+1 is... 2!
  ```
- Run in conversation mode with custom chat template

  ```bash
  # use the "chatml" template (use -h to see the list of supported templates)
  llama-cli -m model.gguf -cnv --chat-template chatml

  # use a custom template
  llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
  ```
- Constrain the output with a custom grammar

  ```bash
  llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

  # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
  ```
The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
A lightweight, OpenAI API compatible, HTTP server for serving LLMs.
- Start a local HTTP server with default configuration on port 8080

  ```bash
  llama-server -m model.gguf --port 8080

  # Basic web UI can be accessed via browser: http://localhost:8080
  # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  ```
- Support multiple-users and parallel decoding

  ```bash
  # up to 4 concurrent requests, each with 4096 max context
  llama-server -m model.gguf -c 16384 -np 4
  ```

- Enable speculative decoding

  ```bash
  # the draft.gguf model should be a small variant of the target model.gguf
  llama-server -m model.gguf -md draft.gguf
  ```

- Serve an embedding model

  ```bash
  # use the /embedding endpoint
  llama-server -m model.gguf --embedding --pooling cls -ub 8192
  ```

- Serve a reranking model

  ```bash
  # use the /reranking endpoint
  llama-server -m model.gguf --reranking
  ```

- Constrain all outputs with a grammar

  ```bash
  # custom grammar
  llama-server -m model.gguf --grammar-file grammar.gbnf

  # JSON
  llama-server -m model.gguf --grammar-file grammars/json.gbnf
  ```
A tool for measuring the perplexity (and other quality metrics) of a model over a given text.
- Measure the perplexity over a text file

  ```bash
  llama-perplexity -m model.gguf -f file.txt

  # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
  # Final estimate: PPL = 5.4007 +/- 0.67339
  ```

- Measure KL divergence

  ```bash
  # TODO
  ```
- Run default benchmark

  ```bash
  llama-bench -m model.gguf

  # Output:
  # | model           |       size |     params | backend    | threads |  test |             t/s |
  # | --------------- | ---------: | ---------: | ---------- | ------: | ----: | --------------: |
  # | qwen2 1.5B Q4_0 | 885.97 MiB |     1.54 B | Metal,BLAS |      16 | pp512 | 5765.41 ± 20.55 |
  # | qwen2 1.5B Q4_0 | 885.97 MiB |     1.54 B | Metal,BLAS |      16 | tg128 |   197.71 ± 0.81 |
  #
  # build: 3e0ba0e60 (4229)
  ```
- Basic text completion

  ```bash
  llama-simple -m model.gguf

  # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
  ```
- Contributors can open PRs
- Collaborators will be invited based on contributions
- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Any help with managing issues, PRs and projects is very appreciated!
- See good first issues for tasks suitable for first contributions
- Read the CONTRIBUTING.md for more information
- Make sure to read this: Inference at the edge
- A bit of backstory for those who are interested: Changelog podcast
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
- GPT-3
- GPT-3.5 / InstructGPT / ChatGPT:
The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:
```swift
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
import PackageDescription

let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)
```

The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum.
Command-line completion is available for some environments.
```bash
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
```

Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

```bash
$ echo "source ~/.llama-completion.bash" >> ~/.bashrc
```

- yhirose/cpp-httplib - Single-header HTTP server, used by `llama-server` - MIT license
- stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
- nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
- miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
- subprocess.h - Single-header process launching solution for C and C++ - Public domain
