A downstream fork of Madreag/turbo3-cuda and TheTom/llama-cpp-turboquant, focused on adaptive KV layout selection, MoE partial offload, and long-context local coding workloads on consumer Blackwell GPUs.
Lineage: TurboQuant paper (Google Research) → TheTom → signalnine → @Madreag → this fork. What this layer adds:
- sm_120 (consumer Blackwell) ptxas-crash workarounds for Windows nvcc 12.9
- TCQ (Trellis Coded Quantization) integrated as a `turbo3_tcq` KV type
- A VRAM-fit auto-selector that probes free GPU memory and picks the most aggressive layer-adaptive K/V promotion mode that fits (mode 1 → 7 → 13 → off)
- MoE offload tuning — `--n-cpu-moe` sweep methodology and validated 16 GB configs for Qwen3.6-35B-A3B
- Long-context depth-sweep validation at d=0/16K/32K/65K/128K rather than d=0 only
I tuned this for one specific stack (RTX 5080 16 GB / Ryzen 9700X / DDR5 / Windows 11), but the code paths apply to any sm_120 setup, and the build / run instructions below cover other system configurations.
The original TurboQuant CUDA work — what makes the turbo* cache types fast at all —
isn't mine. See Acknowledgments.
CUDA toolchain: sm_120 builds require CUDA 12.9.x. In my testing, CUDA 13.x produced garbage output and 13.1 segfaulted in MMQ kernels. A future CUDA 13 release may fix this — until then, pin 12.9.
CUDA implementation of TurboQuant (ICLR 2026) KV cache compression for llama.cpp, targeting NVIDIA GPUs (SM86+).
The KV cache is the memory bottleneck for long-context LLM inference. At 32K+ tokens, the KV cache can exceed the model weights in size, consuming VRAM and bandwidth. TurboQuant compresses KV values from 8.5 bits (q8_0) down to 2-4 bits — slashing memory 4-8x while maintaining quality. The result: longer context, more concurrent users, and on bandwidth-limited GPUs, faster decode.
| Type | Bits/Value | Compression | Best For | Trade-off |
|---|---|---|---|---|
| turbo4 | 4.25 | 3.76x | Best quality | +0.97% PPL, lowest KL divergence |
| turbo3 | 3.125 | 5.12x | Best balance | +1.38% PPL at ctx=512, equals q8_0 at ctx=2048 |
| turbo2 | 2.125 | 7.53x | Long context / speed | +5.35% PPL, but fastest at 32K+ on all GPUs |
| turbo1.5 | 2.00 | 8x | Maximum compression | +8.18% PPL, most memory savings |
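To make the compression column concrete, here's a back-of-envelope sizing sketch (plain C++). The layer/head dimensions are illustrative GQA values, not taken from any model in this README:

```cpp
// KV cache size = 2 (K and V) x layers x KV heads x head_dim x context x bits/8.
// All dimensions below are hypothetical, chosen only to show the scaling.
#include <cstdio>

int main() {
    const double n_layer = 48, n_kv_head = 8, head_dim = 128, n_ctx = 32768;
    const double values = 2.0 * n_layer * n_kv_head * head_dim * n_ctx;

    const double bpv[]  = { 16.0,  8.5,    3.125,    2.125    };
    const char  *name[] = { "f16", "q8_0", "turbo3", "turbo2" };

    for (int i = 0; i < 4; ++i) {
        printf("%-7s %8.2f GiB\n", name[i],
               values * bpv[i] / 8.0 / (1024.0 * 1024.0 * 1024.0));
    }
    // Prints roughly: f16 6.00, q8_0 3.19, turbo3 1.17, turbo2 0.80.
    // The 5.12x / 7.53x ratios in the table above are measured against f16.
}
```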
What This Fork Adds (over TheTom's base implementation)
This fork by @Madreag adds aggressive CUDA kernel optimizations that improve turbo decode by 13-69% at 32K context over the base implementation (verified on 4 GPUs: 5090, 3090 Ti, 3090, 4090M):
| Optimization | Impact |
|---|---|
| 8-wide LUT scoring (turbo3/turbo2) | +4.7% at 32K |
| `nthreads_KQ=8` for all types | up to +17.7% at 32K |
| Sparse V skip (type-adaptive thresholds) | +4.6% at 32K, zero PPL cost |
| `__launch_bounds__(128, 3)` occupancy | +7-13% at 32K |
| Half-precision LUT, `__expf` softmax, L2 prefetch | cumulative ~9% |
At short context, both builds are identical or near-identical. The advantage shows at 32K+ where KV bandwidth dominates — the bigger the context, the larger the gain.
Built on signalnine's pre-rotate-queries architecture with parallel SET_ROWS, native Flash Attention vec_dot, and MMA prefill. All 4 turbo types with 36 asymmetric K/V combinations. Validated across 5 models, 4 GPUs, 1,351+ stability iterations with zero failures.
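For a feel of what two of the table's entries look like in kernel code, here's a hedged toy CUDA fragment, not the fork's actual VEC kernel, showing the `__launch_bounds__(128, 3)` hint and the `__expf` softmax pattern in isolation:

```cuda
// Toy kernel, not the fork's VEC kernel. __launch_bounds__(128, 3) caps the
// block at 128 threads and asks the compiler to budget registers so three
// blocks fit per SM (the 2->3 occupancy change credited with +7-13% at 32K).
__global__ void __launch_bounds__(128, 3)
toy_softmax_weights(const float *scores, float *weights, int n, float max_score) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // __expf is the fast-math exponential used at the softmax sites;
    // the fork's PPL checks reported bit-exact quality with it.
    weights[i] = __expf(scores[i] - max_score);
}
```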
All numbers below are from my own measurements on a single RTX 5080 16 GB / Ryzen 9700X / 96 GB DDR5 / Windows 11 / CUDA 12.9.1 box. Measured with `llama-bench -d <depth>` (decode tg128 at the listed prompt depth), 3 reps each.
A coding-tuned IQ3_M quant of Qwen3.6-27B that fits comfortably at 128K context on 16 GB. KV layout is turbo3_tcq for K and V; the auto-selector picks TURBO_LAYER_ADAPTIVE per depth. Strong decode rate at low-to-mid context (where most editor sessions run) and graceful degrade past 96K — the second daily-driver alongside the MoE path; launch with qwen-turbo.ps1.
| Context depth | Old path (mode 13), t/s | Auto-selector (this fork), t/s | Auto picks |
|---|---|---|---|
| 0 | 40.5 | ~40 | mode 1 |
| 16K | 17.4 | ~26 | mode 1 |
| 32K | 10.6 | ~19 | mode 1 |
| 65K | 6.0 | 17.15 (+186%) | mode 1 |
| 90K | — | 13.56 | mode 1 |
| 131K | 3.2 | 7.30 (+128%) | mode 13 (auto falls back; mode 1 would PCIe-spill) |
The big depth-jump win is the VRAM-fit auto-selector: at d=65K it correctly picks mode 1 (K&V first-4 + last-4 promoted to q8_0), which is +34.7% over the prior SHIP mode-13 baseline at the same depth. At d=131K mode 1 would PCIe-spill, so the auto-selector falls back to mode 13 — graceful degrade rather than a cliff.
Estimate-vs-actual at d=65K: 1510 MiB predicted, 1509.88 MiB allocated (the auto-selector uses the same ggml_row_size formula the allocator uses, so the budgeted size matches reality to ~0.01 MiB).
VRAM at d=128K decode: ~14.4 / 16.0 GB (model 12.0 + KV 1.5 + compute peak ~1.0).
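A simplified sketch of how such a selector can work is below. The real implementation lives in llama-kv-cache.cpp and sizes tensors with ggml_row_size(); the dims, byte math, and per-mode promotion counts here are illustrative stand-ins:

```cpp
// Hedged sketch of the VRAM-fit auto-selector. Every constant below is
// illustrative, not the fork's actual values.
#include <cstdio>

int main() {
    // Illustrative GQA dims (not a specific model card).
    const double n_layer = 48, n_kv_head = 8, head_dim = 128, n_ctx = 131072;
    const double vals_per_layer = 2.0 * n_kv_head * head_dim * n_ctx;   // K + V

    auto layer_bytes = [&](double bpv) { return vals_per_layer * bpv / 8.0; };

    // Cascade, most aggressive first. "promoted" approximates how many
    // K+V layer-equivalents each mode holds at q8_0 instead of turbo3_tcq.
    struct cand { int mode; double promoted; const char *desc; };
    const cand cascade[] = {
        {  1, 8.0, "K&V first-4 + last-4 q8_0" },
        {  7, 4.0, "K-only last-8 q8_0" },
        { 13, 2.0, "V-only first-2 + last-2 q8_0" },
        {  0, 0.0, "promotion off" },
    };

    const double free_vram = 8.0e9;   // probed via ggml_backend_dev_memory()
    const double margin    = 1.0e9;   // reserved 1 GiB compute-peak headroom

    for (const cand &c : cascade) {
        const double kv = (n_layer - c.promoted) * layer_bytes(3.125)   // turbo3_tcq bpv
                        + c.promoted             * layer_bytes(8.5);    // q8_0 bpv
        if (kv + margin <= free_vram) {   // first mode that fits wins
            printf("mode %d (%s): KV %.0f MiB\n", c.mode, c.desc, kv / (1 << 20));
            break;                        // toy numbers pick mode 1, ~6176 MiB
        }
    }
}
```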
35B total / 3B active, with --n-cpu-moe 8 keeping the upper layers' experts on GPU and offloading the first 8 expert layers to CPU. KV layout is turbo3_tcq with the auto-selector enabled.
| Context depth | Decode (t/s) | vs dense 27B at same depth |
|---|---|---|
| 0 | 92.3 | +128% |
| 16K | 75.9 | +193% |
| 32K | 64.2 | +238% |
| 65K | 48.0 | +180% |
| 128K | 31.3 | +329% |
This is the daily-driver config. ~30 t/s sustained at 128K context is what makes a long-context coding-agent workflow actually usable on a 16 GB card.
VRAM at d=128K decode: ~13.3 / 16.0 GB (model on-GPU portion + KV + compute peak). PCIe Gen 5 x16 sits at ~89% saturation during decode (56–61 GB/s of 63 GB/s theoretical).
--n-cpu-moe has sharp phase cliffs. Sweep on UD-Q4_K_XL (20.81 GiB) at d=16K:
| ncmoe | tg32 (t/s) | Notes |
|---|---|---|
| 40 (all CPU) | 36.4 | baseline |
| 20 | 53.2 | safe |
| 16 | 58.9 | sweet spot for the 21 GB file |
| 12 | 36.1 | hit VRAM cliff |
| 8 | 5.9 | catastrophic spill |
APEX-I-Compact's smaller file (16.10 GiB vs 20.81 GiB) lets ncmoe=8 fit, which reduces PCIe traffic enough to hit the higher decode rate above. APEX-I-Quality (Q6_K, 21.25 GiB) needed ncmoe=20 and showed no quality win on a shared 11-test coding harness — dropped from rotation.
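The sweep itself is easy to script. A minimal sketch follows (plain C++ shelling out to llama-bench; the model path is a placeholder, and it assumes your llama-bench build accepts `--n-cpu-moe`; if it doesn't, time llama-server runs instead):

```cpp
// Walk the ncmoe ladder from the table above and bench each value.
// Watch for the cliff: decode collapses one step past the matched value.
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
    const int ladder[] = { 40, 20, 16, 12, 8 };   // same steps as the table
    for (int ncmoe : ladder) {
        std::printf("=== --n-cpu-moe %d ===\n", ncmoe);
        // tg32 at d=16K mirrors the sweep; read the t/s column from the output.
        const std::string cmd =
            "./build/bin/llama-bench -m UD-Q4_K_XL.gguf"      // placeholder path
            " -ngl 999 --n-cpu-moe " + std::to_string(ncmoe) +
            " -d 16384 -n 32";
        if (std::system(cmd.c_str()) != 0) {
            std::printf("run failed (likely a VRAM spill); stop descending\n");
            break;
        }
    }
}
```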
| Quant | File size | Fits at 131K? |
|---|---|---|
| NEO-CODE IQ3_M | 12.0 GiB | ✅ comfortably (~14.4 GB total with KV) |
| UD-Q3_K_XL | 13.5 GiB | ✅ tight |
| IQ4_XS | 14.3 GiB | ❌ ~1.6 GiB over |
| Q4_K_S | 14.8 GiB | ❌ |
| IQ4_NL | 15.0 GiB | ❌ |
| Q4_K_M | 15.7 GiB | ❌ |
| Q5 / Q6 | 19+ GiB | ❌ (5090 territory) |
Every Q4-class quant and above is out of reach on dense 27B at usable 128K context on 16GB. IQ4_XS would need ~7 layers offloaded to CPU which kills decode to ~5 t/s. NEO-CODE IQ3_M was the dense-path ship pick; the 35B-A3B MoE path was added alongside it for cases where I wanted higher t/s at deep context.
PCIe Gen 5 x16 hits ~89% saturation during MoE decode (56–61 GB/s burst against ~63 GB/s theoretical). SM utilization sits at 93–97%. Decode is bound by PCIe traffic from CPU-resident expert weights, not GPU compute. Getting past ~50 t/s sustained at long context on this stack would need more VRAM (fewer experts on CPU = less PCIe traffic), not more clever kernels.
I also profiled the dense 27B SHIP path with ncu: mul_mat_q<IQ3_S> is the hot kernel and is register-bound (254 regs/thread, ~12.5% theoretical occupancy, DRAM throughput <7%). Validated that cp.async / prefetch tricks don't help in this regime — they address memory latency that doesn't exist here.
The fork is tuned on 16 GB but the auto-selector and MoE-offload paths scale cleanly upward. I haven't measured these myself — the numbers below are expected behavior based on how the auto-selector and --n-cpu-moe cliff work. PRs with measured runs from bigger cards are welcome.
Dense 27B / 32B with bigger quants. On 16 GB, IQ3_M was the largest dense quant that fits at 128K. With more VRAM the quant ladder opens up:
| Card | Dense quant headroom at 128K |
|---|---|
| RTX 5080 16 GB | IQ3_M (12.0 GiB), UD-Q3_K_XL (13.5 GiB) — current ship |
| RTX 5090 32 GB / RTX 4090 24 GB | Q4_K_M (15.7 GiB) and IQ4_XS (14.3 GiB) become comfortable; Q5_K_S / Q5_K_M plausible at 128K |
| RTX 6000 Ada 48 GB / A6000 48 GB | Q6_K (~22 GiB) at 128K with full headroom; Q8_0 plausible at 64K |
| A100 80 GB / H100 80 GB | Q8_0 dense at 128K with room for compute peak |
MoE 35B-A3B with less expert offload. --n-cpu-moe is the bandwidth lever — every layer you keep on GPU eliminates that layer's PCIe traffic. APEX-I-Compact (16.10 GiB) on 16 GB needed ncmoe=8. With more VRAM:
| Card | APEX-I-Compact ncmoe | Expected decode gain |
|---|---|---|
| 16 GB | 8 (current ship) | baseline |
| 24 GB | 0–4 | ~30–50% higher decode at deep context (less PCIe traffic) |
| 32 GB+ | 0 (fully on GPU) | PCIe stops being the bottleneck; you're back in pure compute regime |
UD-Q4_K_XL (20.81 GiB) at ncmoe=0 likewise becomes viable on a 24 GB card with room for KV at 128K, and on 32 GB with margin.
Auto-selector goes more aggressive automatically. The TURBO_LAYER_ADAPTIVE auto-selector probes free VRAM at startup and picks the most aggressive K/V promotion mode that fits. On a 16 GB card it falls back to mode 13 past 96K; on 24/32/80 GB cards it should pick mode 1 (K&V first-4 + last-4 q8_0) at every depth — meaning the +35% TG win it delivers at d=65K on 16 GB carries through the entire depth sweep instead of degrading past 96K. No config change needed; the log line confirms which mode was picked:
```
llama_kv_cache: TCQ auto-selected mode 1 (KV 1510 MiB, free 28432 MiB, margin 1024 MiB)
```
If you run this fork on a bigger card, please open a PR or issue with llama-bench -d 0,16384,32768,65536,98304,131072 for whichever model you tested and I'll roll the numbers into this README.
RTX 5090, Qwen 27B:

| Type | Bits/Value | Compression | Short Decode | 32K Decode | PPL ctx=512 | PPL ctx=2048 |
|---|---|---|---|---|---|---|
| q8_0 | 8.5 | 1.88x | 63.40 tok/s | 55.60 | 6.759 | 5.674 |
| turbo4 | 4.25 | 3.76x | 63.70 | 56.73 | 6.825 (+0.97%) | 5.694 |
| turbo3 | 3.125 | 5.12x | 63.55 | 55.84 | 6.852 (+1.38%) | 5.674 (=q8_0) |
| turbo2 | 2.125 | 7.53x | 65.50 | 58.61 | 7.121 (+5.35%) | 5.873 |
| turbo1.5 | 2.00 | 8.0x | 63.13 | 55.16 | 7.312 (+8.18%) | 6.103 |
Speed measured with llama-bench -d 32768 (tg128 @ depth), ±0.3% variance. PPL from wikitext-2, 8 chunks.
Key takeaways from this table:
- turbo2 at 32K beats q8_0 by 5.4% (58.61 vs 55.60) — the long-context champion at 7.5x compression
- turbo4 at 32K beats q8_0 by 2.0% (56.73 vs 55.60) at 3.76x compression, best quality
- turbo3 PPL at ctx=2048 equals q8_0 (5.674 = 5.674) — lossless quality at 5.1x compression
- All types match or beat q8_0 at short context — turbo2 +3.3%, others within 1%
More highlights across models and contexts:
| Result | Numbers |
|---|---|
| turbo2 32K decode | 58.61 tok/s — 5.4% faster than q8_0 at 7.5x compression |
| turbo2 at 256K tokens (Q4_K_M) | 42.57 tok/s — consumer GPU, 8x cheaper KV than f16 |
| Kernel optimization impact (4 GPUs) | +13-69% at 32K vs base implementation, confirmed on 5090/3090 Ti/3090/4090M |
| NIAH retrieval (4 GPUs) | q8_0/turbo3/turbo2 100% on 5090, all types 92% on 3090 Ti |
| Stability across 4 GPUs | 1,351+ iterations, 0 failures, PPL bit-exact |
| Type | bpv | PPL ctx=512 | vs q8_0 | PPL ctx=2048 | vs q8_0 |
|---|---|---|---|---|---|
| q8_0 | 8.5 | 6.759 | — | 5.674 | — |
| turbo4 | 4.25 | 6.825 | +0.97% | 5.694 | +0.34% |
| turbo3 | 3.125 | 6.852 | +1.38% | 5.674 | 0.00% |
| turbo2 | 2.125 | 7.121 | +5.35% | 5.873 | +3.50% |
| turbo1.5 | 2.0 | 7.312 | +8.18% | 6.103 | +7.55% |
| Your priority | Mode | Why | Command |
|---|---|---|---|
| Best balance | turbo3 | q8_0 quality at 5.1x compression | -ctk turbo3 -ctv turbo3 |
| Long context | turbo2 | 32K champion (+5.4% vs q8_0), 42 tok/s at 256K, 7.5x compression | -ctk turbo2 -ctv turbo2 |
| Best quality | turbo4 | +0.97% PPL at 3.76x compression | -ctk turbo4 -ctv turbo4 |
| Maximum compression | turbo1.5 | 8x compression, 212 tok/s MoE | -ctk turbo1.5 -ctv turbo1.5 |
Combining Q4_K_M weight quantization with turbo KV cache compression enables extreme context lengths. Decode speed measured with llama-bench -d [depth] (tg128 @ depth):
| KV Type | bpv | 32K | 65K | 131K | 256K |
|---|---|---|---|---|---|
| turbo4 | 4.25 | 66.33 | 60.41 | 49.06 | OOM |
| turbo3 | 3.125 | 66.88 | 58.37 | 47.36 | 35.38 |
| turbo2 | 2.125 | 70.65 | 63.94 | 51.23 | 42.57 |
| turbo1.5 | 2.00 | 64.77 | 57.99 | 46.38 | 33.40 |
turbo2 is the long-context champion at every depth. At 256K, turbo2 generates 42+ tok/s on a consumer 5090 — a context length where q8_0 would OOM.
PPL impact: Q4_K_M + turbo3 = 7.127 (+1.39% vs q8_0 = 7.030). Safe on 27B+ models.
Warning: Small Q4_K_M models (<10B) may have catastrophic PPL with symmetric turbo K. Use asymmetric (-ctk q8_0 -ctv turbo3) for safety. See TheTom's research.
| Goal | Config | Command |
|---|---|---|
| Maximum short-ctx speed | Q4_K_M weights + turbo3 KV | -m model-Q4_K_M.gguf -ctk turbo3 -ctv turbo3 -fa |
| Maximum long-ctx speed | Q4_K_M weights + turbo2 KV | -m model-Q4_K_M.gguf -ctk turbo2 -ctv turbo2 -fa |
| Best quality | Q6_K weights + turbo4 KV | -m model-Q6_K.gguf -ctk turbo4 -ctv turbo4 -fa |
| Quality-optimal asymmetric | Q6_K weights + K=turbo4/V=q8_0 | -m model-Q6_K.gguf -ctk turbo4 -ctv q8_0 -fa |
| Maximum compression | Q4_K_M weights + turbo1.5 KV | -m model-Q4_K_M.gguf -ctk turbo1.5 -ctv turbo1.5 -fa |
| Boundary V protection | turbo2 V (auto-enabled) | -m model.gguf -ctk turbo3 -ctv turbo2 -fa (Boundary V activates automatically) |
The fork ships three PowerShell scripts at the repo root (compile.ps1, qwen-turbo.ps1,
qwen-moe-turbo.ps1) that capture the exact configuration I run.
They are starting points — adapt paths and flags for your system.
```powershell
# Defaults: CUDA 12.9, sm_120, Ninja, parallel=4
.\compile.ps1

# Force a clean rebuild (e.g. after changing CUDA version)
.\compile.ps1 -Clean
```

You'll need Ninja, CMake ≥ 3.18, the MSVC build tools, and CUDA Toolkit 12.9.x. Edit the top of `compile.ps1` if your nvcc lives somewhere else.
```bash
cmake -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120-real" \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_CUDA_NO_MXFP4=ON \
    -DLLAMA_CURL=OFF \
    -DLLAMA_BUILD_SERVER=ON

cmake --build build --target llama-server llama-cli llama-bench -j$(nproc)
```

`GGML_CUDA_NO_MXFP4=ON` is required on sm_120 — the consumer Blackwell silicon does not implement the MXFP4 PTX instructions, so leaving these kernels enabled fails to build (or builds and crashes ptxas on Windows).
The TCQ KV path and the auto-selector work on any CUDA arch the upstream TurboQuant
fork supported. Build the same way as Path B, just point CMAKE_CUDA_ARCHITECTURES at
your card and drop the MXFP4 gate (it's only needed on sm_120):
```bash
# RTX 3090 / 3090 Ti
-DCMAKE_CUDA_ARCHITECTURES="86-real"

# RTX 4090 / 4090M
-DCMAKE_CUDA_ARCHITECTURES="89-real"
```

The sm_120 ptxas workarounds (the `--ptxas-options=-O0` fallback for some turbo3_* TUs in `ggml/src/ggml-cuda/CMakeLists.txt`) are gated to sm_120 builds and don't slow down older arches.
Tested: CUDA 13.x produces garbage output on sm_120 builds, and 13.1 segfaults inside MMQ kernels. Stick to CUDA 12.9.x until upstream nvcc fixes the codegen issues. If you have a working CUDA-13 build on a different arch, please open an issue.
The two launcher scripts at the repo root document the validated runtime configurations
on a 16 GB card. Both invoke llama-server on 127.0.0.1:8080 with a Claude-/OpenAI-
compatible chat endpoint.
```powershell
.\qwen-turbo.ps1 -Model path\to\Qwen3.6-27B.gguf -Context 131072
```

Defaults: `--cache-type-k turbo3_tcq --cache-type-v turbo3_tcq`, the VRAM-fit auto-selector picks the `TURBO_LAYER_ADAPTIVE` mode, attention sinks on, prompt cache enabled. Override with `-Fit` if you want llama.cpp's automatic CPU-offload-on-overflow behaviour; otherwise the script forces `-ngl 999` so OOM is the hard signal that you need a smaller context.
```powershell
.\qwen-moe-turbo.ps1 -Model path\to\Qwen3.6-35B-A3B-APEX-I-Compact.gguf -NCpuMoE 8
```

Pick `-NCpuMoE` to match your GGUF size on 16 GB:

| GGUF size | -NCpuMoE | Notes |
|---|---|---|
| ~16 GB Q4 (e.g. APEX-I-Compact) | 8 | validated SHIP, 30+ t/s @ d=128K |
| ~21 GB Q4_K (e.g. UD-Q4_K_XL) | 16 | sweet spot for the heavier file |
| ~21 GB Q6_K (e.g. APEX-I-Quality) | 20 | fits but no quality win on shared harness |
The cliff is sharp. Going one step lower than the matched value spills VRAM and decode
collapses (e.g. ncmoe=8 on a 21 GB file → ~6 t/s).
If you'd rather skip the launcher scripts and call llama.cpp directly:
```bash
# Dense, TCQ KV with auto-selector
./build/bin/llama-server -m model.gguf \
    -ngl 999 -c 131072 \
    --flash-attn on \
    --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
    --batch-size 2048 --ubatch-size 1024 \
    --no-mmap --jinja --port 8080

# MoE with expert offload
./build/bin/llama-server -m model.gguf \
    -ngl 999 --n-cpu-moe 8 -c 131072 \
    --flash-attn on \
    --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
    --no-mmap --port 8080
```

Optional knobs (set before launching the server):
| Var | Default | Purpose |
|---|---|---|
| `TURBO_LAYER_ADAPTIVE` | auto-selected | Force a specific layer-adaptive mode (override the auto-selector). 0=disable, 1=K&V first4+last4 q8_0, 7=K-only last8 q8_0, 13=V-only first2+last2 q8_0 |
| `TURBO_SINK_SIZE` | 0 | Number of leading tokens kept at fp16 as attention sinks (use 4 for chat templates with system tokens) |
| `TURBO_NORM_ALPHA_V` | 1.04 | TurboQuant V-cache norm scaling (KLD-optimal for Qwen3 27B) |
| `TURBO_TCQ_ALPHA_V` | 1.04 | TCQ-specific V-cache norm scaling |
| `TURBO_INNERQ` / `TURBO_INNERQ_STRENGTH` | 4096 / 1.0 | InnerQ per-channel calibration window and mix |
Look for `llama_kv_cache: TCQ auto-selected mode N (KV X MiB, free Y MiB, margin 1024 MiB)` in the server log to confirm the auto-selector picked a mode.
The server speaks Anthropic's /v1/messages endpoint. Point any client that accepts
ANTHROPIC_BASE_URL at it:
```bash
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_API_KEY=anything
claude   # or your Anthropic-SDK app
```

OpenAI-compatible (`/v1/chat/completions`) also works — see the existing llama.cpp server docs further down this README.
```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build -j$(nproc)

# turbo3 (best balance — matches q8_0 quality at 5.1x compression)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo3 -ctv turbo3 -fa -ngl 99

# turbo2 (long-context champion — beats q8_0 speed at 32K)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo2 -ctv turbo2 -fa -ngl 99

# turbo1.5 (8x compression, maximum memory savings)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo1.5 -ctv turbo1.5 -fa -ngl 99

# Server mode
./build/bin/llama-server -hf your-model-GGUF -ctk turbo3 -ctv turbo3 -fa -ngl 99 --port 8080

# Asymmetric (different K and V types)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo4 -ctv turbo3 -fa -ngl 99
```

Notes:
- `-fa` enables Flash Attention (required for native turbo decode)
- Use `--no-mmap` on WSL2 to disable mmap (avoids GPU stalls from page cache)
- Adjust `-DCMAKE_CUDA_ARCHITECTURES` for your GPU: `86` (3090 Ti), `89` (4090), `120` (5090)
Tested across 5 model architectures with head dimensions D=64, 96, 128, 256 on RTX 5090:
| Model | Params | D | GQA | Status | turbo3 tok/s | q8_0 tok/s | Prefill tok/s |
|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | 1.24B | 64 | 4:1 | PASS | 672 | 691 | 38,930 |
| Phi-3.5-mini | 3.82B | 96 | 1:1 | FALLBACK | 221* | 247 (f16) | N/A |
| Phi-4-mini | 3.84B | 128 | 3:1 | PASS | 274 | 275 | 18,433 |
| Llama-3.3-8B | 8.03B | 128 | 4:1 | PASS | 177 | 181 | 10,558 |
| Gemma-3-12B | 12.2B | 256 | 2:1 | PASS | 106 | 91 | 6,632 |
* D=96: graceful fallback to non-FA attention. Slower but correct — not a crash.
The VEC Flash Attention kernel supports D=64, D=128, D=256 (D % 64 == 0 required). Models with other head dimensions (e.g., D=96) fall back to standard mul_mat attention automatically — slower but fully functional.
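In sketch form, the dispatch described above amounts to a supported-set check (helper names here are illustrative, not the fork's actual symbols):

```cpp
// Illustrative dispatch guard: native VEC flash attention is only
// instantiated for the supported head dims; anything else takes the
// correct-but-slower mul_mat fallback.
#include <cstdio>

static bool fa_vec_supported(int D) {
    return D == 64 || D == 128 || D == 256;   // D % 64 == 0, capped at 256
}

void dispatch_attention(int D) {
    if (fa_vec_supported(D)) {
        std::printf("D=%d: native VEC flash-attention kernel\n", D);
    } else {
        // e.g. D=80/96/112 land here
        std::printf("D=%d: fallback to standard mul_mat attention\n", D);
    }
}

int main() {
    for (int D : { 64, 96, 128, 256 }) dispatch_attention(D);
}
```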
Validated on 4 NVIDIA GPUs across 3 architecture generations, 1,351+ total stability iterations, zero failures:
| GPU | SM | VRAM | Stability | PPL Drift | turbo2 > q8_0 at 32K? |
|---|---|---|---|---|---|
| RTX 5090 | SM120 | 32 GB | 340+ iterations | None | Yes (58.61 vs 55.60) |
| RTX 3090 Ti (OC) | SM86 | 24 GB | 486+ iterations, 48 PPL checks | Bit-exact | Yes (81.58 vs 77.44) |
| RTX 3090 | SM86 | 24 GB | 100+ iterations | PPL bit-exact | Yes (63.12 vs 61.0) |
| RTX 4090M | SM89 | 16 GB | 425+ iterations, 14+ PPL checks | Bit-exact | Yes (52.7 vs 52.0) |
RTX 3090 Ti (OC):

| Type | bpv | Short | 32K | 64K | PPL ctx=512 |
|---|---|---|---|---|---|
| q8_0 | 8.5 | 91.01 | 77.44 | OOM | 8.525 |
| turbo4 | 4.25 | 90.03 | 75.55 | OOM | 8.634 |
| turbo3 | 3.125 | 90.35 | 75.01 | 61.47 | 8.624 |
| turbo2 | 2.125 | 90.75 | 81.58 | 72.79 | 8.747 |
| turbo1.5 | 2.00 | 90.13 | 74.85 | 63.44 | 9.402 |
turbo2 at 32K = 81.58 tok/s — beats q8_0 (77.44) by 5.3% at 7.5x compression. turbo2 64K = 72.79 tok/s where q8_0 OOMs. K=turbo3/V=q8_0 PPL (8.515) beats pure q8_0 (8.525) — K compression is free. OC: +100 core, +2200 mem (golden sample), 516W. Speed measured with -d flag (tg128 @ depth), ±0.3% variance.
NIAH (25 tests, 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%. With sufficient token budget, all types converge — remaining failures at 32K/64K depth 10% are model-specific, not turbo degradation.
RTX 4090M (laptop, 16 GB):

| Type | bpv | Short | 32K | PPL ctx=512 |
|---|---|---|---|---|
| q8_0 | 8.5 | 55.5 | 52.0 | 9.374 |
| turbo4 | 4.25 | 55.9 | 52.4 | 9.535 |
| turbo3 | 3.125 | 55.7 | 49.0 | 9.683 |
| turbo2 | 2.125 | 55.9 | 52.7 | 9.584 |
| turbo1.5 | 2.00 | 55.7 | 48.3 | 10.394 |
All types ~55-56 tok/s at short context. turbo2 at 32K matches q8_0 (52.7 vs 52.0) on a 16GB laptop GPU. Max context capped at 32K (65K crashes WSL2 OOM). Speed measured with -d flag (tg128 @ depth). NIAH (max_tokens=4000): q8_0=turbo3=100%, turbo2=95%, turbo1.5=50%.
| Model | Params | D | turbo2 32K | q8_0 32K | Advantage |
|---|---|---|---|---|---|
| Phi-4-mini | 3.84B | 128 | 182.50 | 139.72 | +31% |
| Llama-3.3-8B | 8.03B | 128 | 131.64 | 117.73 | +12% |
| Gemma-3-12B | 12.2B | 256 | 104.50 | 95.76 | +9% |
| Qwen 27B | 26.9B | 256 | 58.61 | 55.60 | +5% |
turbo2 advantage scales with bandwidth-boundedness: smaller models benefit more.
| Type | KL Divergence | Top-1 Agreement | Delta-p RMS |
|---|---|---|---|
| q8_0 | 0.000408 | 100.0% | 0.0153 |
| turbo4 | 0.006485 | 99.0% | 0.0488 |
| turbo3 | 0.012495 | 93.0% | 0.0664 |
| turbo2 | 0.032700 | 91.0% | 0.1146 |
| turbo1.5 | 0.062681 | 88.0% | 0.1502 |
| Context | q8_0 | turbo4 | turbo3 | turbo2 | turbo1.5 |
|---|---|---|---|---|---|
| pp512 | 3,512 | 3,548 | 3,547 | 3,649 | 3,577 |
| pp4096 | 3,457 | 3,494 | 3,495 | 3,452 | 3,467 |
| pp8192 | 3,390 | 3,390 | 3,414 | 3,394 | 3,394 |
| pp16384 | 3,347 | 3,304 | 3,304 | 3,304 | 3,304 |
| pp32768 | 2,839 | 2,815 | 2,801 | 2,805 | 2,808 |
Prefill auto-dequants turbo→fp16 and uses MMA/TILE kernels. All types track q8_0 with negligible overhead.
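Conceptually, the prefill path is a one-shot dequant pass followed by the ordinary fp16 matmul. The sketch below uses a stand-in block layout (a q4_0-style scale plus 4-bit codes); the real turbo3 format packs a trellis/LUT codebook and is not reproduced here:

```cuda
// Stand-in sketch of the prefill idea above: dequantize quantized KV blocks
// to fp16 once, then let the ordinary MMA/TILE fp16 kernels do the matmul.
#include <cuda_fp16.h>
#include <cstdint>

struct fake_turbo_block {   // placeholder layout, NOT the real turbo3 format
    half    d;              // per-block scale
    uint8_t qs[16];         // 32 packed 4-bit codes
};

__global__ void dequant_to_f16(const fake_turbo_block *src, half *dst, int n_blocks) {
    const int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= n_blocks) return;
    const float d = __half2float(src[b].d);
    for (int i = 0; i < 16; ++i) {
        const int q = src[b].qs[i];
        dst[b * 32 + 2 * i + 0] = __float2half(d * float((q & 0x0F) - 8));
        dst[b * 32 + 2 * i + 1] = __float2half(d * float((q >> 4) - 8));
    }
}
// After this pass the prefill matmul sees plain fp16 K/V, so the optimized
// MMA kernels need no turbo-specific code, matching the table's ~q8_0 parity.
```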
| Metric | Sparse V ON | Sparse V OFF | Delta |
|---|---|---|---|
| turbo3 PPL ctx=512 | 6.7251 | 6.7251 | 0.000 |
| turbo3 32K speed | +4.6% | baseline | +4.6% |
Sparse V skips V dequantization for attention positions with negligible weight. Proven zero quality impact via controlled A/B test (PPL bit-identical). Type-adaptive thresholds: 5e-3 for turbo3/turbo4, 1e-2 for turbo2/turbo1.5.
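The mechanism is easiest to see as a guard at the top of the V accumulation (illustrative device fragment, not the fork's kernel; the half-typed row stands in for a real quantized V block):

```cuda
// Skip before V is ever touched. Thresholds per the A/B test above:
// 5e-3 for turbo3/turbo4, 1e-2 for turbo2/turbo1.5.
#include <cuda_fp16.h>

__device__ __forceinline__ float sparse_v_threshold(bool coarse_type /* turbo2 or 1.5 */) {
    return coarse_type ? 1e-2f : 5e-3f;
}

__device__ void accumulate_v(float kq_weight, float threshold,
                             const half *v_row,   // stands in for a quantized V block
                             float *acc, int D) {
    if (kq_weight < threshold) {
        return;   // negligible attention weight: never read or dequantize V
    }
    for (int i = 0; i < D; ++i) {
        acc[i] += kq_weight * __half2float(v_row[i]);
    }
}
```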
| K \ V | q8_0 | turbo4 | turbo3 | turbo2 |
|---|---|---|---|---|
| q8_0 | 6.6395 | 6.6935 | 6.6885 | 6.8630 |
| turbo4 | 6.6580 | 6.7102 | 6.7088 | 6.8821 |
| turbo3 | 6.6698 | 6.7259 | 6.7251 | 6.8849 |
| turbo2 | 6.8168 | 6.8687 | 6.8429 | 7.0396 |
V type dominates PPL (columns vary more than rows). K compression is nearly free — K=turbo3/V=q8_0 is almost identical to q8_0/q8_0.
- Best quality-per-bit: `K=turbo4/V=q8_0` asymmetric config actually beats pure q8_0 PPL (6.155 vs 6.162 at ctx=2048 on 9B) while using less memory.
- Layer-adaptive mode 2: `TURBO_LAYER_ADAPTIVE=2` closes 40% of the turbo3-to-q8_0 PPL gap at zero performance cost.
- Boundary V protection: Auto-enabled when using `-ctv turbo2` (mode 12). Protects first4+last4 layers with q8_0-V, recovers 37-91% of the turbo2-to-turbo3 quality gap. Opt-out: `TURBO_LAYER_ADAPTIVE=0`.
- Q4_K_M stacking: Safe on 27B+ models (PPL +1.39%). For small Q4_K_M models (<10B), use `-ctk q8_0 -ctv turbo3` to avoid catastrophic PPL from double quantization noise in K.
- Head dimension: Only D∈{64, 128, 256} use native Flash Attention. D=80, D=96, D=112, and others gracefully fall back to mul_mat attention (slower but correct).
- SM120 D=256 LUT: Due to a confirmed NVIDIA compiler bug (NVBUG 5218000, NVBUG 5288270), the LUT scoring optimization is automatically disabled for D=256 models on SM120 (RTX 5090). The VEC kernel uses vec_dot scoring instead — same speed, correct output, zero PPL impact. D=64 and D=128 models use LUT normally. Tested across CUDA 12.8 through 13.2 — all affected. Will re-enable when NVIDIA fixes SM120 codegen.
- Attention sinks: Implemented but provide 0% PPL improvement across all tested configurations. Warning: `TURBO_SINK_SIZE` values {1, 4, 16} crash on SM89 (RTX 4090). Sizes {0, 2, 8} work. SM86 and SM120 are unaffected.
- V sinks: Dead end — register pressure causes a 12.7% speed regression at 32K.
- FP4 tensor core acceleration: Not viable. Q values are too small for E2M1 (99.5% map to zero), and no mixed fp16×E2M1 MMA instruction exists on SM120.
- Known Gemma 3 issues: Gibberish after context shift and slow quantized KV cache are upstream llama.cpp bugs, not TurboQuant-specific.
Measured by comparing the base TurboQuant implementation against the optimized fork on the same GPU, same model, back-to-back. All speed with -d flag (tg128 @ depth).
RTX 5090 (Qwen 27B):

| Type | Before | After | Improvement |
|---|---|---|---|
| Short (all types) | 63-65 | 63-65 | ~tie |
| turbo4 32K | 38.88 | 56.73 | +45.9% |
| turbo3 32K | 46.62 | 55.84 | +19.8% |
| turbo2 32K | 51.69 | 58.61 | +13.4% |
RTX 3090:

| Type | Before | After | Improvement |
|---|---|---|---|
| q8_0 32K | 56.91 | 61.0 | +7.2% |
| turbo4 32K | 35.63 | 60.28 | +69% |
| turbo3 32K | 44.79 | 56.82 | +27% |
| turbo2 32K | 53.21 | 63.12 | +19% |
| turbo3 64K | 33.43 | 49.27 | +47% |
| turbo2 64K | 42.45 | 56.91 | +34% |
RTX 4090M:

| Type | Before | After | Improvement |
|---|---|---|---|
| Short (all types) | 55-56 | 55-56 | ~tie |
| q8_0 32K | 48.2 | 52.0 | +8% |
| turbo4 32K | 34.5 | 52.4 | +52% |
| turbo3 32K | 40.3 | 49.0 | +22% |
| turbo2 32K | 44.9 | 52.7 | +17% |
Pattern across 4 GPUs: Short context is identical or near-identical (weight-loading bound). Optimizations show at 32K+ where KV bandwidth dominates — LUT scoring, nthreads_KQ=8, and sparse V skip reduce per-token KV access cost. turbo4 benefits most (+46-68%) because its larger KV amplifies the unoptimized dequant cost. Advantage grows with context depth: 32K → 64K shows +34-47% on the 3090.
| Metric | Before | After | Delta |
|---|---|---|---|
| q8_0 PPL 512 | 6.7590 | 6.7590 | identical |
| turbo3 PPL 512 | 6.8380 | 6.8522 | +0.2% |
| turbo3 PPL 2048 | 5.6997 | 5.6744 (=q8_0) | -0.4% (better) |
q8_0 identical. Optimized turbo3 at ctx=2048 equals q8_0 exactly (5.6744 = 5.6744).
This Layer — Adaptive Blackwell (@craftogrammer)
Tuning + integration on top of Madreag's TurboQuant CUDA fork, focused on consumer Blackwell (sm_120, RTX 5080 16 GB) and long-context coding-agent workflow:
Blackwell silicon support:
- sm_120 + Windows nvcc 12.9 ptxas-crash workarounds (`__noinline__` on q4_0 / turbo3_tcq helpers; `--ptxas-options=-O0` fallback for `turbo3_0` and `turbo3_tcq` TUs; MXFP4 paths gated behind `GGML_CUDA_NO_MXFP4`)
- `wgmma` / `setmaxnreg` confirmed unavailable on consumer Blackwell; `cp.async`, `mbarrier`, TMA, and `prefetch.global.L2` (lowers to `CCTL.E.PF2` SASS) verified available
- Pinned to `120-real` (avoid silent 12X→12Xa coercion that targets datacenter-only ops)
TCQ KV path:
- `turbo3_tcq` cache type integrated as a same-type and mixed-pair (`turbo3_tcq` ↔ `q8_0`) attention path; D=128/256 dispatch; FWHT groups + attention-sink capture
- Inline V dequantization + byte-pair vectorization in the same-type FA TU (cumulative +5.1% / +9.9% / +13.0% TG at d=16K / 32K / 64K)
- `K_set_rows` backtrace in dynamic SMEM (drops a 128 MiB scratch alloc)
Auto-selection + adaptive layout:
- VRAM-fit auto-selector in `llama-kv-cache.cpp` — probes `ggml_backend_dev_memory`, estimates per-mode KV bytes with the same `ggml_row_size` formula the allocator uses, picks the most aggressive `TURBO_LAYER_ADAPTIVE` mode that fits under free VRAM minus a 1 GiB compute-peak margin; predicted-vs-actual 1510 / 1509.88 MiB at d=65K
- Mode 1 (K&V first-4 + last-4 q8_0) → mode 7 (K-only last-8 q8_0) → mode 13 (V-only first-2 + last-2) → off cascade
MoE offload tuning:
- `--n-cpu-moe` sweep methodology validated for Qwen3.6-35B-A3B on 16 GB; APEX-I-Compact (16 GB Q4) at `ncmoe=8` is the SHIP MoE config (~30 t/s @ d=128K)
Validation:
- Long-context depth-sweep harness at d=0/16K/32K/65K/128K (rather than the d=0-only numbers most posts report)
- ncu-profiled the SHIP decode path: `mul_mat_q<IQ3_S>` is register-bound (254 regs/thread, ~12.5% theoretical occupancy) — validated that cp.async / prefetch tricks don't help
- Dropped optimizations that didn't survive clean rebench (e.g. a `TURBO_SPARSE_V_THRESHOLD` runtime knob caused a 32% decode regression — reverted to `constexpr 1e-6f`)
CUDA kernel optimizations, cross-GPU validation, and quality testing by @Madreag:
Kernel Optimizations:
- 8-wide LUT scoring for turbo3/turbo2 — 2 qs bytes per iteration, +4.7% at 32K
- Half-precision shared memory LUT (float→half) — halves shmem bandwidth, +2.45% at 32K
- `__expf` fast-math softmax — all 5 sites in VEC kernel, +3.69% at 32K, PPL bit-exact
- `nthreads_KQ=8` for all turbo types — 4 interleaved dots/warp, up to +17.7% at 32K
- `static constexpr __device__` centroid arrays — register-allocated, 0 latency
- L2 prefetch hints in VEC decode loop — +2.9% at 32K
- `__launch_bounds__(128, 3)` occupancy fix — 2→3 blocks/SM, +7-13% at 32K
- Sparse V threshold escalation (1e-6→5e-3/1e-2) — type-adaptive, +5-28% at 32K, PPL bit-exact
- D=256 LUT disable for SM120 — workaround for NVIDIA codegen bug (NVBUG 5218000/5288270)
- Block-128 CUDA validation — turbo3 5.12x compression, turbo2 7.53x
Architecture & Features:
- All 4 turbo types ported to CUDA (turbo4, turbo3, turbo2, turbo1.5)
- 36 asymmetric K×V combinations with full VEC template instances
- 15 layer-adaptive modes (KV ordinal-based, hybrid architecture compatible)
- Graph-compatible attention sinks (`__device__` + `cudaMemcpyAsync`)
- D=64/128/256 FA dispatch with graceful D=96 fallback
Validation:
- 1,351+ stability iterations across 4 NVIDIA GPUs (SM86×2/SM89/SM120), zero failures
- 5-model architecture sweep (D=64/96/128/256, GQA 1:1 to 4:1)
- NIAH quality testing across 4 GPUs (4K-64K): q8_0/turbo3 100% on 5090, 3090, 4090M; all types 92% on 3090 Ti
- Extreme context: turbo2 at 256K = 42.57 tok/s on consumer RTX 5090
- TheTom — Metal implementation, turbo4 resurrection (7 bugs fixed), asymmetric K/V discovery, turbo3 norm correction, block-128 storage research, sparse V concept, quality validation methodology
- signalnine — Original CUDA port of TurboQuant for llama.cpp (PR #3 to TheTom's repo), InnerQ per-channel equalization
- spiritbuun — turbo4 norm correction (separate CUDA fork), inverse FWHT prefill optimization
- HyperionMS2040 — Block-128 SET_ROWS warp-to-block mapping fix (`7cb6edb`), validated PPL-identical on SM86
TurboQuant: Online Vector Quantization for KV Cache Compression — Google Research, ICLR 2026.
Below is the original llama.cpp README.
LLM inference in C/C++
- Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.
- guide : using the new WebUI of llama.cpp
- guide : running gpt-oss with llama.cpp
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for the `gpt-oss` model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
- Multimodal support arrived in `llama-server`: #12898 | documentation
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Hugging Face Inference Endpoints now support GGUF out of the box! ggml-org#9669
- Hugging Face GGUF editor: discussion | tool
Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:
- Install `llama.cpp` using brew, nix or winget
- Run with Docker - see our Docker documentation
- Download pre-built binaries from the releases page
- Build from source by cloning this repository - check out our build guide
Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
Example command:
```bash
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
The llama.cpp project is the main playground for developing new features for the ggml library.
Models
Typically finetunes of the base models below are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md
- LLaMA 🦙
- LLaMA 2 🦙🦙
- LLaMA 3 🦙🦙🦙
- Mistral 7B
- Mixtral MoE
- DBRX
- Jamba
- Falcon
- Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- Vigogne (French)
- BERT
- Koala
- Baichuan 1 & 2 + derivations
- Aquila 1 & 2
- Starcoder models
- Refact
- MPT
- Bloom
- Yi models
- StableLM models
- Deepseek models
- Qwen models
- PLaMo-13B
- Phi models
- PhiMoE
- GPT-2
- Orion 14B
- InternLM2
- CodeShell
- Gemma
- Mamba
- Grok-1
- Xverse
- Command-R models
- SEA-LION
- GritLM-7B + GritLM-8x7B
- OLMo
- OLMo 2
- OLMoE
- Granite models
- GPT-NeoX + Pythia
- Snowflake-Arctic MoE
- Smaug
- Poro 34B
- Bitnet b1.58 models
- Flan T5
- Open Elm models
- ChatGLM3-6b + ChatGLM4-9b + GLMEdge-1.5b + GLMEdge-4b
- GLM-4-0414
- SmolLM
- EXAONE-3.0-7.8B-Instruct
- FalconMamba Models
- Jais
- Bielik-11B-v2.3
- RWKV-7
- RWKV-6
- QRWKV-6
- GigaChat-20B-A3B
- Trillion-7B-preview
- Ling models
- LFM2 models
- Hunyuan models
- BailingMoeV2 (Ring/Ling 2.0) models
Bindings
- Python: ddh0/easy-llama
- Python: abetlen/llama-cpp-python
- Go: go-skynet/go-llama.cpp
- Node.js: withcatai/node-llama-cpp
- JS/TS (llama.cpp server client): lgrammel/modelfusion
- JS/TS (Programmable Prompt Engine CLI): offline-ai/cli
- JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
- Typescript/Wasm (nicer API, available on npm): ngxson/wllama
- Ruby: yoshoku/llama_cpp.rb
- Rust (more features): edgenai/llama_cpp-rs
- Rust (nicer API): mdrokz/rust-llama.cpp
- Rust (more direct bindings): utilityai/llama-cpp-rs
- Rust (automated build from crates.io): ShelbyJenkins/llm_client
- C#/.NET: SciSharp/LLamaSharp
- C#/VB.NET (more features - community license): LM-Kit.NET
- Scala 3: donderom/llm4s
- Clojure: phronmophobic/llama.clj
- React Native: mybigday/llama.rn
- Java: kherud/java-llama.cpp
- Java: QuasarByte/llama-cpp-jna
- Zig: deins/llama.cpp.zig
- Flutter/Dart: netdur/llama_cpp_dart
- Flutter: xuegao-tzx/Fllama
- PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)
- Guile Scheme: guile_llama_cpp
- Swift srgtuszy/llama-cpp-swift
- Swift ShenghaiWang/SwiftLlama
- Delphi Embarcadero/llama-cpp-delphi
- Go (no CGo needed): hybridgroup/yzma
- Android: llama.android
UIs
(to have a project listed here, it should clearly state that it depends on llama.cpp)
- AI Sublime Text plugin (MIT)
- BonzAI App (proprietary)
- cztomsik/ava (MIT)
- Dot (GPL)
- eva (MIT)
- iohub/collama (Apache-2.0)
- janhq/jan (AGPL)
- johnbean393/Sidekick (MIT)
- KanTV (Apache-2.0)
- KodiBot (GPL)
- llama.vim (MIT)
- LARS (AGPL)
- Llama Assistant (GPL)
- LlamaLib (Apache-2.0)
- LLMFarm (MIT)
- LLMUnity (MIT)
- LMStudio (proprietary)
- LocalAI (MIT)
- LostRuins/koboldcpp (AGPL)
- MindMac (proprietary)
- MindWorkAI/AI-Studio (FSL-1.1-MIT)
- Mobile-Artificial-Intelligence/maid (MIT)
- Mozilla-Ocho/llamafile (Apache-2.0)
- nat/openplayground (MIT)
- nomic-ai/gpt4all (MIT)
- ollama/ollama (MIT)
- oobabooga/text-generation-webui (AGPL)
- PocketPal AI (MIT)
- psugihara/FreeChat (MIT)
- ptsochantaris/emeltal (MIT)
- pythops/tenere (AGPL)
- ramalama (MIT)
- semperai/amica (MIT)
- withcatai/catai (MIT)
- Autopen (GPL)
Tools
- akx/ggify – download PyTorch models from Hugging Face Hub and convert them to GGML
- akx/ollama-dl – download models from the Ollama library to be used directly with llama.cpp
- crashr/gppm – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
- Styled Lines (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
- unslothai/unsloth – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
Infrastructure
- Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
- GPUStack - Manage GPU clusters for running LLMs
- llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- llama-swap - transparent proxy that adds automatic model switching with llama-server
- Kalavai - Crowdsource end to end LLM deployment at any scale
- llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
- LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support
Games
- Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.
| Backend | Target devices |
|---|---|
| Metal | Apple Silicon |
| BLAS | All |
| BLIS | All |
| SYCL | Intel and Nvidia GPU |
| OpenVINO [In Progress] | Intel CPUs, GPUs, and NPUs |
| MUSA | Moore Threads GPU |
| CUDA | Nvidia GPU |
| HIP | AMD GPU |
| ZenDNN | AMD CPU |
| Vulkan | GPU |
| CANN | Ascend NPU |
| OpenCL | Adreno GPU |
| IBM zDNN | IBM Z & LinuxONE |
| WebGPU [In Progress] | All |
| RPC | All |
| Hexagon [In Progress] | Snapdragon |
| VirtGPU | VirtGPU APIR |
The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:
You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:

```bash
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```

By default, the CLI downloads from Hugging Face; you can switch to other options with the environment variable MODEL_ENDPOINT. The MODEL_ENDPOINT must point to a Hugging Face compatible API endpoint.
After downloading a model, use the CLI tools to run it locally - see below.
llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:
- Use the GGUF-my-repo space to convert to GGUF format and quantize model weights to smaller sizes
- Use the GGUF-my-LoRA space to convert LoRA adapters to GGUF format (more info: ggml-org#10123)
- Use the GGUF-editor space to edit GGUF meta data in the browser (more info: ggml-org#9268)
- Use the Inference Endpoints to directly host `llama.cpp` in the cloud (more info: ggml-org#9669)
To learn more about model quantization, read this documentation
- Run in conversation mode

  Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`

  ```bash
  llama-cli -m model.gguf

  # > hi, who are you?
  # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
  #
  # > what is 1+1?
  # Easy peasy! The answer to 1+1 is... 2!
  ```
- Run in conversation mode with custom chat template

  ```bash
  # use the "chatml" template (use -h to see the list of supported templates)
  llama-cli -m model.gguf -cnv --chat-template chatml

  # use a custom template
  llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
  ```
- Constrain the output with a custom grammar

  ```bash
  llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

  # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
  ```
The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
A lightweight, OpenAI API compatible, HTTP server for serving LLMs.
- Start a local HTTP server with default configuration on port 8080

  ```bash
  llama-server -m model.gguf --port 8080

  # Basic web UI can be accessed via browser: http://localhost:8080
  # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  ```
- Support multiple-users and parallel decoding

  ```bash
  # up to 4 concurrent requests, each with 4096 max context
  llama-server -m model.gguf -c 16384 -np 4
  ```

- Enable speculative decoding

  ```bash
  # the draft.gguf model should be a small variant of the target model.gguf
  llama-server -m model.gguf -md draft.gguf
  ```

- Serve an embedding model

  ```bash
  # use the /embedding endpoint
  llama-server -m model.gguf --embedding --pooling cls -ub 8192
  ```

- Serve a reranking model

  ```bash
  # use the /reranking endpoint
  llama-server -m model.gguf --reranking
  ```

- Constrain all outputs with a grammar

  ```bash
  # custom grammar
  llama-server -m model.gguf --grammar-file grammar.gbnf

  # JSON
  llama-server -m model.gguf --grammar-file grammars/json.gbnf
  ```
A tool for measuring the perplexity (and other quality metrics) of a model over a given text.
- Measure the perplexity over a text file

  ```bash
  llama-perplexity -m model.gguf -f file.txt

  # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
  # Final estimate: PPL = 5.4007 +/- 0.67339
  ```

- Measure KL divergence

  ```bash
  # TODO
  ```
- Run default benchmark

  ```bash
  llama-bench -m model.gguf

  # Output:
  # | model           |       size |     params | backend    | threads |  test |             t/s |
  # | --------------- | ---------: | ---------: | ---------- | ------: | ----: | --------------: |
  # | qwen2 1.5B Q4_0 | 885.97 MiB |     1.54 B | Metal,BLAS |      16 | pp512 | 5765.41 ± 20.55 |
  # | qwen2 1.5B Q4_0 | 885.97 MiB |     1.54 B | Metal,BLAS |      16 | tg128 |   197.71 ± 0.81 |
  #
  # build: 3e0ba0e60 (4229)
  ```
- Basic text completion

  ```bash
  llama-simple -m model.gguf

  # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
  ```
- Contributors can open PRs
- Collaborators will be invited based on contributions
- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Any help with managing issues, PRs and projects is very appreciated!
- See good first issues for tasks suitable for first contributions
- Read the CONTRIBUTING.md for more information
- Make sure to read this: Inference at the edge
- A bit of backstory for those who are interested: Changelog podcast
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
- GPT-3
- GPT-3.5 / InstructGPT / ChatGPT:
The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:
```swift
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
import PackageDescription

let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)
```

The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum.
Command-line completion is available for some environments.
```bash
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
```

Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

```bash
$ echo "source ~/.llama-completion.bash" >> ~/.bashrc
```

- yhirose/cpp-httplib - Single-header HTTP server, used by `llama-server` - MIT license
- stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
- nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
- miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
- subprocess.h - Single-header process launching solution for C and C++ - Public domain
