
llama.cpp-adaptive-turboquant

A downstream fork of Madreag/turbo3-cuda and TheTom/llama-cpp-turboquant, focused on adaptive KV layout selection, MoE partial offload, and long-context local coding workloads on consumer Blackwell GPUs.

Lineage: TurboQuant paper (Google Research) → TheTom / signalnine → @Madreag → this fork. What this layer adds:

  • sm_120 (consumer Blackwell) ptxas-crash workarounds for Windows nvcc 12.9
  • TCQ (Trellis Coded Quantization) integrated as a turbo3_tcq KV type
  • A VRAM-fit auto-selector that probes free GPU memory and picks the most aggressive layer-adaptive K/V promotion mode that fits (mode 1 → 7 → 13 → off)
  • MoE offload tuning: --n-cpu-moe sweep methodology and validated 16 GB configs for Qwen3.6-35B-A3B
  • Long-context depth-sweep validation at d=0/16K/32K/65K/128K rather than d=0 only

I tuned this for one specific stack (RTX 5080 16 GB / Ryzen 9700X / DDR5 / Windows 11), but the code paths apply to any sm_120 setup, and the build / run instructions below cover other system configurations.

The original TurboQuant CUDA work — what makes the turbo* cache types fast at all — isn't mine. See Acknowledgments.

CUDA toolchain: sm_120 builds require CUDA 12.9.x. I tested 13.x and it produced garbage output; 13.1 segfaulted in MMQ kernels. CUDA 13 may fix this in future releases — until then, pin 12.9.

Why TurboQuant?

CUDA implementation of TurboQuant (ICLR 2026) KV cache compression for llama.cpp, targeting NVIDIA GPUs (SM86+).

The KV cache is the memory bottleneck for long-context LLM inference. At 32K+ tokens, the KV cache can exceed the model weights in size, consuming VRAM and bandwidth. TurboQuant compresses KV values from 8.5 bits (q8_0) down to 2-4 bits — slashing memory 4-8x while maintaining quality. The result: longer context, more concurrent users, and on bandwidth-limited GPUs, faster decode.
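As a back-of-envelope check on that claim, KV size scales linearly with bits per value. The sketch below uses a hypothetical 27B-class shape (64 layers, 8 KV heads, head_dim 128 — illustrative numbers, not this model's actual config) at 131K context:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    # K and V each store n_ctx vectors of n_kv_heads * head_dim values per layer
    total_bytes = n_layers * 2 * n_ctx * n_kv_heads * head_dim * bits_per_value / 8
    return total_bytes / 2**30

# Hypothetical 27B-class shape at 131072-token context
print(kv_cache_gib(64, 8, 128, 131072, 16))     # f16    -> 32.0 GiB
print(kv_cache_gib(64, 8, 128, 131072, 8.5))    # q8_0   -> 17.0 GiB
print(kv_cache_gib(64, 8, 128, 131072, 3.125))  # turbo3 ->  6.25 GiB
```

At this shape an f16 KV cache alone would dwarf a 12 GiB quantized model file, which is exactly the regime where the turbo types pay off.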

The 4 Turbo Types at a Glance

| Type | Bits/value | Compression | Best for | Trade-off |
| --- | --- | --- | --- | --- |
| turbo4 | 4.25 | 3.76x | Best quality | +0.97% PPL, lowest KL divergence |
| turbo3 | 3.125 | 5.12x | Best balance | +1.38% PPL at ctx=512, equals q8_0 at ctx=2048 |
| turbo2 | 2.125 | 7.53x | Long context / speed | +5.35% PPL, but fastest at 32K+ on all GPUs |
| turbo1.5 | 2.00 | 8x | Maximum compression | +8.18% PPL, most memory savings |
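The compression column is relative to an f16 KV cache (16 bits/value); a quick arithmetic check of the table:

```python
def compression_vs_f16(bits_per_value):
    # f16 stores 16 bits per KV value
    return 16 / bits_per_value

for name, bpv in [("q8_0", 8.5), ("turbo4", 4.25), ("turbo3", 3.125),
                  ("turbo2", 2.125), ("turbo1.5", 2.0)]:
    print(f"{name:8s} {compression_vs_f16(bpv):.2f}x")
```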

What This Fork Adds (over TheTom's base implementation)

This fork by @Madreag adds aggressive CUDA kernel optimizations that improve turbo decode by 13-69% at 32K context over the base implementation (verified on 4 GPUs: 5090, 3090 Ti, 3090, 4090M):

| Optimization | Impact |
| --- | --- |
| 8-wide LUT scoring (turbo3/turbo2) | +4.7% at 32K |
| nthreads_KQ=8 for all types | up to +17.7% at 32K |
| Sparse V skip (type-adaptive thresholds) | +4.6% at 32K, zero PPL cost |
| __launch_bounds__(128, 3) occupancy | +7-13% at 32K |
| Half-precision LUT, __expf softmax, L2 prefetch | cumulative ~9% |

At short context, both builds are identical or near-identical. The advantage shows at 32K+ where KV bandwidth dominates — the bigger the context, the larger the gain.

Built on signalnine's pre-rotate-queries architecture with parallel SET_ROWS, native Flash Attention vec_dot, and MMA prefill. All 4 turbo types with 36 asymmetric K/V combinations. Validated across 5 models, 4 GPUs, 1,351+ stability iterations with zero failures.

RTX 5080 16GB — What This Fork Was Tuned For

All numbers below are from my own measurements on a single RTX 5080 16GB / Ryzen 9700X / 96 GB DDR5 / Windows 11 / CUDA 12.9.1 box. Measured with llama-bench -d <depth> (decode tg128 at the listed prompt depth), 3 reps each.

Dense Qwen3.6-27B (NEO-CODE IQ3_M, 12.0 GiB on disk)

A coding-tuned IQ3_M quant of Qwen3.6-27B that fits comfortably at 128K context on 16 GB. KV layout is turbo3_tcq for K and V; the auto-selector picks TURBO_LAYER_ADAPTIVE per depth. Strong decode rate at low-to-mid context (where most editor sessions run) and graceful degrade past 96K — the second daily-driver alongside the MoE path; launch with qwen-turbo.ps1.

| Context depth | Old path (mode 13) | Auto-selector (this fork) | Auto picks |
| --- | --- | --- | --- |
| 0 | 40.5 | ~40 | mode 1 |
| 16K | 17.4 | ~26 | mode 1 |
| 32K | 10.6 | ~19 | mode 1 |
| 65K | 6.0 | 17.15 (+186%) | mode 1 |
| 90K | n/a | 13.56 | mode 1 |
| 131K | 3.2 | 7.30 (+128%) | mode 13 (auto falls back; mode 1 would PCIe-spill) |

The big depth-jump win is the VRAM-fit auto-selector: at d=65K it correctly picks mode 1 (K&V first-4 + last-4 promoted to q8_0), which is +34.7% over the prior SHIP mode-13 baseline at the same depth. At d=131K mode 1 would PCIe-spill, so the auto-selector falls back to mode 13 — graceful degrade rather than a cliff.

Estimate-vs-actual at d=65K: 1510 MiB predicted, 1509.88 MiB allocated (the auto-selector uses the same ggml_row_size formula the allocator uses, so the budgeted size matches reality to ~0.01 MiB).
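That exactness comes from budgeting with the allocator's own row-size rule. Below is a Python sketch of that rule: the q8_0 block traits (32 values per block, 34 bytes) are ggml's real layout, while the turbo3_tcq pair is an assumed placeholder for illustration, as is the mode-1 layer walk:

```python
import math

# (block_size, bytes_per_block): q8_0 is ggml's real layout (32 int8 values
# plus an fp16 scale = 34 bytes); the turbo3_tcq entry is assumed here.
TYPE_TRAITS = {"q8_0": (32, 34), "turbo3_tcq": (128, 50)}

def row_size(type_name, n_elems):
    # Mirrors ggml_row_size(): rows are stored as whole blocks, rounded up
    block_size, block_bytes = TYPE_TRAITS[type_name]
    return math.ceil(n_elems / block_size) * block_bytes

def kv_bytes_mode1(n_layers, n_ctx, row_elems):
    # Mode 1: first 4 + last 4 layers promoted to q8_0, rest stay turbo3_tcq
    total = 0
    for layer in range(n_layers):
        t = "q8_0" if layer < 4 or layer >= n_layers - 4 else "turbo3_tcq"
        total += 2 * n_ctx * row_size(t, row_elems)  # K and V
    return total
```

Because the estimate and the allocator share the same rounding, the budget can only diverge from reality by per-tensor padding, not by whole blocks.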

VRAM at d=128K decode: ~14.4 / 16.0 GB (model 12.0 + KV 1.5 + compute peak ~1.0).

Sparse-MoE Qwen3.6-35B-A3B (APEX-I-Compact, 16.10 GiB on disk)

35B total / 3B active, with --n-cpu-moe 8 keeping the upper layers' experts on GPU and offloading the first 8 expert layers to CPU. KV layout is turbo3_tcq with the auto-selector enabled.

| Context depth | Decode (t/s) | vs dense 27B at same depth |
| --- | --- | --- |
| 0 | 92.3 | +128% |
| 16K | 75.9 | +193% |
| 32K | 64.2 | +238% |
| 65K | 48.0 | +180% |
| 128K | 31.3 | +329% |

This is the daily-driver config. ~30 t/s sustained at 128K context is what makes long-context coding-agent workflow actually usable on a 16GB card.

VRAM at d=128K decode: ~13.3 / 16.0 GB (model on-GPU portion + KV + compute peak). PCIe Gen 5 x16 sits at ~89% saturation during decode (56–61 GB/s of 63 GB/s theoretical).

Why APEX-I-Compact is the SHIP MoE

--n-cpu-moe has sharp phase cliffs. Sweep on UD-Q4_K_XL (20.81 GiB) at d=16K:

| ncmoe | tg32 (t/s) | Notes |
| --- | --- | --- |
| 40 (all CPU) | 36.4 | baseline |
| 20 | 53.2 | safe |
| 16 | 58.9 | sweet spot for the 21 GB file |
| 12 | 36.1 | hit VRAM cliff |
| 8 | 5.9 | catastrophic spill |

APEX-I-Compact's smaller file (16.10 GiB vs 20.81 GiB) lets ncmoe=8 fit, which reduces PCIe traffic enough to hit the higher decode rate above. APEX-I-Quality (Q6_K, 21.25 GiB) needed ncmoe=20 and showed no quality win on a shared 11-test coding harness — dropped from rotation.
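The cliff falls out of simple accounting: decode collapses once GPU-resident weights exceed VRAM minus a working margin. Here is a hypothetical picker in that spirit — the 85% expert-weight fraction, even per-layer split, and 3 GiB reserve are illustrative assumptions, not measured values, so it will not reproduce the exact sweep numbers above:

```python
def pick_n_cpu_moe(file_gib, vram_gib, n_moe_layers=40, expert_frac=0.85,
                   reserve_gib=3.0):
    # Smallest --n-cpu-moe whose GPU-resident weights fit under vram - reserve.
    # expert_frac and reserve_gib are illustrative assumptions, not measurements.
    expert_gib = file_gib * expert_frac            # expert weights, split evenly
    per_layer_gib = expert_gib / n_moe_layers
    dense_gib = file_gib - expert_gib              # always GPU-resident
    for ncmoe in range(n_moe_layers + 1):
        on_gpu = dense_gib + per_layer_gib * (n_moe_layers - ncmoe)
        if on_gpu <= vram_gib - reserve_gib:
            return ncmoe
    return n_moe_layers
```

The qualitative behavior matches the sweep: a smaller file tolerates a smaller ncmoe, and more VRAM pushes ncmoe toward 0.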

What Actually Fits on 16GB at 131K Context (dense 27B)

| Quant | File size | Fits at 131K? |
| --- | --- | --- |
| NEO-CODE IQ3_M | 12.0 GiB | ✅ comfortably (~14.4 GB total with KV) |
| UD-Q3_K_XL | 13.5 GiB | ✅ tight |
| IQ4_XS | 14.3 GiB | ❌ ~1.6 GiB over |
| Q4_K_S | 14.8 GiB | ❌ |
| IQ4_NL | 15.0 GiB | ❌ |
| Q4_K_M | 15.7 GiB | ❌ |
| Q5 / Q6 | 19+ GiB | ❌ (5090 territory) |

Every Q4-class quant and above is out of reach on dense 27B at usable 128K context on 16GB. IQ4_XS would need ~7 layers offloaded to CPU which kills decode to ~5 t/s. NEO-CODE IQ3_M was the dense-path ship pick; the 35B-A3B MoE path was added alongside it for cases where I wanted higher t/s at deep context.
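A minimal fit check using the README's own figures (~1.5 GiB KV at 131K, ~1.0 GiB compute peak — these two constants are taken from the measurements above, not recomputed):

```python
def fits_131k(model_gib, vram_gib=16.0, kv_gib=1.5, compute_peak_gib=1.0):
    # Model weights + KV cache + transient compute peak must fit in VRAM
    return model_gib + kv_gib + compute_peak_gib <= vram_gib

for quant, size in [("NEO-CODE IQ3_M", 12.0), ("UD-Q3_K_XL", 13.5),
                    ("IQ4_XS", 14.3), ("Q4_K_M", 15.7)]:
    print(quant, "fits" if fits_131k(size) else "does not fit")
```

UD-Q3_K_XL lands exactly on the 16.0 GiB line, which is why the table calls it "tight".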

Hardware Ceiling

PCIe Gen 5 x16 hits ~89% saturation during MoE decode (56–61 GB/s burst against ~63 GB/s theoretical). SM utilization sits at 93–97%. Decode is bound by PCIe traffic from CPU-resident expert weights, not GPU compute. Getting past ~50 t/s sustained at long context on this stack would need more VRAM (fewer experts on CPU = less PCIe traffic), not more clever kernels.

I also profiled the dense 27B SHIP path with ncu: mul_mat_q<IQ3_S> is the hot kernel and is register-bound (254 regs/thread, ~12.5% theoretical occupancy, DRAM throughput <7%). Validated that cp.async / prefetch tricks don't help in this regime — they address memory latency that doesn't exist here.

Scaling Up to More VRAM (24 GB / 32 GB / 48 GB / 80 GB)

The fork is tuned on 16 GB but the auto-selector and MoE-offload paths scale cleanly upward. I haven't measured these myself — the numbers below are expected behavior based on how the auto-selector and --n-cpu-moe cliff work. PRs with measured runs from bigger cards are welcome.

Dense 27B / 32B with bigger quants. On 16 GB, IQ3_M was the largest dense quant that fits at 128K. With more VRAM the quant ladder opens up:

| Card | Dense quant headroom at 128K |
| --- | --- |
| RTX 5080 16 GB | IQ3_M (12.0 GiB), UD-Q3_K_XL (13.5 GiB) — current ship |
| RTX 5090 32 GB / RTX 4090 24 GB | Q4_K_M (15.7 GiB) and IQ4_XS (14.3 GiB) become comfortable; Q5_K_S / Q5_K_M plausible at 128K |
| RTX 6000 Ada 48 GB / A6000 48 GB | Q6_K (~22 GiB) at 128K with full headroom; Q8_0 plausible at 64K |
| A100 80 GB / H100 80 GB | Q8_0 dense at 128K with room for compute peak |

MoE 35B-A3B with less expert offload. --n-cpu-moe is the bandwidth lever — every layer you keep on GPU eliminates that layer's PCIe traffic. APEX-I-Compact (16.10 GiB) on 16 GB needed ncmoe=8. With more VRAM:

| Card | APEX-I-Compact ncmoe | Expected decode gain |
| --- | --- | --- |
| 16 GB | 8 (current ship) | baseline |
| 24 GB | 0–4 | ~30–50% higher decode at deep context (less PCIe traffic) |
| 32 GB+ | 0 (fully on GPU) | PCIe stops being the bottleneck; back in the pure compute regime |

UD-Q4_K_XL (20.81 GiB) at ncmoe=0 likewise becomes viable on a 24 GB card with room for KV at 128K, and on 32 GB with margin.

Auto-selector goes more aggressive automatically. The TURBO_LAYER_ADAPTIVE auto-selector probes free VRAM at startup and picks the most aggressive K/V promotion mode that fits. On a 16 GB card it falls back to mode 13 past 96K; on 24/32/80 GB cards it should pick mode 1 (K&V first-4 + last-4 q8_0) at every depth — meaning the +35% TG win it delivers at d=65K on 16 GB carries through the entire depth sweep instead of degrading past 96K. No config change needed; the log line confirms which mode was picked:

llama_kv_cache: TCQ auto-selected mode 1 (KV 1510 MiB, free 28432 MiB, margin 1024 MiB)
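The selection logic reduces to a first-fit cascade over per-mode KV estimates. A sketch (the MiB figures in the example are illustrative, only the mode-1 estimate matches the log line above):

```python
def select_mode(free_mib, kv_mib_by_mode, margin_mib=1024):
    # Most aggressive promotion first: mode 1 -> 7 -> 13 -> off (0).
    # A mode fits if its KV estimate plus the compute-peak margin is free.
    for mode in (1, 7, 13):
        if kv_mib_by_mode[mode] + margin_mib <= free_mib:
            return mode
    return 0

kv_est = {1: 1510, 7: 1300, 13: 1100}  # illustrative per-mode estimates
print(select_mode(28432, kv_est))  # plenty of free VRAM -> mode 1
print(select_mode(2200, kv_est))   # tight -> falls back to mode 13
```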

If you run this fork on a bigger card, please open a PR or issue with llama-bench -d 0,16384,32768,65536,98304,131072 for whichever model you tested and I'll roll the numbers into this README.

Performance (RTX 5090, Qwen 3.5 27B Q6_K)

| Type | Bits/value | Compression | Short decode (tok/s) | 32K decode | PPL ctx=512 | PPL ctx=2048 |
| --- | --- | --- | --- | --- | --- | --- |
| q8_0 | 8.5 | 1.88x | 63.40 | 55.60 | 6.759 | 5.674 |
| turbo4 | 4.25 | 3.76x | 63.70 | 56.73 | 6.825 (+0.97%) | 5.694 |
| turbo3 | 3.125 | 5.12x | 63.55 | 55.84 | 6.852 (+1.38%) | 5.674 (=q8_0) |
| turbo2 | 2.125 | 7.53x | 65.50 | 58.61 | 7.121 (+5.35%) | 5.873 |
| turbo1.5 | 2.00 | 8.0x | 63.13 | 55.16 | 7.312 (+8.18%) | 6.103 |

Speed measured with llama-bench -d 32768 (tg128 @ depth), ±0.3% variance. PPL from wikitext-2, 8 chunks.

Key takeaways from this table:

  • turbo2 at 32K beats q8_0 by 5.4% (58.61 vs 55.60) — the long-context champion at 7.5x compression
  • turbo4 at 32K beats q8_0 by 2.0% (56.73 vs 55.60) at 3.76x compression, best quality
  • turbo3 PPL at ctx=2048 equals q8_0 (5.674 = 5.674) — lossless quality at 5.1x compression
  • All types match or beat q8_0 at short context — turbo2 +3.3%, others within 1%
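The percentages above are plain throughput ratios against q8_0 at the same depth:

```python
def pct_gain(new_tps, old_tps):
    # Relative decode gain, in percent
    return (new_tps / old_tps - 1) * 100

print(round(pct_gain(58.61, 55.60), 1))  # turbo2 vs q8_0 at 32K
print(round(pct_gain(56.73, 55.60), 1))  # turbo4 vs q8_0 at 32K
```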

More highlights across models and contexts:

| Result | Numbers |
| --- | --- |
| turbo2 32K decode | 58.61 tok/s — 5.4% faster than q8_0 at 7.5x compression |
| turbo2 at 256K tokens (Q4_K_M) | 42.57 tok/s — consumer GPU, 8x cheaper KV than f16 |
| Kernel optimization impact (4 GPUs) | +13-69% at 32K vs base implementation, confirmed on 5090/3090 Ti/3090/4090M |
| NIAH retrieval (4 GPUs) | q8_0/turbo3/turbo2 100% on 5090, all types 92% on 3090 Ti |
| Stability across 4 GPUs | 1,351+ iterations, 0 failures, PPL bit-exact |

Quality (Perplexity)

| Type | bpv | PPL ctx=512 | vs q8_0 | PPL ctx=2048 | vs q8_0 |
| --- | --- | --- | --- | --- | --- |
| q8_0 | 8.5 | 6.759 | baseline | 5.674 | baseline |
| turbo4 | 4.25 | 6.825 | +0.97% | 5.694 | +0.34% |
| turbo3 | 3.125 | 6.852 | +1.38% | 5.674 | 0.00% |
| turbo2 | 2.125 | 7.121 | +5.35% | 5.873 | +3.50% |
| turbo1.5 | 2.0 | 7.312 | +8.18% | 6.103 | +7.55% |

Which Mode Should I Use?

| Your priority | Mode | Why | Command |
| --- | --- | --- | --- |
| Best balance | turbo3 | q8_0 quality at 5.1x compression | -ctk turbo3 -ctv turbo3 |
| Long context | turbo2 | 32K champion (+5.4% vs q8_0), 42 tok/s at 256K, 7.5x compression | -ctk turbo2 -ctv turbo2 |
| Best quality | turbo4 | +0.97% PPL at 3.76x compression | -ctk turbo4 -ctv turbo4 |
| Maximum compression | turbo1.5 | 8x compression, 212 tok/s MoE | -ctk turbo1.5 -ctv turbo1.5 |

Q4_K_M Weight Quantization (Speed Champion)

Combining Q4_K_M weight quantization with turbo KV cache compression enables extreme context lengths. Decode speed measured with llama-bench -d [depth] (tg128 @ depth):

| KV type | bpv | 32K | 65K | 131K | 256K |
| --- | --- | --- | --- | --- | --- |
| turbo4 | 4.25 | 66.33 | 60.41 | 49.06 | OOM |
| turbo3 | 3.125 | 66.88 | 58.37 | 47.36 | 35.38 |
| turbo2 | 2.125 | 70.65 | 63.94 | 51.23 | 42.57 |
| turbo1.5 | 2.00 | 64.77 | 57.99 | 46.38 | 33.40 |

turbo2 is the long-context champion at every depth. At 256K, turbo2 generates 42+ tok/s on a consumer 5090 — a context length where q8_0 would OOM.

PPL impact: Q4_K_M + turbo3 = 7.127 (+1.39% vs q8_0 = 7.030). Safe on 27B+ models.

Warning: Small Q4_K_M models (<10B) may have catastrophic PPL with symmetric turbo K. Use asymmetric (-ctk q8_0 -ctv turbo3) for safety. See TheTom's research.

Recommended Configurations

| Goal | Config | Command |
| --- | --- | --- |
| Maximum short-ctx speed | Q4_K_M weights + turbo3 KV | -m model-Q4_K_M.gguf -ctk turbo3 -ctv turbo3 -fa |
| Maximum long-ctx speed | Q4_K_M weights + turbo2 KV | -m model-Q4_K_M.gguf -ctk turbo2 -ctv turbo2 -fa |
| Best quality | Q6_K weights + turbo4 KV | -m model-Q6_K.gguf -ctk turbo4 -ctv turbo4 -fa |
| Quality-optimal asymmetric | Q6_K weights + K=turbo4/V=q8_0 | -m model-Q6_K.gguf -ctk turbo4 -ctv q8_0 -fa |
| Maximum compression | Q4_K_M weights + turbo1.5 KV | -m model-Q4_K_M.gguf -ctk turbo1.5 -ctv turbo1.5 -fa |
| Boundary V protection | turbo2 V (auto-enabled) | -m model.gguf -ctk turbo3 -ctv turbo2 -fa (Boundary V activates automatically) |

Building

The fork ships three PowerShell scripts at the repo root (compile.ps1, qwen-turbo.ps1, qwen-moe-turbo.ps1) that capture the exact configuration I run. They are starting points — adapt paths and flags for your system.

Path A — Windows + RTX 5080 / RTX 5090 (sm_120, the validated path)

# Defaults: CUDA 12.9, sm_120, Ninja, parallel=4
.\compile.ps1

# Force a clean rebuild (e.g. after changing CUDA version)
.\compile.ps1 -Clean

You'll need Ninja, CMake ≥ 3.18, the MSVC build tools, and CUDA Toolkit 12.9.x. Edit the top of compile.ps1 if your nvcc lives somewhere else.

Path B — Linux + Blackwell (sm_120) / RTX 5090

cmake -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="120-real" \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_NO_MXFP4=ON \
  -DLLAMA_CURL=OFF \
  -DLLAMA_BUILD_SERVER=ON

cmake --build build --target llama-server llama-cli llama-bench -j$(nproc)

GGML_CUDA_NO_MXFP4=ON is required on sm_120 — the consumer Blackwell silicon does not implement the MXFP4 PTX instructions, so leaving these kernels enabled fails to build (or builds and crashes ptxas on Windows).

Path C — Older CUDA arches (sm_86 Ampere / sm_89 Ada)

The TCQ KV path and the auto-selector work on any CUDA arch the upstream TurboQuant fork supported. Build the same way as Path B, just point CMAKE_CUDA_ARCHITECTURES at your card and drop the MXFP4 gate (it's only needed on sm_120):

# RTX 3090 / 3090 Ti
-DCMAKE_CUDA_ARCHITECTURES="86-real"
# RTX 4090 / 4090M
-DCMAKE_CUDA_ARCHITECTURES="89-real"

The sm_120 ptxas workarounds (the --ptxas-options=-O0 fallback for some turbo3_* TUs in ggml/src/ggml-cuda/CMakeLists.txt) are gated to sm_120 builds and don't slow down older arches.

Path D — CUDA 13 (not yet supported)

Tested CUDA 13.x produces garbage output on sm_120 builds and 13.1 segfaults inside MMQ kernels. Stick to CUDA 12.9.x until upstream nvcc fixes the codegen issues. If you have a working CUDA-13 build on a different arch, please open an issue.

Running

The two launcher scripts at the repo root document the validated runtime configurations on a 16 GB card. Both invoke llama-server on 127.0.0.1:8080 with a Claude-/OpenAI-compatible chat endpoint.

Dense Qwen3.6-27B (long context, agent workflow)

.\qwen-turbo.ps1 -Model path\to\Qwen3.6-27B.gguf -Context 131072

Defaults: --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq, the VRAM-fit auto-selector picks TURBO_LAYER_ADAPTIVE mode, attention sinks on, prompt cache enabled. Override with -Fit if you want llama.cpp's automatic CPU-offload-on-overflow behaviour; otherwise the script forces -ngl 999 so OOM is the hard signal that you need a smaller context.

Sparse-MoE Qwen3.6-35B-A3B (16 GB ship config)

.\qwen-moe-turbo.ps1 -Model path\to\Qwen3.6-35B-A3B-APEX-I-Compact.gguf -NCpuMoE 8

Pick -NCpuMoE to match your GGUF size on 16 GB:

| GGUF size | -NCpuMoE | Notes |
| --- | --- | --- |
| ~16 GB Q4 (e.g. APEX-I-Compact) | 8 | validated SHIP, 30+ t/s @ d=128K |
| ~21 GB Q4_K (e.g. UD-Q4_K_XL) | 16 | sweet spot for the heavier file |
| ~21 GB Q6_K (e.g. APEX-I-Quality) | 20 | fits but no quality win on shared harness |

The cliff is sharp. Going one step lower than the matched value spills VRAM and decode collapses (e.g. ncmoe=8 on a 21 GB file → ~6 t/s).

Plain llama-server / llama-cli (any system)

If you'd rather skip the launcher scripts and call llama.cpp directly:

# Dense, TCQ KV with auto-selector
./build/bin/llama-server -m model.gguf \
  -ngl 999 -c 131072 \
  --flash-attn on \
  --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
  --batch-size 2048 --ubatch-size 1024 \
  --no-mmap --jinja --port 8080

# MoE with expert offload
./build/bin/llama-server -m model.gguf \
  -ngl 999 --n-cpu-moe 8 -c 131072 \
  --flash-attn on \
  --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
  --no-mmap --port 8080

Environment variables

Optional knobs (set before launching the server):

| Var | Default | Purpose |
| --- | --- | --- |
| TURBO_LAYER_ADAPTIVE | auto-selected | Force a specific layer-adaptive mode (override the auto-selector). 0=disable, 1=K&V first4+last4 q8_0, 7=K-only last8 q8_0, 13=V-only first2+last2 q8_0 |
| TURBO_SINK_SIZE | 0 | Number of leading tokens kept at fp16 as attention sinks (use 4 for chat templates with system tokens) |
| TURBO_NORM_ALPHA_V | 1.04 | TurboQuant V-cache norm scaling (KLD-optimal for Qwen3 27B) |
| TURBO_TCQ_ALPHA_V | 1.04 | TCQ-specific V-cache norm scaling |
| TURBO_INNERQ / TURBO_INNERQ_STRENGTH | 4096 / 1.0 | InnerQ per-channel calibration window and mix |

Look for llama_kv_cache: TCQ auto-selected mode N (KV X MiB, free Y MiB, margin 1024 MiB) in the server log to confirm the auto-selector picked a mode.

Claude Code / Anthropic-compatible clients

The server speaks Anthropic's /v1/messages endpoint. Point any client that accepts ANTHROPIC_BASE_URL at it:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_API_KEY=anything
claude            # or your Anthropic-SDK app

OpenAI-compatible (/v1/chat/completions) also works — see the existing llama.cpp server docs further down this README.

Quick Start

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build -j$(nproc)

# turbo3 (best balance — matches q8_0 quality at 5.1x compression)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo3 -ctv turbo3 -fa -ngl 99

# turbo2 (long-context champion — beats q8_0 speed at 32K)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo2 -ctv turbo2 -fa -ngl 99

# turbo1.5 (8x compression, maximum memory savings)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo1.5 -ctv turbo1.5 -fa -ngl 99

# Server mode
./build/bin/llama-server -hf your-model-GGUF -ctk turbo3 -ctv turbo3 -fa -ngl 99 --port 8080

# Asymmetric (different K and V types)
./build/bin/llama-cli -hf your-model-GGUF -ctk turbo4 -ctv turbo3 -fa -ngl 99

Notes:

  • -fa enables Flash Attention (required for native turbo decode)
  • Use --no-mmap on WSL2 to disable mmap (avoids GPU stalls from page cache)
  • Adjust -DCMAKE_CUDA_ARCHITECTURES for your GPU: 86 (3090 Ti), 89 (4090), 120 (5090)

Multi-Model Validation

Tested across 5 model architectures with head dimensions D=64, 96, 128, 256 on RTX 5090:

| Model | Params | D | GQA | Status | turbo3 tok/s | q8_0 tok/s | Prefill tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | 1.24B | 64 | 4:1 | PASS | 672 | 691 | 38,930 |
| Phi-3.5-mini | 3.82B | 96 | 1:1 | FALLBACK | 221* | 247 (f16) | N/A |
| Phi-4-mini | 3.84B | 128 | 3:1 | PASS | 274 | 275 | 18,433 |
| Llama-3.3-8B | 8.03B | 128 | 4:1 | PASS | 177 | 181 | 10,558 |
| Gemma-3-12B | 12.2B | 256 | 2:1 | PASS | 106 | 91 | 6,632 |

* D=96: graceful fallback to non-FA attention. Slower but correct — not a crash.

Supported Head Dimensions

The VEC Flash Attention kernel supports D=64, D=128, D=256 (D % 64 == 0 required). Models with other head dimensions (e.g., D=96) fall back to standard mul_mat attention automatically — slower but fully functional.
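The dispatch rule can be expressed compactly (a sketch of the rule described above, not the actual C++ dispatcher):

```python
def attention_path(head_dim):
    # Native VEC Flash Attention needs D % 64 == 0 and a compiled instance
    # (D = 64, 128, 256); everything else takes the mul_mat fallback.
    if head_dim % 64 == 0 and head_dim in (64, 128, 256):
        return "vec_flash_attention"
    return "mul_mat_fallback"  # slower but correct, e.g. D=96
```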

Cross-GPU Validation

Validated on 4 NVIDIA GPUs across 3 architecture generations, 1,351+ total stability iterations, zero failures:

| GPU | SM | VRAM | Stability | PPL drift | turbo2 > q8_0 at 32K? |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 | SM120 | 32 GB | 340+ iterations | None | Yes (58.61 vs 55.60) |
| RTX 3090 Ti (OC) | SM86 | 24 GB | 486+ iterations, 48 PPL checks | Bit-exact | Yes (81.58 vs 77.44) |
| RTX 3090 | SM86 | 24 GB | 100+ iterations | Bit-exact | Yes (63.12 vs 61.0) |
| RTX 4090M | SM89 | 16 GB | 425+ iterations, 14+ PPL checks | Bit-exact | Yes (52.7 vs 52.0) |

RTX 3090 Ti (SM86, 24 GB GDDR6X, OC +2200 mem, Qwen 3.5 9B Q8_0)

| Type | bpv | Short | 32K | 64K | PPL ctx=512 |
| --- | --- | --- | --- | --- | --- |
| q8_0 | 8.5 | 91.01 | 77.44 | OOM | 8.525 |
| turbo4 | 4.25 | 90.03 | 75.55 | OOM | 8.634 |
| turbo3 | 3.125 | 90.35 | 75.01 | 61.47 | 8.624 |
| turbo2 | 2.125 | 90.75 | 81.58 | 72.79 | 8.747 |
| turbo1.5 | 2.00 | 90.13 | 74.85 | 63.44 | 9.402 |

turbo2 at 32K = 81.58 tok/s — beats q8_0 (77.44) by 5.3% at 7.5x compression. turbo2 64K = 72.79 tok/s where q8_0 OOMs. K=turbo3/V=q8_0 PPL (8.515) beats pure q8_0 (8.525) — K compression is free. OC: +100 core, +2200 mem (golden sample), 516W. Speed measured with -d flag (tg128 @ depth), ±0.3% variance.

NIAH (25 tests, 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%. With sufficient token budget, all types converge — remaining failures at 32K/64K depth 10% are model-specific, not turbo degradation.

RTX 4090M Laptop (SM89, 16 GB GDDR6, Qwen 3.5 9B Q8_0)

| Type | bpv | Short | 32K | PPL ctx=512 |
| --- | --- | --- | --- | --- |
| q8_0 | 8.5 | 55.5 | 52.0 | 9.374 |
| turbo4 | 4.25 | 55.9 | 52.4 | 9.535 |
| turbo3 | 3.125 | 55.7 | 49.0 | 9.683 |
| turbo2 | 2.125 | 55.9 | 52.7 | 9.584 |
| turbo1.5 | 2.00 | 55.7 | 48.3 | 10.394 |

All types ~55-56 tok/s at short context. turbo2 at 32K matches q8_0 (52.7 vs 52.0) on a 16GB laptop GPU. Max context capped at 32K (65K crashes WSL2 OOM). Speed measured with -d flag (tg128 @ depth). NIAH (max_tokens=4000): q8_0=turbo3=100%, turbo2=95%, turbo1.5=50%.

32K Context — turbo2 Beats q8_0 on ALL Models (RTX 5090)

| Model | Params | D | turbo2 32K | q8_0 32K | Advantage |
| --- | --- | --- | --- | --- | --- |
| Phi-4-mini | 3.84B | 128 | 182.50 | 139.72 | +31% |
| Llama-3.3-8B | 8.03B | 128 | 131.64 | 117.73 | +12% |
| Gemma-3-12B | 12.2B | 256 | 104.50 | 95.76 | +9% |
| Qwen 27B | 26.9B | 256 | 58.61 | 55.60 | +5% |

turbo2 advantage scales with bandwidth-boundedness: smaller models benefit more.

KL Divergence vs f16 (RTX 5090, 27B Q6_K, 100 prompts)

| Type | KL divergence | Top-1 agreement | Delta-p RMS |
| --- | --- | --- | --- |
| q8_0 | 0.000408 | 100.0% | 0.0153 |
| turbo4 | 0.006485 | 99.0% | 0.0488 |
| turbo3 | 0.012495 | 93.0% | 0.0664 |
| turbo2 | 0.032700 | 91.0% | 0.1146 |
| turbo1.5 | 0.062681 | 88.0% | 0.1502 |

Prefill Context Scaling (RTX 5090, 27B Q6_K, tok/s)

| Context | q8_0 | turbo4 | turbo3 | turbo2 | turbo1.5 |
| --- | --- | --- | --- | --- | --- |
| pp512 | 3,512 | 3,548 | 3,547 | 3,649 | 3,577 |
| pp4096 | 3,457 | 3,494 | 3,495 | 3,452 | 3,467 |
| pp8192 | 3,390 | 3,390 | 3,414 | 3,394 | 3,394 |
| pp16384 | 3,347 | 3,304 | 3,304 | 3,304 | 3,304 |
| pp32768 | 2,839 | 2,815 | 2,801 | 2,805 | 2,808 |

Prefill auto-dequants turbo→fp16 and uses MMA/TILE kernels. All types track q8_0 with negligible overhead.

Sparse V Skip — Zero Quality Cost, Free Speed

| Metric | Sparse V ON | Sparse V OFF | Delta |
| --- | --- | --- | --- |
| turbo3 PPL ctx=512 | 6.7251 | 6.7251 | 0.000 |
| turbo3 32K speed | +4.6% | baseline | +4.6% |

Sparse V skips V dequantization for attention positions with negligible weight. Proven zero quality impact via controlled A/B test (PPL bit-identical). Type-adaptive thresholds: 5e-3 for turbo3/turbo4, 1e-2 for turbo2/turbo1.5.
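In pseudocode, the skip is a per-position threshold on the softmax weight (a sketch of the idea, not the CUDA kernel):

```python
def sparse_v_positions(softmax_weights, kv_type):
    # Type-adaptive thresholds from the A/B test: coarser KV types use a
    # looser threshold and skip more positions
    threshold = 5e-3 if kv_type in ("turbo3", "turbo4") else 1e-2
    return [i for i, w in enumerate(softmax_weights) if w >= threshold]

# Positions below threshold never have their V rows dequantized
print(sparse_v_positions([0.6, 1e-4, 0.3, 2e-3], "turbo3"))  # [0, 2]
```

Since the skipped positions contribute almost nothing to the weighted V sum, the output is unchanged to within PPL measurement precision, which is what the A/B test confirms.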

Asymmetric K/V Quality Matrix (PPL ctx=512, 27B Q6_K, wikitext-103 50ch)

| K \ V | q8_0 | turbo4 | turbo3 | turbo2 |
| --- | --- | --- | --- | --- |
| q8_0 | 6.6395 | 6.6935 | 6.6885 | 6.8630 |
| turbo4 | 6.6580 | 6.7102 | 6.7088 | 6.8821 |
| turbo3 | 6.6698 | 6.7259 | 6.7251 | 6.8849 |
| turbo2 | 6.8168 | 6.8687 | 6.8429 | 7.0396 |

V type dominates PPL (columns vary more than rows). K compression is nearly free — K=turbo3/V=q8_0 is almost identical to q8_0/q8_0.

Tips

  • Best quality-per-bit: K=turbo4/V=q8_0 asymmetric config actually beats pure q8_0 PPL (6.155 vs 6.162 at ctx=2048 on 9B) while using less memory.
  • Layer-adaptive mode 2: TURBO_LAYER_ADAPTIVE=2 closes 40% of the turbo3-to-q8_0 PPL gap at zero performance cost.
  • Boundary V protection: Auto-enabled when using -ctv turbo2 (mode 12). Protects first4+last4 layers with q8_0-V, recovers 37-91% of the turbo2-to-turbo3 quality gap. Opt-out: TURBO_LAYER_ADAPTIVE=0.
  • Q4_K_M stacking: Safe on 27B+ models (PPL +1.39%). For small Q4_K_M models (<10B), use -ctk q8_0 -ctv turbo3 to avoid catastrophic PPL from double quantization noise in K.

Limitations

  • Head dimension: Only D∈{64, 128, 256} use native Flash Attention. D=80, D=96, D=112, and others gracefully fall back to mul_mat attention (slower but correct).
  • SM120 D=256 LUT: Due to a confirmed NVIDIA compiler bug (NVBUG 5218000, NVBUG 5288270), the LUT scoring optimization is automatically disabled for D=256 models on SM120 (RTX 5090). The VEC kernel uses vec_dot scoring instead — same speed, correct output, zero PPL impact. D=64 and D=128 models use LUT normally. Tested across CUDA 12.8 through 13.2 — all affected. Will re-enable when NVIDIA fixes SM120 codegen.
  • Attention sinks: Implemented but provide 0% PPL improvement across all tested configurations. Warning: TURBO_SINK_SIZE values {1, 4, 16} crash on SM89 (RTX 4090). Sizes {0, 2, 8} work. SM86 and SM120 are unaffected.
  • V sinks: Dead end — register pressure causes -12.7% speed regression at 32K.
  • FP4 tensor core acceleration: Not viable. Q values are too small for E2M1 (99.5% map to zero), and no mixed fp16×E2M1 MMA instruction exists on SM120.
  • Known Gemma 3 issues: Gibberish after context shift and slow quantized KV cache are upstream llama.cpp bugs, not TurboQuant-specific.

Impact of CUDA Kernel Optimizations

Measured by comparing the base TurboQuant implementation against the optimized fork on the same GPU, same model, back-to-back. All speed with -d flag (tg128 @ depth).

RTX 5090 (27B Q6_K)

| Type | Before | After | Improvement |
| --- | --- | --- | --- |
| Short (all types) | 63-65 | 63-65 | ~tie |
| turbo4 32K | 38.88 | 56.73 | +45.9% |
| turbo3 32K | 46.62 | 55.84 | +19.8% |
| turbo2 32K | 51.69 | 58.61 | +13.4% |

RTX 3090 (9B Q8_0)

| Type | Before | After | Improvement |
| --- | --- | --- | --- |
| q8_0 32K | 56.91 | 61.0 | +7.2% |
| turbo4 32K | 35.63 | 60.28 | +69% |
| turbo3 32K | 44.79 | 56.82 | +27% |
| turbo2 32K | 53.21 | 63.12 | +19% |
| turbo3 64K | 33.43 | 49.27 | +47% |
| turbo2 64K | 42.45 | 56.91 | +34% |

RTX 4090M (9B Q8_0)

| Type | Before | After | Improvement |
| --- | --- | --- | --- |
| Short (all types) | 55-56 | 55-56 | ~tie |
| q8_0 32K | 48.2 | 52.0 | +8% |
| turbo4 32K | 34.5 | 52.4 | +52% |
| turbo3 32K | 40.3 | 49.0 | +22% |
| turbo2 32K | 44.9 | 52.7 | +17% |

Pattern across 4 GPUs: Short context is identical or near-identical (weight-loading bound). Optimizations show at 32K+ where KV bandwidth dominates — LUT scoring, nthreads_KQ=8, and sparse V skip reduce per-token KV access cost. turbo4 benefits most (+46-68%) because its larger KV amplifies the unoptimized dequant cost. Advantage grows with context depth: 32K → 64K shows +34-47% on the 3090.

Quality (wikitext-2, 8 chunks)

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| q8_0 PPL 512 | 6.7590 | 6.7590 | identical |
| turbo3 PPL 512 | 6.8380 | 6.8522 | +0.2% |
| turbo3 PPL 2048 | 5.6997 | 5.6744 (=q8_0) | -0.4% (better) |

q8_0 identical. Optimized turbo3 at ctx=2048 equals q8_0 exactly (5.6744 = 5.6744).

Acknowledgments and Contributions

This Layer — Adaptive Blackwell (@craftogrammer)

Tuning + integration on top of Madreag's TurboQuant CUDA fork, focused on consumer Blackwell (sm_120, RTX 5080 16 GB) and long-context coding-agent workflow:

Blackwell silicon support:

  • sm_120 + Windows nvcc 12.9 ptxas-crash workarounds (__noinline__ on q4_0 / turbo3_tcq helpers; --ptxas-options=-O0 fallback for turbo3_0 and turbo3_tcq TUs; MXFP4 paths gated behind GGML_CUDA_NO_MXFP4)
  • wgmma / setmaxnreg confirmed unavailable on consumer Blackwell; cp.async, mbarrier, TMA, and prefetch.global.L2 (lowers to CCTL.E.PF2 SASS) verified available
  • Pinned to 120-real (avoid silent 12X→12Xa coercion that targets datacenter-only ops)

TCQ KV path:

  • turbo3_tcq cache type integrated as a same-type and mixed-pair (turbo3_tcq with q8_0) attention path; D=128/256 dispatch; FWHT groups + attention-sink capture
  • Inline V dequantization + byte-pair vectorization in the same-type FA TU (cumulative +5.1% / +9.9% / +13.0% TG at d=16K / 32K / 64K)
  • K_set_rows backtrace in dynamic SMEM (drops a 128 MiB scratch alloc)

Auto-selection + adaptive layout:

  • VRAM-fit auto-selector in llama-kv-cache.cpp — probes ggml_backend_dev_memory, estimates per-mode KV bytes with the same ggml_row_size formula the allocator uses, picks the most aggressive TURBO_LAYER_ADAPTIVE mode that fits under free VRAM minus 1 GiB compute-peak margin; predicted-vs-actual 1510 / 1509.88 MiB at d=65K
  • Mode 1 (K&V first-4 + last-4 q8_0) → mode 7 (K-only last-8 q8_0) → mode 13 (V-only first-2 + last-2) → off cascade

MoE offload tuning:

  • --n-cpu-moe sweep methodology validated for Qwen3.6-35B-A3B on 16 GB; APEX-I-Compact (16 GB Q4) at ncmoe=8 is the SHIP MoE config (~30 t/s @ d=128K)

Validation:

  • Long-context depth-sweep harness at d=0/16K/32K/65K/128K (rather than the d=0-only numbers most posts report)
  • ncu-profiled the SHIP decode path: mul_mat_q<IQ3_S> is register-bound (254 regs/thread, ~12.5% theoretical occupancy) — validated that cp.async / prefetch tricks don't help
  • Dropped optimizations that didn't survive clean rebench (e.g. TURBO_SPARSE_V_THRESHOLD runtime knob caused a 32% decode regression — reverted to constexpr 1e-6f)

Parent Fork (Madreag) — TurboQuant CUDA

CUDA kernel optimizations, cross-GPU validation, and quality testing by @Madreag:

Kernel Optimizations:

  • 8-wide LUT scoring for turbo3/turbo2 — 2 qs bytes per iteration, +4.7% at 32K
  • Half-precision shared memory LUT (float→half) — halves shmem bandwidth, +2.45% at 32K
  • __expf fast-math softmax — all 5 sites in VEC kernel, +3.69% at 32K, PPL bit-exact
  • nthreads_KQ=8 for all turbo types — 4 interleaved dots/warp, up to +17.7% at 32K
  • static constexpr __device__ centroid arrays — register-allocated, 0 latency
  • L2 prefetch hints in VEC decode loop — +2.9% at 32K
  • __launch_bounds__(128, 3) occupancy fix — 2→3 blocks/SM, +7-13% at 32K
  • Sparse V threshold escalation (1e-6→5e-3/1e-2) — type-adaptive, +5-28% at 32K, PPL bit-exact
  • D=256 LUT disable for SM120 — workaround for NVIDIA codegen bug (NVBUG 5218000/5288270)
  • Block-128 CUDA validation — turbo3 5.12x compression, turbo2 7.53x

Architecture & Features:

  • All 4 turbo types ported to CUDA (turbo4, turbo3, turbo2, turbo1.5)
  • 36 asymmetric K×V combinations with full VEC template instances
  • 15 layer-adaptive modes (KV ordinal-based, hybrid architecture compatible)
  • Graph-compatible attention sinks (__device__ + cudaMemcpyAsync)
  • D=64/128/256 FA dispatch with graceful D=96 fallback

Validation:

  • 1,351+ stability iterations across 4 NVIDIA GPUs (SM86×2/SM89/SM120), zero failures
  • 5-model architecture sweep (D=64/96/128/256, GQA 1:1 to 4:1)
  • NIAH quality testing across 4 GPUs (4K-64K): q8_0/turbo3 100% on 5090, 3090, 4090M; all types 92% on 3090 Ti
  • Extreme context: turbo2 at 256K = 42.57 tok/s on consumer RTX 5090

Upstream Contributors

  • TheTom — Metal implementation, turbo4 resurrection (7 bugs fixed), asymmetric K/V discovery, turbo3 norm correction, block-128 storage research, sparse V concept, quality validation methodology
  • signalnine — Original CUDA port of TurboQuant for llama.cpp (PR #3 to TheTom's repo), InnerQ per-channel equalization
  • spiritbuun — turbo4 norm correction (separate CUDA fork), inverse FWHT prefill optimization
  • HyperionMS2040 — Block-128 SET_ROWS warp-to-block mapping fix (7cb6edb), validated PPL-identical on SM86

Paper

TurboQuant: Online Vector Quantization for KV Cache Compression — Google Research, ICLR 2026.


Below is the original llama.cpp README.


llama.cpp


LLM inference in C/C++



Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.

Example command:

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2, AVX512 and AMX support for x86 architectures
  • RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The llama.cpp project is the main playground for developing new features for the ggml library.

Models

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: HOWTO-add-model.md

Text-only

Multimodal

Bindings
UIs

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools
  • akx/ggify – download PyTorch models from Hugging Face Hub and convert them to GGML
  • akx/ollama-dl – download models from the Ollama library to be used directly with llama.cpp
  • crashr/gppm – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
  • gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
  • Styled Lines (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
  • unslothai/unsloth – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
Infrastructure
  • Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
  • GPUStack - Manage GPU clusters for running LLMs
  • llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
  • llama-swap - transparent proxy that adds automatic model switching with llama-server
  • Kalavai - Crowdsource end to end LLM deployment at any scale
  • llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
  • LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support
Games
  • Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.

Supported backends

Backend                  Target devices
Metal                    Apple Silicon
BLAS                     All
BLIS                     All
SYCL                     Intel and Nvidia GPU
OpenVINO [In Progress]   Intel CPUs, GPUs, and NPUs
MUSA                     Moore Threads GPU
CUDA                     Nvidia GPU
HIP                      AMD GPU
ZenDNN                   AMD CPU
Vulkan                   GPU
CANN                     Ascend NPU
OpenCL                   Adreno GPU
IBM zDNN                 IBM Z & LinuxONE
WebGPU [In Progress]     All
RPC                      All
Hexagon [In Progress]    Snapdragon
VirtGPU                  VirtGPU APIR

Obtaining and quantizing models

The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, by using this CLI argument: -hf <user>/<model>[:quant]. For example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, the CLI downloads from Hugging Face; you can switch to another source via the MODEL_ENDPOINT environment variable, which must point to a Hugging Face-compatible API endpoint.
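For example, downloading through an alternative endpoint (the mirror URL below is illustrative; any Hugging Face-compatible API works):

```shell
# Fetch the model via an alternative Hugging Face-compatible endpoint
MODEL_ENDPOINT=https://hf-mirror.com/ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```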

After downloading a model, use the CLI tools to run it locally - see below.

llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
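As a sketch, converting a local Hugging Face checkout with the bundled conversion script (the input path and output name are placeholders):

```shell
# Convert a local Hugging Face model directory to GGUF with f16 weights
python convert_hf_to_gguf.py ./my-hf-model --outfile my_model.gguf --outtype f16
```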

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

To learn more about model quantization, read this documentation

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

  • Run in conversation mode

    Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding -cnv and specifying a suitable chat template with --chat-template NAME

    llama-cli -m model.gguf
    
    # > hi, who are you?
    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
    #
    # > what is 1+1?
    # Easy peasy! The answer to 1+1 is... 2!
  • Run in conversation mode with custom chat template
    # use the "chatml" template (use -h to see the list of supported templates)
    llama-cli -m model.gguf -cnv --chat-template chatml
    
    # use a custom template
    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
  • Constrain the output with a custom grammar
    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
    
    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}

    The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.

    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
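As a taste of the format, here is a minimal GBNF grammar that restricts the model to answering exactly "yes" or "no" (a hypothetical example, not one of the bundled grammars):

```
root ::= "yes" | "no"
```

Pass it to llama-cli with --grammar-file, as in the JSON example above.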

llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.

  • Start a local HTTP server with default configuration on port 8080
    llama-server -m model.gguf --port 8080
    
    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  • Support multiple users and parallel decoding
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
  • Enable speculative decoding
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
  • Serve an embedding model
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
  • Serve a reranking model
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
  • Constrain all outputs with a grammar
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf
    
    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
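In the parallel-decoding example above, the total context (-c 16384) is split evenly across the slots (-np 4), which is where the "4096 max context" per request comes from. The arithmetic, as a quick sketch:

```python
# Per-slot context when llama-server divides the total context
# evenly across parallel slots
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    return total_ctx // n_parallel

print(per_slot_context(16384, 4))  # 4096
```

Size -c accordingly: to give each of N concurrent requests a context of C tokens, pass -c N*C.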

llama-perplexity

A tool for measuring the perplexity [1] (and other quality metrics) of a model over a given text.

  • Measure the perplexity over a text file
    llama-perplexity -m model.gguf -f file.txt
    
    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
    # Final estimate: PPL = 5.4007 +/- 0.67339
  • Measure KL divergence
    # TODO
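Perplexity is simply the exponential of the mean per-token negative log-likelihood, which is what the running per-chunk estimates above converge to. A minimal illustration (the values are synthetic, not llama-perplexity output):

```python
import math

# PPL = exp(mean negative log-likelihood per token)
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: uniform guessing over a 4-token vocabulary gives PPL = 4
print(round(perplexity([math.log(4)] * 10), 6))  # 4.0
```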

llama-bench

Benchmark the performance of the inference for various parameters.

  • Run default benchmark
    llama-bench -m model.gguf
    
    # Output:
    # | model               |       size |     params | backend    | threads |          test |                  t/s |
    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
    #
    # build: 3e0ba0e60 (4229)
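The t/s column above is reported as mean ± sample standard deviation over the benchmark's repetitions. The formatting can be reproduced with a small sketch (the sample values are hypothetical):

```python
import statistics

# Hypothetical tg128 throughput samples from repeated runs (tok/s)
samples = [197.1, 198.3, 197.7]

# Report mean and sample standard deviation, llama-bench style
print(f"{statistics.mean(samples):.2f} ± {statistics.stdev(samples):.2f}")  # 197.70 ± 0.60
```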

llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.

  • Basic text completion
    llama-simple -m model.gguf
    
    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of

Contributing

  • Contributors can open PRs
  • Collaborators will be invited based on contributions
  • Maintainers can push to branches in the llama.cpp repo and merge PRs into the master branch
  • Any help with managing issues, PRs and projects is very appreciated!
  • See good first issues for tasks suitable for first contributions
  • Read the CONTRIBUTING.md for more information
  • Make sure to read this: Inference at the edge
  • A bit of backstory for those who are interested: Changelog podcast

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

XCFramework

The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:

// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.

import PackageDescription

let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)

The example above uses an intermediate build b5046 of the library; to use a different version, change the URL and checksum.
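When pinning a different release, the checksum for the new archive can be recomputed with SwiftPM (the filename below is a placeholder):

```shell
# Compute the SwiftPM binary-target checksum for a downloaded xcframework zip
swift package compute-checksum llama-bXXXX-xcframework.zip
```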

Completions

Command-line completion is available for some environments.

Bash Completion

$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

$ echo "source ~/.llama-completion.bash" >> ~/.bashrc

Dependencies

  • yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
  • stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
  • nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
  • miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
  • subprocess.h - Single-header process launching solution for C and C++ - Public domain

Footnotes

  1. https://huggingface.co/docs/transformers/perplexity

About

Downstream llama.cpp TurboQuant CUDA fork with adaptive KV layout selection for long-context inference on consumer Blackwell GPUs.
