feat: Vulkan compute shader support for turbo3 (experimental) #33

Merged

TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache from apollosenvy:pr/vulkan-turbo3 on Apr 8, 2026

Conversation

@apollosenvy

Summary

Vulkan backend support for turbo3 KV cache quantization. Experimental -- works on AMD 7900 XTX via RADV.

New files:

  • dequant_turbo3_0.comp -- standalone dequant shader (3-bit index from 2-bit qs + 1-bit signs)
  • dequant_funcs.glsl -- inline dequant/dequant4/get_dm for get_rows/mul_mat
  • dequant_funcs_cm2.glsl -- cooperative matrix 2 FA path
  • copy_to_quant.comp -- quantize with norm correction
  • types.glsl -- block_turbo3_0 struct
  • vulkan-shaders-gen.cpp -- turbo3_0 type registration
  • ggml-vulkan.cpp -- pipeline creation + supports_op
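
To make the split storage concrete, here is a rough sketch (not the actual shader code) of the 3-bit index reconstruction named above: each element keeps a 2-bit magnitude index in `qs` and a 1-bit sign in `signs`, recombined into a 3-bit centroid index at dequant time. The 32-elements-per-block figure is inferred from the `qs[8]`/`signs[4]` field sizes and may not match the real block geometry.

```python
# Sketch of the assumed 2-bit qs + 1-bit signs split; the real shader's
# block layout and element count may differ.

def pack_block(indices):
    """Split 3-bit indices into the 2-bit qs / 1-bit signs arrays."""
    qs, signs = bytearray(8), bytearray(4)
    for i, idx in enumerate(indices):
        qs[i // 4] |= (idx & 0x3) << (2 * (i % 4))      # low 2 bits
        signs[i // 8] |= ((idx >> 2) & 0x1) << (i % 8)  # sign bit
    return bytes(qs), bytes(signs)

def unpack_index(qs, signs, i):
    """Rebuild the 3-bit centroid index for element i."""
    q = (qs[i // 4] >> (2 * (i % 4))) & 0x3
    s = (signs[i // 8] >> (i % 8)) & 0x1
    return (s << 2) | q
```

Round-tripping 32 indices through `pack_block`/`unpack_index` recovers them exactly; the shader presumably does the unpack step per element before its centroid lookup.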

Benchmark (AMD 7900 XTX, RADV, Vulkan)

| Test | F16 KV (t/s) | turbo3 KV (t/s) | Ratio |
|-------|-------------:|----------------:|------:|
| pp128 | 748 | 264 | 35% |
| tg32 | 36.0 | 27.4 | 76% |
| tg128 | 33.3 | 26.4 | 79% |

The tg numbers are already faster than ROCm HIP (25.2 t/s). The pp gap is from the standalone dequant path -- inline FA dequant would close it.

Status

  • Quantize/dequant: working
  • get_rows: working
  • set_rows: working
  • Flash attention: works via dequant-to-F16 path (not inline turbo3 FA)
  • coopmat2 FA: shader compiles, untested on hardware

Marked experimental. Tested on RADV only.

🤖 Generated with Claude Code

terrysimons pushed a commit to terrysimons/llama-cpp-turboquant that referenced this pull request Mar 31, 2026
Half-precision centroid table in vec flash attention dequant.
Reduces constant cache pressure at high access volumes.

Decode improvements:
  Short: 75.3 → 77.2 (+2.5%)
  8K: 59.2 → 67.3 (+13.7%)
  48K (Mario PDF): 36.7 → 39.0 (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
TheTom added a commit that referenced this pull request Apr 2, 2026
TheTom force-pushed the feature/turboquant-kv-cache branch from 63b832b to e9c54d5 on April 3, 2026 16:14
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 6, 2026
@Titaniumtown

Titaniumtown commented Apr 7, 2026

Needs to be rebased to support gemma4 btw. At least I think so:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'

@TheTom
Owner

TheTom commented Apr 7, 2026

hey @apollosenvy, can you please rebase and test with HEAD? Lots has changed, and I apologize this fell through the cracks.

@Titaniumtown

I've been eyeing turboquant (and this PR) extensively because I have a small Intel dGPU in my home server and want to use it for small LLMs. :)

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@apollosenvy
Author

Rebased onto current feature/turboquant-kv-cache HEAD (a4e8af44). Clean cherry-pick, no conflicts.

Build: Vulkan with GL_KHR_cooperative_matrix — compiles clean.

Benchmark (AMD 7900 XTX, RADV NAVI31, Vulkan):

| Model | KV | pp128 (t/s) | tg128 (t/s) |
|-------|----|------------:|------------:|
| Qwen3-8B Q4_K_M | f16 baseline | 2780.81 | 120.85 |
| Qwen3-8B Q4_K_M | q8_0-K / turbo3-V | 393.85 | 44.52 |
| Mistral-24B Q4_K_S | q8_0-K / turbo3-V | 301.10 | 27.12 |

tg works well. The pp gap remains — this is the standalone dequant-to-F16 path, not inline turbo3 FA. Inline FA dequant would close it (noted in the original PR).

Quantize, dequant, get_rows, set_rows, and flash attention (via dequant path) all functional. Tested on RADV only.

@TheTom
Owner

TheTom left a comment

Built and tested on M5 Max (Metal4). No regression on Metal path:

| Test | tok/s | Config |
|-------|------:|--------|
| pp128 | 1359 | Qwen3.5-35B-A3B ConfigI, q8_0-K/turbo3-V |
| tg32 | 40.2 | same |

All 7 changed files are Vulkan-only — zero changes to Metal, CUDA, or core ggml. Clean compile, no warnings.

Can't validate Vulkan shaders locally (no Vulkan GPU on Mac), but the code looks correct — PolarQuant centroids match, norm correction included, bit packing matches the Metal kernel layout.

Ship it.

@TheTom TheTom merged commit eea498c into TheTom:feature/turboquant-kv-cache Apr 8, 2026
12 of 47 checks passed
@TheTom
Owner

TheTom commented Apr 8, 2026

And thank you @apollosenvy

@apollosenvy
Author

Update: fused flash attention for turbo3 on Vulkan

Added inline turbo3 dequant directly in the FA compute shaders (scalar + coopmat1 paths). The standalone dequant-to-F16 indirection is gone for symmetric turbo3/turbo3.

Benchmark (AMD 7900 XTX, RADV NAVI31, Vulkan, turbo3/turbo3):

| Model | Metric | Before (dequant path) | After (fused FA) | Gain |
|-------|--------|----------------------:|-----------------:|-----:|
| Qwen3-8B Q4_K_M | pp128 | 393.85 t/s | 958.67 t/s | +143% |
| Qwen3-8B Q4_K_M | tg128 | 44.52 t/s | 51.20 t/s | +15% |
| Mistral-24B Q4_K_S | pp128 | 301.10 t/s | 481.24 t/s | +60% |
| Mistral-24B Q4_K_S | tg128 | 27.12 t/s | 29.50 t/s | +9% |

What changed:

  • ggml-vulkan.cpp: CREATE_FA pipeline registration for GGML_TYPE_TURBO3_0 (scalar + cm1)
  • vulkan-shaders-gen.cpp: emit turbo3_0 variants for both FA shader types
  • flash_attn.comp / flash_attn_cm1.comp: enable FLASH_ATTN_DIRECT_TURBO3 define
  • flash_attn_base.glsl: inline dequantize4() with Lloyd-Max centroid lookup
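
The inline `dequantize4()` mentioned in the last bullet amounts to a per-block-norm-scaled table lookup. A minimal Python sketch with a made-up 8-entry table (the real Lloyd-Max centroid values live in the shader and ggml-turbo-quant.c), where indices 4..7 mirror 0..3 with the sign flipped, matching a `(sign << 2) | magnitude` index layout:

```python
# Hypothetical centroid table; entries 4..7 are negated mirrors of 0..3.
# The actual Lloyd-Max values in the shader differ.
CENTROIDS = [0.10, 0.35, 0.65, 1.0, -0.10, -0.35, -0.65, -1.0]

def dequantize4(norm, idx4):
    """Dequantize four elements at once: per-block norm times centroid."""
    return [norm * CENTROIDS[i] for i in idx4]
```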

Known limitations:

  • Asymmetric q8_0-K/turbo3-V not yet supported in fused path (falls back to dequant route). The Vulkan FA infrastructure requires K==V type; supporting asymmetric needs dual-dequant shader variants.
  • Remaining gap to HIP/ROCm (99% of f16) is hardware-level: HIP uses v_dot2_f32_f16 inline ASM and sparse V skipping, neither accessible from Vulkan GLSL.

iamwavecut pushed a commit to iamwavecut/llama-cpp-turboquant that referenced this pull request Apr 8, 2026
@Titaniumtown

I believe I found some issues on Intel GPUs. I'm going to work on this with Claude to see if I get anywhere with it.

KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 9, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 13, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 14, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 15, 2026
TheTom added a commit that referenced this pull request Apr 15, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 23, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 27, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
TheTom added a commit that referenced this pull request May 3, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR #33 + #87)
to the other two turbo types. Reported by @dpblnt in #50 with a clean
matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4
V abort with:

  pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0)
  that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct
  declarations matching the C side (ggml-common.h).

- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks
  for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4
  (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and
  reduction structure identical to turbo3 (QK = 128 across all three).
  Centroid + midpoint tables ported from CENTROIDS_2BIT and
  CENTROIDS_4BIT in ggml-turbo-quant.c.

- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added
  to the set_rows iteration list at line ~789.

- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op
  switch + dispatch element-count all extended with TURBO2_0 and
  TURBO4_0 cases.
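
The two pack widths described above can be sketched as plain bit packing (layouts assumed from the description, with QK = 128 per the commit text; the real quantize kernels of course also do the WHT setup and centroid search):

```python
# Assumed turbo2 (2-bit) and turbo4 (4-bit nibble) index packing.
# Neither format carries a signs byte: signs are folded into the
# centroid tables themselves.

def pack_2bit(indices):                      # turbo2: 4 centroids
    out = bytearray((len(indices) + 3) // 4)
    for i, idx in enumerate(indices):
        out[i // 4] |= (idx & 0x3) << (2 * (i % 4))
    return bytes(out)

def unpack_2bit(p, i):
    return (p[i // 4] >> (2 * (i % 4))) & 0x3

def pack_nibbles(indices):                   # turbo4: 16 centroids
    out = bytearray((len(indices) + 1) // 2)
    for i, idx in enumerate(indices):
        out[i // 2] |= (idx & 0xF) << (4 * (i % 2))
    return bytes(out)

def unpack_nibble(p, i):
    return (p[i // 2] >> (4 * (i % 2))) & 0xF
```

At QK = 128 this gives 32 bytes of packed indices per block for turbo2 and 64 for turbo4, plus whatever per-block scale the C-side structs carry.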

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe
(normally filtered out as eCpu); patch reverted before commit. The
SET_ROWS abort is a backend-capability check at graph build time so
it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv         | tg16 (t/s) | status        |
|-------------------|-----------:|---------------|
| q4_0 / q4_0       | 17.68      | baseline      |
| q4_0 / turbo3     | 5.91       | already worked|
| q4_0 / turbo4     | 6.14       | was aborting  |
| q4_0 / turbo2     | 5.65       | was aborting  |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they
are reported here only to confirm the abort is gone and the kernels
run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV
VF does not expose itself to RADV/amdvlk on cloud). Specifically:
- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate
a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:
1. The SET_ROWS abort that triggered #50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal
   implementations of the same quants

Closes #50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>