feat: Vulkan compute shader support for turbo3 (experimental) #33

Merged

TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache from apollosenvy:pr/vulkan-turbo3 on Apr 8, 2026

Conversation

@apollosenvy

Summary

Vulkan backend support for turbo3 KV cache quantization. Experimental -- works on AMD 7900 XTX via RADV.

New files:

  • dequant_turbo3_0.comp -- standalone dequant shader (3-bit index from 2-bit qs + 1-bit signs)
  • dequant_funcs.glsl -- inline dequant/dequant4/get_dm for get_rows/mul_mat
  • dequant_funcs_cm2.glsl -- cooperative matrix 2 FA path
  • copy_to_quant.comp -- quantize with norm correction
  • types.glsl -- block_turbo3_0 struct
  • vulkan-shaders-gen.cpp -- turbo3_0 type registration
  • ggml-vulkan.cpp -- pipeline creation + supports_op
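
To make the split storage concrete, here is a rough sketch (not the actual shader code) of the 3-bit index reconstruction named above: each element keeps a 2-bit magnitude index in `qs` and a 1-bit sign in `signs`, recombined into a 3-bit centroid index at dequant time. The 32-elements-per-block figure is inferred from the `qs[8]`/`signs[4]` field sizes and may not match the real block geometry.

```python
# Sketch of the assumed 2-bit qs + 1-bit signs split; the real shader's
# block layout and element count may differ.

def pack_block(indices):
    """Split 3-bit indices into the 2-bit qs / 1-bit signs arrays."""
    qs, signs = bytearray(8), bytearray(4)
    for i, idx in enumerate(indices):
        qs[i // 4] |= (idx & 0x3) << (2 * (i % 4))      # low 2 bits
        signs[i // 8] |= ((idx >> 2) & 0x1) << (i % 8)  # sign bit
    return bytes(qs), bytes(signs)

def unpack_index(qs, signs, i):
    """Rebuild the 3-bit centroid index for element i."""
    q = (qs[i // 4] >> (2 * (i % 4))) & 0x3
    s = (signs[i // 8] >> (i % 8)) & 0x1
    return (s << 2) | q
```

Round-tripping 32 indices through `pack_block`/`unpack_index` recovers them exactly; the shader presumably does the unpack step per element before its centroid lookup.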

Benchmark (AMD 7900 XTX, RADV, Vulkan)

| Test | F16 KV (t/s) | turbo3 KV (t/s) | Ratio |
|-------|-------------:|----------------:|------:|
| pp128 | 748 | 264 | 35% |
| tg32 | 36.0 | 27.4 | 76% |
| tg128 | 33.3 | 26.4 | 79% |

The tg numbers are already faster than ROCm HIP (25.2 t/s). The pp gap is from the standalone dequant path -- inline FA dequant would close it.

Status

  • Quantize/dequant: working
  • get_rows: working
  • set_rows: working
  • Flash attention: works via dequant-to-F16 path (not inline turbo3 FA)
  • coopmat2 FA: shader compiles, untested on hardware

Marked experimental. Tested on RADV only.

🤖 Generated with Claude Code

terrysimons pushed a commit to terrysimons/llama-cpp-turboquant that referenced this pull request Mar 31, 2026
Half-precision centroid table in vec flash attention dequant.
Reduces constant cache pressure at high access volumes.

Decode improvements:
  Short: 75.3 → 77.2 (+2.5%)
  8K: 59.2 → 67.3 (+13.7%)
  48K (Mario PDF): 36.7 → 39.0 (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
TheTom added a commit that referenced this pull request Apr 2, 2026
TheTom force-pushed the feature/turboquant-kv-cache branch from 63b832b to e9c54d5 on April 3, 2026 16:14
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 6, 2026
@Titaniumtown

Titaniumtown commented Apr 7, 2026

Needs to be rebased to support gemma4 btw. At least I think so:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'

@TheTom
Owner

TheTom commented Apr 7, 2026

hey @apollosenvy, can you please rebase and test with HEAD? Lots has changed, and I apologize this fell through the cracks.

@Titaniumtown

I've been eyeing turboquant (and this PR) extensively because I have a small Intel dGPU in my home server and want to use it for small LLMs. :)

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@apollosenvy
Author

Rebased onto current feature/turboquant-kv-cache HEAD (a4e8af44). Clean cherry-pick, no conflicts.

Build: Vulkan with GL_KHR_cooperative_matrix — compiles clean.

Benchmark (AMD 7900 XTX, RADV NAVI31, Vulkan):

| Model | KV | pp128 (t/s) | tg128 (t/s) |
|-------|----|------------:|------------:|
| Qwen3-8B Q4_K_M | f16 baseline | 2780.81 | 120.85 |
| Qwen3-8B Q4_K_M | q8_0-K / turbo3-V | 393.85 | 44.52 |
| Mistral-24B Q4_K_S | q8_0-K / turbo3-V | 301.10 | 27.12 |

tg works well. The pp gap remains — this is the standalone dequant-to-F16 path, not inline turbo3 FA. Inline FA dequant would close it (noted in the original PR).

Quantize, dequant, get_rows, set_rows, and flash attention (via dequant path) all functional. Tested on RADV only.

@TheTom
Owner

TheTom left a comment

Built and tested on M5 Max (Metal4). No regression on Metal path:

| Test | tok/s | Config |
|-------|------:|--------|
| pp128 | 1359 | Qwen3.5-35B-A3B ConfigI, q8_0-K/turbo3-V |
| tg32 | 40.2 | same |

All 7 changed files are Vulkan-only — zero changes to Metal, CUDA, or core ggml. Clean compile, no warnings.

Can't validate Vulkan shaders locally (no Vulkan GPU on Mac), but the code looks correct — PolarQuant centroids match, norm correction included, bit packing matches the Metal kernel layout.

Ship it.

@TheTom TheTom merged commit eea498c into TheTom:feature/turboquant-kv-cache Apr 8, 2026
12 of 47 checks passed
@TheTom
Owner

TheTom commented Apr 8, 2026

And thank you @apollosenvy

@apollosenvy
Author

Update: fused flash attention for turbo3 on Vulkan

Added inline turbo3 dequant directly in the FA compute shaders (scalar + coopmat1 paths). The standalone dequant-to-F16 indirection is gone for symmetric turbo3/turbo3.

Benchmark (AMD 7900 XTX, RADV NAVI31, Vulkan, turbo3/turbo3):

| Model | Metric | Before (dequant path) | After (fused FA) | Gain |
|-------|--------|----------------------:|-----------------:|-----:|
| Qwen3-8B Q4_K_M | pp128 | 393.85 t/s | 958.67 t/s | +143% |
| Qwen3-8B Q4_K_M | tg128 | 44.52 t/s | 51.20 t/s | +15% |
| Mistral-24B Q4_K_S | pp128 | 301.10 t/s | 481.24 t/s | +60% |
| Mistral-24B Q4_K_S | tg128 | 27.12 t/s | 29.50 t/s | +9% |

What changed:

  • ggml-vulkan.cpp: CREATE_FA pipeline registration for GGML_TYPE_TURBO3_0 (scalar + cm1)
  • vulkan-shaders-gen.cpp: emit turbo3_0 variants for both FA shader types
  • flash_attn.comp / flash_attn_cm1.comp: enable FLASH_ATTN_DIRECT_TURBO3 define
  • flash_attn_base.glsl: inline dequantize4() with Lloyd-Max centroid lookup
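
The inline `dequantize4()` mentioned in the last bullet amounts to a per-block-norm-scaled table lookup. A minimal Python sketch with a made-up 8-entry table (the real Lloyd-Max centroid values live in the shader and ggml-turbo-quant.c), where indices 4..7 mirror 0..3 with the sign flipped, matching a `(sign << 2) | magnitude` index layout:

```python
# Hypothetical centroid table; entries 4..7 are negated mirrors of 0..3.
# The actual Lloyd-Max values in the shader differ.
CENTROIDS = [0.10, 0.35, 0.65, 1.0, -0.10, -0.35, -0.65, -1.0]

def dequantize4(norm, idx4):
    """Dequantize four elements at once: per-block norm times centroid."""
    return [norm * CENTROIDS[i] for i in idx4]
```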

Known limitations:

  • Asymmetric q8_0-K/turbo3-V not yet supported in fused path (falls back to dequant route). The Vulkan FA infrastructure requires K==V type; supporting asymmetric needs dual-dequant shader variants.
  • Remaining gap to HIP/ROCm (99% of f16) is hardware-level: HIP uses v_dot2_f32_f16 inline ASM and sparse V skipping, neither accessible from Vulkan GLSL.

iamwavecut pushed a commit to iamwavecut/llama-cpp-turboquant that referenced this pull request Apr 8, 2026
@Titaniumtown

I believe I found some issues on Intel GPUs. I'm going to work on this with Claude to see if I get anywhere with it.

KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 9, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 13, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 14, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 15, 2026
TheTom added a commit that referenced this pull request Apr 15, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 23, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 27, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
TheTom added a commit that referenced this pull request May 3, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR #33 + #87)
to the other two turbo types. Reported by @dpblnt in #50 with a clean
matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4
V abort with:

  pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0)
  that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct
  declarations matching the C side (ggml-common.h).

- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks
  for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4
  (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and
  reduction structure identical to turbo3 (QK = 128 across all three).
  Centroid + midpoint tables ported from CENTROIDS_2BIT and
  CENTROIDS_4BIT in ggml-turbo-quant.c.

- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added
  to the set_rows iteration list at line ~789.

- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op
  switch + dispatch element-count all extended with TURBO2_0 and
  TURBO4_0 cases.
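
The two pack widths described above can be sketched as plain bit packing (layouts assumed from the description, with QK = 128 per the commit text; the real quantize kernels of course also do the WHT setup and centroid search):

```python
# Assumed turbo2 (2-bit) and turbo4 (4-bit nibble) index packing.
# Neither format carries a signs byte: signs are folded into the
# centroid tables themselves.

def pack_2bit(indices):                      # turbo2: 4 centroids
    out = bytearray((len(indices) + 3) // 4)
    for i, idx in enumerate(indices):
        out[i // 4] |= (idx & 0x3) << (2 * (i % 4))
    return bytes(out)

def unpack_2bit(p, i):
    return (p[i // 4] >> (2 * (i % 4))) & 0x3

def pack_nibbles(indices):                   # turbo4: 16 centroids
    out = bytearray((len(indices) + 1) // 2)
    for i, idx in enumerate(indices):
        out[i // 2] |= (idx & 0xF) << (4 * (i % 2))
    return bytes(out)

def unpack_nibble(p, i):
    return (p[i // 2] >> (4 * (i % 2))) & 0xF
```

At QK = 128 this gives 32 bytes of packed indices per block for turbo2 and 64 for turbo4, plus whatever per-block scale the C-side structs carry.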

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe
(normally filtered out as eCpu); patch reverted before commit. The
SET_ROWS abort is a backend-capability check at graph build time so
it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv         | tg16 (t/s) | status        |
|-------------------|-----------:|---------------|
| q4_0 / q4_0       | 17.68      | baseline      |
| q4_0 / turbo3     | 5.91       | already worked|
| q4_0 / turbo4     | 6.14       | was aborting  |
| q4_0 / turbo2     | 5.65       | was aborting  |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they
are reported here only to confirm the abort is gone and the kernels
run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV
VF does not expose itself to RADV/amdvlk on cloud). Specifically:
- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate
a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:
1. The SET_ROWS abort that triggered #50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal
   implementations of the same quants

Closes #50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>