feat: Vulkan compute shader support for turbo3 (experimental) #33
Conversation
Half-precision centroid table in vec flash attention dequant. Reduces constant cache pressure at high access volumes.

Decode improvements:
- Short: 75.3 → 77.2 (+2.5%)
- 8K: 59.2 → 67.3 (+13.7%)
- 48K (Mario PDF): 36.7 → 39.0 (+6.3%)
- PPL: unchanged (6.211)
- Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Force-pushed 63b832b to e9c54d5
Needs to be rebased to support gemma4 btw. At least I think so:
Hey @apollosenvy, can you please rebase and test with HEAD? A lot has changed, and I apologize this fell through the cracks.
I've been personally watching this PR because I have a small Intel dGPU in my home server and want to use it for small LLMs. I've been following turboquant (and this PR) closely :)
Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 9e80e93 to c2a70f3
Rebased onto current HEAD. Built Vulkan and benchmarked (AMD 7900 XTX, RADV NAVI31, Vulkan):

tg works well. The pp gap remains; this is the standalone dequant-to-F16 path, not inline turbo3 FA. Inline FA dequant would close it (noted in the original PR). Quantize, dequant, get_rows, set_rows, and flash attention (via the dequant path) are all functional. Tested on RADV only.
TheTom
left a comment
Built and tested on M5 Max (Metal4). No regression on Metal path:
| Test | tok/s | Config |
|---|---|---|
| pp128 | 1359 | Qwen3.5-35B-A3B ConfigI, q8_0-K/turbo3-V |
| tg32 | 40.2 | same |
All 7 changed files are Vulkan-only — zero changes to Metal, CUDA, or core ggml. Clean compile, no warnings.
Can't validate Vulkan shaders locally (no Vulkan GPU on Mac), but the code looks correct — PolarQuant centroids match, norm correction included, bit packing matches the Metal kernel layout.
Ship it.
Merged eea498c into TheTom:feature/turboquant-kv-cache
And thank you @apollosenvy
Update: fused flash attention for turbo3 on Vulkan

Added inline turbo3 dequant directly in the FA compute shaders (scalar + coopmat1 paths). The standalone dequant-to-F16 indirection is gone for symmetric turbo3/turbo3.

Benchmark (AMD 7900 XTX, RADV NAVI31, Vulkan, turbo3/turbo3):
What changed:
Known limitations:
I found some issues on Intel GPUs, I believe. I'm going to work on this with Claude to see if I get anywhere with it.
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR #33 + #87) to the other two turbo types.

Reported by @dpblnt in #50 with a clean matrix on RX 9060 XT: turbo3 V works on Vulkan, but turbo2/turbo4 V abort with `pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0) that cannot run the operation (SET_ROWS)` at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct declarations matching the C side (ggml-common.h).
- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4 (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and reduction structure identical to turbo3 (QK = 128 across all three). Centroid + midpoint tables ported from CENTROIDS_2BIT and CENTROIDS_4BIT in ggml-turbo-quant.c.
- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added to the set_rows iteration list at line ~789.
- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op switch + dispatch element-count all extended with TURBO2_0 and TURBO4_0 cases.

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe (normally filtered out as eCpu); patch reverted before commit. The SET_ROWS abort is a backend-capability check at graph build time, so it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv | tg16 (t/s) | status |
|-------------------|-----------:|----------------|
| q4_0 / q4_0 | 17.68 | baseline |
| q4_0 / turbo3 | 5.91 | already worked |
| q4_0 / turbo4 | 6.14 | was aborting |
| q4_0 / turbo2 | 5.65 | was aborting |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they are reported here only to confirm the abort is gone and the kernels run end-to-end without divergence.
## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV VF does not expose itself to RADV/amdvlk on cloud). Specifically:

- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality of the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, I would appreciate a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:

1. The SET_ROWS abort that triggered #50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal implementations of the same quants

Closes #50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Vulkan backend support for turbo3 KV cache quantization. Experimental -- works on AMD 7900 XTX via RADV.
New files:
- dequant_turbo3_0.comp -- standalone dequant shader (3-bit index from 2-bit qs + 1-bit signs)
- dequant_funcs.glsl -- inline dequant/dequant4/get_dm for get_rows/mul_mat
- dequant_funcs_cm2.glsl -- cooperative matrix 2 FA path
- copy_to_quant.comp -- quantize with norm correction
- types.glsl -- block_turbo3_0 struct
- vulkan-shaders-gen.cpp -- turbo3_0 type registration
- ggml-vulkan.cpp -- pipeline creation + supports_op

Benchmark (AMD 7900 XTX, RADV, Vulkan)
The tg numbers are already faster than ROCm HIP (25.2 t/s). The pp gap is from the standalone dequant path -- inline FA dequant would close it.
Status
Marked experimental. Tested on RADV only.
🤖 Generated with Claude Code