keeping up to date with turboquant on main v3 by InfernalDread · Pull Request #23 · InfernalDread/llama.cpp

InfernalDread · 2026-04-10T23:13:06Z

No description provided.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* hexagon: introduce op request batching and rewrite buffer managment The host now prepares batches of requests and dispatches them via a single dspqueue message. Buffers are mapped explicitly by NPU while processing batches. * hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops * hex-utils: add explicit l2flush and l2clear helpers * hex-opreq: use fine-grain per tensor l2 management * hex-opreq: avoid redundant invalidates for tensors we already flushed * hex-opreq: update debug messages * htp-opreq: reuse ops_context * hex-opreq: do not flush or invalidate cache lines beyond buffer boundry * hex-opreq: fix errors in log message * Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry" This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d. * hexagon: limit l2 flushes to 1MB which covers l2 cache * hex-opreq: limit cache flush to 4MB Looks like 4MB cont. vitual space should cover the 1MB cache. * hexagon: drop cache flush size to 2MB * hex-opreq: start reworking opreq packing * hex-opreq: introduce new way of packing opbatch where tensors are stored separately * hex-opreq: add a simple fastrpc call to force unmap all buffers * hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size * hex-opreq: bump opreq batch size to 256 * hex-mm: place src1 spad at the top of vtcm for easy reuse * hex-ops: introduce internal types and disable src1 reuse for now Nothing new just formalizing the repack / qyn.quant types we've been using. * htp-opreq: use tensor pointers instead of copies * hex-opreq: introduce more robust way for tracking vtcm/spad reuse This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops. * hex-cumsum: fix error post opreq merge * hex-opreq: move request batch handling into the session Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner. * hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx * hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers * hex-buf: add support for allocating shared/pinned buffer for opreqs * hex-opbatch: make opbatches configurable * hex-naming: better name for ggml_hexagon_shared_buffer * hex-naming: add session->c_name() helper * hex-opbatch: start using shm but still copy for now * hex-opbatch: use shared buffer for packing opbatch * hex-opbatch: beter naming for opbatch related classes and code * hex-opbatch: reuse batched tensors with same data/dims/strides * hex-opbatch: update logging * hex-opbatch: add support for vmem limit for op batching * hex-opbatch: update htp side to properly support dynamic mmap/unmap * hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing * hex-opbatch: fixed src1 handling in act ops * hex-act: fix empty src1 handling in swiglu and friends Simplify preamble macro while at it * hex-mm: minor fix vtcm and dma handling in matmul cleaning up some left-overs from merges * hex-opbatch: allocate extra 1KB for dspqueue overhead * hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc * hex-mm: properly handle hmx_disabled flag * hex-ops: update comments * hex-ops: add debug output for get/set-rows * hex-mmap: optimize un/mapping of buffers * hex-opreq: global cache flush and invalidate beyond 128KB threshold * hex-ops: add super simple opfilter regex for debugging If an Op matches the regex hex backend will reject it. * hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future * hexagon: improved vtcm acquision to remove inter-op overhead Fully compatible with QNN-HTP coex * hex-mm: fixed hvx fallback path * hex-mm: lower the vmem threshold a bit further to ~3GB * hexagon: update debug & error logs This also fixes an issue with newer llvm merging repack and non-repack functions. We use those pointer to distinguish between buffer types. * hexagon: move ops context into main context Just a cleanup. We don't need separate contexts at this point. * hex-opbatch: cleanup naming and headers for opbatch and related descriptors * hex-fa: it's now better to enable FA during TG to reduce graph splits * hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops if needed for debugging or validation. * hexagon: fixed editorconfig check * Update ggml/src/ggml-hexagon/ggml-hexagon.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* hexagon: add support for debian on ex2 * hexagon: add -fvectotize to c/c++ cmake flags * hexagon: remove trailing white space * update onboarding steps * hexagon: update linux setup documentation * hexagon: update intallation scripts * Hexagon: update docs * hexagon: update onboarding scripts --------- Co-authored-by: Zack Li <zackli@qti.qualcomm.com>

Codex review found: 1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail) 2. thread static is risky/non-portable in MSL Fixed: removed thread static caching, using plain thread locals. Speed unchanged (2.4 tok/s) — the static caching wasn't actually working on Metal. True optimization needs architectural change in flash attention kernel to dequantize once per block, not per chunk. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Root cause analysis: 8-32× redundant full-block dequantize per block from flash attention template. Four approaches documented with expected speedups and risk levels. Plan: D (reduce overhead) → A/B (eliminate redundant calls) Target: 2.4 tok/s → 20-40 tok/s Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…23 No-op dequant test: even returning all zeros from dequantize, turbo3 runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is NOT in the attention dequantize path. New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The Metal quantize_turbo3_0 function does 3 WHT rotations per KV write, totaling ~3200 ops per block × 224 blocks per token. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation to fail at runtime. The model silently fell back to CPU for ALL ops. ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU. After inlining the header: - MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal) - MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement) Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×) This is the SAME bug we hit with turbo-matrices.h earlier. Rule: NEVER use #include in ggml-metal.metal — always inline. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previous 2.4 tok/s was CPU fallback. Real Metal numbers: MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×) Qwopus: 5.3 tok/s gen (3.3× slower than q8_0) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full investigation log with all tests, results, and the root cause. Upstream TurboQuant activity tracked in #27. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…in) #23 Removing WHT rotation from dequant (quality broken, speed test only): gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0) prompt: 67.3 → 162.6 tok/s Confirms pre-rotate-queries would deliver ~49 tok/s. Remaining gap (49 vs 85) is block size + QJL overhead. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of inverse-rotating every K during dequant, rotate Q once before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>. Changes: - Store rotation matrix (R^T) in KV cache, filled after buffer clear - Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute - Strip turbo_rotate_inverse from Metal dequant - Dynamic cast to access rotation from mctx Results: - MoE gen: 10.7 → 51.4 tok/s (4.8× speedup) - MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup) - Now at 60% of q8_0 speed with 4.9× compression - Model produces coherent output Codex review: fixed buffer clear ordering (was zeroing rotation after init). Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…23 Full investigation log documenting every test, every dead end, and every breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries. Key lessons: no #include in Metal, no-op testing, pre-rotate-queries, buffer clear ordering, codex+roast catch real bugs. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%) MSE-only better on 99.3% of vectors including p1 tails. 3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[]. No QJL stage in quantize or dequant. Results: - MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%) - MoE prompt: 160 → 200 tok/s (90% of q8_0) - Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%) - Qwopus prompt: 67 → 83 tok/s (100% of q8_0!) Codex verified: bit packing correct, quantize/dequant consistent. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MoonRide303 and others added 5 commits April 10, 2026 21:45

model : make Gemma 4 shared-KV tail attn_k tensors optional on load (#…

e62fa13

…21739)

common : add callback interface for download progress (#21735)

05b3caa

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

common : better align to the updated official gemma4 template (#21704)

3fc6506

InfernalDread merged commit 5b30f46 into InfernalDread:turboquant_kv_cache_updated_v2 Apr 10, 2026
81 of 144 checks passed

github-actions Bot added documentation Improvements or additions to documentation examples ggml script build Hexagon labels Apr 10, 2026

InfernalDread pushed a commit that referenced this pull request Apr 23, 2026

docs: log simd_broadcast attempt — no speed improvement #23

8cf626a

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

InfernalDread pushed a commit that referenced this pull request Apr 23, 2026

docs: log threadgroup attempt — no speed improvement, rethinking #23

2448468

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

keeping up to date with turboquant on main v3#23

keeping up to date with turboquant on main v3#23
InfernalDread merged 5 commits intoInfernalDread:turboquant_kv_cache_updated_v2from
ggml-org:master

InfernalDread commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

InfernalDread commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants