Skip to content

keeping up to date with turboquant on main v3#23

Merged
InfernalDread merged 5 commits intoInfernalDread:turboquant_kv_cache_updated_v2from
ggml-org:master
Apr 10, 2026
Merged

keeping up to date with turboquant on main v3#23
InfernalDread merged 5 commits intoInfernalDread:turboquant_kv_cache_updated_v2from
ggml-org:master

Conversation

@InfernalDread
Copy link
Copy Markdown
Owner

No description provided.

MoonRide303 and others added 5 commits April 10, 2026 21:45
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* hexagon: introduce op request batching and rewrite buffer managment

The host now prepares batches of requests and dispatches them via a single dspqueue message.

Buffers are mapped explicitly by NPU while processing batches.

* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops

* hex-utils: add explicit l2flush and l2clear helpers

* hex-opreq: use fine-grain per tensor l2 management

* hex-opreq: avoid redundant invalidates for tensors we already flushed

* hex-opreq: update debug messages

* htp-opreq: reuse ops_context

* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry

* hex-opreq: fix errors in log message

* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"

This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

* hexagon: limit l2 flushes to 1MB which covers l2 cache

* hex-opreq: limit cache flush to 4MB

Looks like 4MB cont. vitual space should cover the 1MB cache.

* hexagon: drop cache flush size to 2MB

* hex-opreq: start reworking opreq packing

* hex-opreq: introduce new way of packing opbatch where tensors are stored separately

* hex-opreq: add a simple fastrpc call to force unmap all buffers

* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size

* hex-opreq: bump opreq batch size to 256

* hex-mm: place src1 spad at the top of vtcm for easy reuse

* hex-ops: introduce internal types and disable src1 reuse for now

Nothing new just formalizing the repack / qyn.quant types we've been using.

* htp-opreq: use tensor pointers instead of copies

* hex-opreq: introduce more robust way for tracking vtcm/spad reuse

This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

* hex-cumsum: fix error post opreq merge

* hex-opreq: move request batch handling into the session

Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.

* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx

* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers

* hex-buf: add support for allocating shared/pinned buffer for opreqs

* hex-opbatch: make opbatches configurable

* hex-naming: better name for ggml_hexagon_shared_buffer

* hex-naming: add session->c_name() helper

* hex-opbatch: start using shm but still copy for now

* hex-opbatch: use shared buffer for packing opbatch

* hex-opbatch: beter naming for opbatch related classes and code

* hex-opbatch: reuse batched tensors with same data/dims/strides

* hex-opbatch: update logging

* hex-opbatch: add support for vmem limit for op batching

* hex-opbatch: update htp side to properly support dynamic mmap/unmap

* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing

* hex-opbatch: fixed src1 handling in act ops

* hex-act: fix empty src1 handling in swiglu and friends

Simplify preamble macro while at it

* hex-mm: minor fix vtcm and dma handling in matmul

cleaning up some left-overs from merges

* hex-opbatch: allocate extra 1KB for dspqueue overhead

* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc

* hex-mm: properly handle hmx_disabled flag

* hex-ops: update comments

* hex-ops: add debug output for get/set-rows

* hex-mmap: optimize un/mapping of buffers

* hex-opreq: global cache flush and invalidate beyond 128KB threshold

* hex-ops: add super simple opfilter regex for debugging

If an Op matches the regex hex backend will reject it.

* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future

* hexagon: improved vtcm acquision to remove inter-op overhead

Fully compatible with QNN-HTP coex

* hex-mm: fixed hvx fallback path

* hex-mm: lower the vmem threshold a bit further to ~3GB

* hexagon: update debug & error logs

This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointer to distinguish between buffer types.

* hexagon: move ops context into main context

Just a cleanup. We don't need separate contexts at this point.

* hex-opbatch: cleanup naming and headers for opbatch and related descriptors

* hex-fa: it's now better to enable FA during TG to reduce graph splits

* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var

It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.

* hexagon: fixed editorconfig check

* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hexagon: add support for debian on ex2

* hexagon: add -fvectotize to c/c++ cmake flags

* hexagon: remove trailing white space

* update onboarding steps

* hexagon: update linux setup documentation

* hexagon: update intallation scripts

* Hexagon: update docs

* hexagon: update onboarding scripts

---------

Co-authored-by: Zack Li <zackli@qti.qualcomm.com>
@InfernalDread InfernalDread merged commit 5b30f46 into InfernalDread:turboquant_kv_cache_updated_v2 Apr 10, 2026
81 of 144 checks passed
@github-actions github-actions Bot added documentation Improvements or additions to documentation examples ggml script build Hexagon labels Apr 10, 2026
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Codex review found:
1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail)
2. thread static is risky/non-portable in MSL

Fixed: removed thread static caching, using plain thread locals.
Speed unchanged (2.4 tok/s) — the static caching wasn't actually working
on Metal. True optimization needs architectural change in flash attention
kernel to dequantize once per block, not per chunk.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Root cause analysis: 8-32× redundant full-block dequantize per block
from flash attention template. Four approaches documented with expected
speedups and risk levels.

Plan: D (reduce overhead) → A/B (eliminate redundant calls)
Target: 2.4 tok/s → 20-40 tok/s

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
…23

No-op dequant test: even returning all zeros from dequantize, turbo3
runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is
NOT in the attention dequantize path.

New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The
Metal quantize_turbo3_0 function does 3 WHT rotations per KV write,
totaling ~3200 ops per block × 224 blocks per token.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation
to fail at runtime. The model silently fell back to CPU for ALL ops.
ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU.

After inlining the header:
- MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal)
- MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement)

Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×)

This is the SAME bug we hit with turbo-matrices.h earlier.
Rule: NEVER use #include in ggml-metal.metal — always inline.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Previous 2.4 tok/s was CPU fallback. Real Metal numbers:
MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×)
Qwopus: 5.3 tok/s gen (3.3× slower than q8_0)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Full investigation log with all tests, results, and the root cause.
Upstream TurboQuant activity tracked in #27.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
…in) #23

Removing WHT rotation from dequant (quality broken, speed test only):
  gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0)
  prompt: 67.3 → 162.6 tok/s

Confirms pre-rotate-queries would deliver ~49 tok/s.
Remaining gap (49 vs 85) is block size + QJL overhead.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Instead of inverse-rotating every K during dequant, rotate Q once
before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>.

Changes:
- Store rotation matrix (R^T) in KV cache, filled after buffer clear
- Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute
- Strip turbo_rotate_inverse from Metal dequant
- Dynamic cast to access rotation from mctx

Results:
- MoE gen: 10.7 → 51.4 tok/s (4.8× speedup)
- MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup)
- Now at 60% of q8_0 speed with 4.9× compression
- Model produces coherent output

Codex review: fixed buffer clear ordering (was zeroing rotation after init).
Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
…23

Full investigation log documenting every test, every dead end, and every
breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries.

Key lessons: no #include in Metal, no-op testing, pre-rotate-queries,
buffer clear ordering, codex+roast catch real bugs.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%)
MSE-only better on 99.3% of vectors including p1 tails.

3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[].
No QJL stage in quantize or dequant.

Results:
- MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%)
- MoE prompt: 160 → 200 tok/s (90% of q8_0)
- Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%)
- Qwopus prompt: 67 → 83 tok/s (100% of q8_0!)

Codex verified: bit packing correct, quantize/dequant consistent.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build documentation Improvements or additions to documentation examples ggml Hexagon script

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants