turboquant_kv_cache_updated_v9 by InfernalDread · Pull Request #31 · InfernalDread/llama.cpp

InfernalDread · 2026-04-17T20:56:35Z

No description provided.

* optimize hmx_mat_mul functions by calculating row and column tiles upfront * refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability * wip * set scale outside of loop * wip * refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts * wip * wip * refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions * refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation * wip * refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization * refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking * refactor core_dot_chunk_fp16 to improve tile stride calculations for output * refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization * fix compiling error * wip * optimize row and column tile indexing in core_mma_chunk_fp16 function * wip * Revert "wip" This reverts commit cde679e. * Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions * wip

…dreno (#21938) * opencl: refactor q8_0 gemm/gemv Adreno dispatch * opencl: refactor q8_0 set_tensor * opencl: fix whitespace

* model : Gemma4 model type detection * model : Gemma4 model type detection

* cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()

* server: respect the ignore eos flag * ci: add android arm64 build and release * patch * pin android-setup actions to v4 * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * lf in the suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* CUDA: use a ring-buffer for cuda graphs * bump limit to 128 * use LRU eviction * better naming * do periodic clean-up

…ng (#21052) * Update workflows to remove dependence on llvmpipe * Try setting Dawn_DIR * remove c++20 initializers * Move to proper guid * Try avoiding segfaults on vulkan backend process exit * Remove compiler warnings on parameter casting * Fix soft_max and update reg_tile accumulation to f32 for better precision * Refactor flash_attn a bit * remove c++20 initializers and format * Increase div precision for NVIDIA * revert div precision and comment out ggml-ci node for now * Formatting * Try debugging on a failing CI node * Revert "Try debugging on a failing CI node" This reverts commit 1971e33.

…31 Block 128: PPL=165.6 (same as block 32) Disabled Q rotation: PPL=165.6 (same) Root cause: dynamic_cast fails for MoE hybrid memory context. Q rotation and V inverse rotation never execute. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…#30 ROOT CAUSE: pre-rotate-queries never executed because: 1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128 2. mctx dynamic_cast failed for MoE hybrid memory FIX: put inverse WHT rotation back in dequantize_full_block. This is slower (10.7 tok/s vs 77.7) but produces CORRECT results. PERPLEXITY RESULTS: - f16: 6.121 - q8_0: 6.111 - q4_0: 6.142 - turbo3: 6.194 (+1.2% vs q8_0) ✅ The speed optimization (pre-rotate-queries) needs to be reimplemented to work with GQA head layout and hybrid memory types. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chraac and others added 10 commits April 16, 2026 13:48

cmake: use glob to collect src/models sources (#22005)

089dd41

cli : use get_media_marker (#22017)

30dce2c

opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for A…

5e6c0e1

…dreno (#21938) * opencl: refactor q8_0 gemm/gemv Adreno dispatch * opencl: refactor q8_0 set_tensor * opencl: fix whitespace

model : Gemma4 model type detection (#22027)

fcc7508

* model : Gemma4 model type detection * model : Gemma4 model type detection

mtmd: add missing struct tag (#22023)

268d61e

CUDA: use LRU based eviction for cuda graphs (#21611)

b94050e

* CUDA: use a ring-buffer for cuda graphs * bump limit to 128 * use LRU eviction * better naming * do periodic clean-up

InfernalDread merged commit 62150cb into InfernalDread:turboquant_kv_cache_updated_v9 Apr 17, 2026
92 of 158 checks passed

github-actions Bot added examples server Nvidia GPU ggml testing devops build Hexagon OpenCL SYCL WebGPU labels Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

turboquant_kv_cache_updated_v9#31

turboquant_kv_cache_updated_v9#31
InfernalDread merged 10 commits intoInfernalDread:turboquant_kv_cache_updated_v9from
ggml-org:master

InfernalDread commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants

Conversation

InfernalDread commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants