
turboquant_kv_cache_updated_v8 #30

Merged
InfernalDread merged 16 commits into InfernalDread:turboquant_kv_cache_updated_v8 from ggml-org:master
Apr 16, 2026

Conversation

@InfernalDread
Owner

No description provided.

ngxson and others added 16 commits April 15, 2026 23:52
* server: use random media marker

* nits

* remove legacy <__image__> token

* revert special char in random
…21638)

* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM

The Q8_0 reorder optimization (#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation had reordered the Q8_0 weights (via the DMMV/MMVQ
paths), the next prompt-processing pass read them with the standard
dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes #21589

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
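A minimal sketch of what such a reorder-aware dequantizer can look like, assuming the reorder pass stores Q8_0 as structure-of-arrays (all qs byte arrays first, then all scales), the pattern the existing Q4_0 reorder uses. Scales are modeled as plain float here where the real format stores fp16; this is not the merged kernel.

```cpp
#include <cstdint>

constexpr int QK8_0 = 32; // elements per Q8_0 block

// Assumed SoA layout after reorder: [qs of all blocks][scales of all blocks].
static void dequantize_block_q8_0_reorder_sketch(
        const void * vx, float * y, int64_t ib, int64_t nblocks) {
    const int8_t * qs = (const int8_t *) vx;                     // qs region first
    const float  * d  = (const float  *) (qs + nblocks * QK8_0); // then the scales
    const float scale = d[ib];
    for (int j = 0; j < QK8_0; ++j) {
        y[ib * QK8_0 + j] = qs[ib * QK8_0 + j] * scale;
    }
}
```

Wiring this into ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl() then amounts to dispatching to the reorder-aware variant whenever the tensor is flagged as reordered, matching the existing Q4_0/Q4_K/Q6_K cases.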

* SYCL: fix reorder crash when device memory is full

The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes #20478
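A sketch of the described allocation strategy; the function name and out-parameter are illustrative, and it assumes the SYCL USM allocators return nullptr on failure rather than throwing.

```cpp
#include <sycl/sycl.hpp>

// Returns nullptr only if both allocations fail; the caller then skips the
// reorder and keeps the unoptimized kernel path.
static void * alloc_reorder_tmp(sycl::queue & q, size_t bytes, bool & on_host) {
    void * ptr = sycl::malloc_device(bytes, q);   // fast path: buffer in VRAM
    if (ptr != nullptr) {
        on_host = false;
        return ptr;
    }
    ptr = sycl::malloc_host(bytes, q);            // slow path: kernel reads over PCIe
    on_host = (ptr != nullptr);
    return ptr;
}
```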

* SYCL: add RAII temp buffer class + macro guard for host fallback

Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.

Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR #21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
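A sketch of what the RAII wrapper plus macro guard can look like; the class name and cmake option come from the commit, the internals are assumptions.

```cpp
#include <sycl/sycl.hpp>

class sycl_reorder_temp_buffer {
  public:
    sycl_reorder_temp_buffer(sycl::queue & q, size_t bytes) : q(q) {
        ptr = sycl::malloc_device(bytes, q);
#ifdef GGML_SYCL_HOST_MEM_FALLBACK       // cmake option, default ON
        if (ptr == nullptr) {
            ptr = sycl::malloc_host(bytes, q);
            host_fallback = (ptr != nullptr);
        }
#endif
    }
    ~sycl_reorder_temp_buffer() {
        if (ptr != nullptr) {
            sycl::free(ptr, q);          // sycl::free handles both USM kinds
        }
    }
    sycl_reorder_temp_buffer(const sycl_reorder_temp_buffer &) = delete;
    sycl_reorder_temp_buffer & operator=(const sycl_reorder_temp_buffer &) = delete;

    void * get()     const { return ptr; }
    bool   ok()      const { return ptr != nullptr; }
    bool   on_host() const { return host_fallback; }

  private:
    sycl::queue & q;
    void * ptr = nullptr;
    bool host_fallback = false;          // private: cleanup is automatic at scope exit
};
```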

* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K

Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.

Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
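The principle, illustrated on the simpler SoA Q8_0 layout from the sketch further up (Q4_K and Q6_K superblocks carry per-sub-block scales and are considerably more involved): the dot-product math is unchanged, only the addressing differs.

```cpp
#include <cstdint>

constexpr int QK = 32; // block size, as in the Q8_0 sketch above

// One row of the mat-vec product: dot(quantized row, f32 vector x),
// reading quants and scales from separate SoA regions.
static float dmmv_row_reorder_sketch(
        const int8_t * qs_base, const float * d_base,
        const float * x, int64_t nblocks) {
    float sum = 0.0f;
    for (int64_t ib = 0; ib < nblocks; ++ib) {
        float partial = 0.0f;
        for (int j = 0; j < QK; ++j) {
            partial += qs_base[ib * QK + j] * x[ib * QK + j];
        }
        sum += partial * d_base[ib];     // per-block scale applied once
    }
    return sum;
}
```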
…21873)

* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for Chrome, I'm blaming Dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function

* Move to a single query set for GPU profiling

* Move to batching compute passes when not profiling

* Refactor build_multi

* remove iOS throttling now that we're batching compute passes
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
…20633)

* ggml-cpu: add 128-bit impls for i-quants, ternary quants

* ggml-cpu: add 128-bit impls for iq2_xs, iq3_s, iq3_xxs, tq2_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor; add rvv checks

---------

Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
* nix: support unified apple-sdk

* Impl roll op for Metal

* Revert "nix: support unified apple-sdk"

This reverts commit abfa473.

* update ops.md

* update op docs
* ggml: add graph_reused

* use versioning instead of reuse flag

* increment version with atomic

* use top bits for split numbering

* add assert

* move counter to ggml.c

* set uid in split_graph only

* fix windows

* address further review comments

* get next_uid rather than doing bit manipulation

* rename + add comment about uid
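Taken together, the commits above replace a boolean reuse flag with a version number: a global atomic counter in ggml.c hands out uids, with the top bits reserved for split numbering so callers fetch the next uid instead of doing bit manipulation themselves. A sketch under those assumptions (names and bit widths are illustrative, not the merged code):

```cpp
#include <atomic>
#include <cstdint>

static std::atomic<uint64_t> g_graph_uid_counter{0};

constexpr int      SPLIT_BITS = 16;                              // top bits: split index
constexpr uint64_t UID_MASK   = (1ull << (64 - SPLIT_BITS)) - 1;

// Callers get the next uid; no bit manipulation at the call site.
static uint64_t ggml_next_graph_uid(void) {
    return g_graph_uid_counter.fetch_add(1, std::memory_order_relaxed) & UID_MASK;
}

// split_graph stamps each split with the graph uid plus its split number.
static uint64_t make_split_uid(uint64_t graph_uid, uint32_t split_no) {
    return ((uint64_t) split_no << (64 - SPLIT_BITS)) | graph_uid;
}
```

A backend can then detect graph reuse by comparing uids instead of trusting a flag that something forgot to reset.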
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns

* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
* support nvfp4 tensors for Gemma4

* add wo_s to build_attn

* add wo_s to build_attn

* fix glm4
…ers (#21245)

* model : refactor QKV into common build_qkv and create_tensor_qkv helpers

* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
…#21980)

* server: tests: fetch random media marker via /apply-template (#21962 fix)

* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var

get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.

Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.

Address review feedback from ngxson

* server: make get_media_marker() thread-safe via magic statics

Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.

Address review feedback from ggerganov

* nits

* nits
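A sketch of the magic-statics pattern described above, combined with the LLAMA_MEDIA_MARKER pinning from the preceding commit; the random-marker format here is a stand-in. C++ guarantees the static local is initialized exactly once, even under concurrent first calls, so no explicit lock is needed.

```cpp
#include <cstdlib>
#include <random>
#include <string>

static std::string get_media_marker() {
    static const std::string marker = []() -> std::string {
        if (const char * env = std::getenv("LLAMA_MEDIA_MARKER")) {
            return env;                    // pinned marker, used as-is (tests)
        }
        std::random_device rd;             // fallback: random, once per process
        return "<__media_" + std::to_string(rd()) + "__>";
    }();
    return marker;
}
```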
* model: using single llm_build per arch

* fix merge

* nits
InfernalDread merged commit c2c4f14 into InfernalDread:turboquant_kv_cache_updated_v8 Apr 16, 2026
102 of 169 checks passed
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Perplexity benchmarking reveals catastrophic quality failure:
- f16: 6.121, q8_0: 6.111, q4_0: 6.142
- turbo3: 165.6 (27× worse)

Speed benchmarks were meaningless — fast garbage.
Root cause investigation needed before any quality claims.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
1. V cache returns rotated-space values (cosine=0.02 vs correct 0.987)
2. dynamic_cast to llama_kv_cache_context fails for MoE models
   (uses llama_memory_hybrid_context, not kv_cache_context)
   → Q rotation and V inverse rotation NEVER executed

Fix: store rotation tensors in llm_graph_context, not KV cache.
Or access through hybrid memory interface.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
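A minimal illustration of why the failed dynamic_cast was silent rather than fatal; the class names are taken from the commit message, the interfaces are simplified stand-ins.

```cpp
struct llama_memory_context_i      { virtual ~llama_memory_context_i() = default; };
struct llama_kv_cache_context      : llama_memory_context_i { /* rotation tensors */ };
struct llama_memory_hybrid_context : llama_memory_context_i { /* MoE models */ };

static void apply_q_rotation(llama_memory_context_i * mctx) {
    auto * kv = dynamic_cast<llama_kv_cache_context *>(mctx);
    if (kv == nullptr) {
        // Hybrid (MoE) contexts land here: the rotation was silently skipped,
        // so there was no error, just wrong math downstream.
        return;
    }
    // ... rotate Q / inverse-rotate V using kv's rotation tensors ...
}
```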
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
…#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128
2. mctx dynamic_cast failed for MoE hybrid memory

FIX: put inverse WHT rotation back in dequantize_full_block.
This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16:     6.121
- q8_0:    6.111
- q4_0:    6.142
- turbo3:  6.194 (+1.2% vs f16) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented
to work with GQA head layout and hybrid memory types.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
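The rotation being undone is a Walsh-Hadamard transform, which has a standard in-place O(n log n) form. A sketch of the shape of the fix, assuming an unnormalized forward transform was applied before quantization (dequantize_full_block's actual signature is not shown in this PR):

```cpp
// In-place fast Walsh-Hadamard transform; n must be a power of two.
// With the trailing 1/n scaling this is the exact inverse of the
// unnormalized forward transform.
static void inverse_wht_inplace(float * v, int n) {
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j];
                const float b = v[j + len];
                v[j]       = a + b;      // butterfly
                v[j + len] = a - b;
            }
        }
    }
    const float scale = 1.0f / (float) n;
    for (int i = 0; i < n; ++i) {
        v[i] *= scale;
    }
}
```

Running this per dequantized block is what drops decoding to ~10.7 tok/s; per the commit, the fast path would need the 128-wide rotation applied per head slice of the 256-wide concatenated GQA Q tensor, plus KV-cache access that works through the hybrid memory interface.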
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InfernalDread pushed a commit that referenced this pull request Apr 23, 2026
Quality confirmed: PPL 6.194 (+1.4% vs q8_0)
Speed: 10.7 tok/s (inverse rotation in dequant, no pre-rotate-queries)
Previous speed claims (51-77 tok/s) were invalid; they measured the throughput of garbage output.

Key lessons documented for future reference.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
