ggml: ARM NEON kernels for tbq3_0 and tbq4_0 + arch-fallback.h fix #1

Open
CuriosityQuantified wants to merge 6 commits into elusznik:master from CuriosityQuantified:neon-arm-optimization

Conversation

@CuriosityQuantified

Summary

Two changes bundled together since the second depends on the first:

1. arch-fallback.h ARM64 build fix

ggml_vec_dot_tbq3_0_q8_K_generic and ggml_vec_dot_tbq4_0_q8_K_generic were missing from the __aarch64__ section of arch-fallback.h, causing a linker failure on ARM64 builds (M-series Macs, ARM Linux). Added the missing defines to unblock native builds.
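A minimal sketch of the added mapping, mirroring the two defines quoted in the review thread below (names are from this PR; the fix is their placement inside the __aarch64__ section):

```c
// arch-fallback.h, __aarch64__ section — sketch of the added defines
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K
```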

2. ARM NEON SIMD kernels for tbq3_0 and tbq4_0

ggml-cpu/quants.c — NEON path for ggml_vec_dot_tbq3_0_q8_K_generic and ggml_vec_dot_tbq4_0_q8_K_generic:

  • 8-wide inner loop using vld1_s8 / vmovl / vcvtq_f32_s32 / vfmaq_f32
  • Dual accumulator pattern to reduce dependency stalls
  • Scalar fallback retained for tail elements

ggml-turboq.c — NEON path for matvec_row and matvec_t:

  • Added after existing AVX2 block, same guard (#elif defined(__ARM_NEON))
  • 8-wide + 4-wide cleanup loops

Note on architecture: TBQ's random rotation matrix means vec_dot cannot use int8 accumulation shortcuts — the inverse rotation requires float32 dequantization. The 2× 128×128 rotation matmuls per block are the irreducible bottleneck; beating q4_0 would require a Walsh-Hadamard/butterfly transform (algorithmic change, not kernel-level).

Benchmarks

Hardware: M4 Mac mini, 16GB unified memory, 4 threads, -nkvo 1 (Metal incompatibility with TBQ SET_ROWS)
Model: Qwen3.5-4B-Q4_K_M, 8K context

| type | pp t/s (before) | pp t/s (after) | gain |
|---|---|---|---|
| tbq4_0 | 258 | 276 | +7% |
| tbq3_0 | 253 | 274 | +8% |
| q4_0 (ref) | – | 291 | – |

Gap to q4_0 closed from ~50 t/s → ~16 t/s.

Build: cmake -DCMAKE_BUILD_TYPE=Release, Metal enabled, flash attn on.

elusznik and others added 5 commits March 27, 2026 22:58
- ggml-cpu/quants.c: NEON 8-wide inner loop for ggml_vec_dot_tbq3_0_q8_K_generic
  and ggml_vec_dot_tbq4_0_q8_K_generic using vld1_s8/vmovl/vcvtq_f32_s32/vfmaq_f32
  dual accumulator pattern; scalar fallback retained for tail elements
- ggml-turboq.c: NEON path for matvec_row and matvec_t, added after existing
  AVX2 block with 8-wide + 4-wide cleanup loops
- arch-fallback.h: add missing ggml_vec_dot_tbq3_0_q8_K_generic and
  ggml_vec_dot_tbq4_0_q8_K_generic defines in __aarch64__ section (linker
  failure without this on ARM64 builds)

Benchmarks (8K ctx, Qwen3.5-4B-Q4_K_M, M4 Mac mini, 4 threads, -nkvo 1):
  tbq4_0: 258 -> 276 t/s pp (+7%), 3.9x compression
  tbq3_0: 253 -> 274 t/s pp (+8%), 5.2x compression

Tested with: cmake -DCMAKE_BUILD_TYPE=Release, Metal enabled, flash attn on

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces TurboQuant (TBQ), a new quantization scheme, by adding GGML_TYPE_TBQ3_0 and GGML_TYPE_TBQ4_0 support. The implementation includes block definitions, reference quantization/dequantization logic using rotation matrices, and optimized CPU kernels with ARM NEON SIMD support. The changes also integrate these types into the KV cache, attention graph, and CLI tools, supported by new unit and backend tests. Feedback was provided to correct an inaccurate comment in the architecture fallback header that incorrectly stated a lack of native ARM implementations.

Comment thread: ggml/src/ggml-cpu/arch-fallback.h (outdated), lines +84 to +86
// quants.c — TurboQuant vec_dot (no native ARM impl yet)
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K

Severity: medium

The comment "// quants.c — TurboQuant vec_dot (no native ARM impl yet)" appears to be inaccurate. The pull request description explicitly states that ARM NEON SIMD kernels for tbq3_0 and tbq4_0 have been added to ggml-cpu/quants.c. Please update or remove this comment to reflect the presence of native ARM implementations.
