ggml: ARM NEON kernels for tbq3_0 and tbq4_0 + arch-fallback.h fix #1
CuriosityQuantified wants to merge 6 commits into elusznik:master from
Conversation
- `ggml-cpu/quants.c`: NEON 8-wide inner loop for `ggml_vec_dot_tbq3_0_q8_K_generic` and `ggml_vec_dot_tbq4_0_q8_K_generic` using the `vld1_s8`/`vmovl`/`vcvtq_f32_s32`/`vfmaq_f32` dual-accumulator pattern; scalar fallback retained for tail elements
- `ggml-turboq.c`: NEON path for `matvec_row` and `matvec_t`, added after the existing AVX2 block with 8-wide + 4-wide cleanup loops
- `arch-fallback.h`: add missing `ggml_vec_dot_tbq3_0_q8_K_generic` and `ggml_vec_dot_tbq4_0_q8_K_generic` defines in the `__aarch64__` section (linker failure without this on ARM64 builds)

Benchmarks (8K ctx, Qwen3.5-4B-Q4_K_M, M4 Mac mini, 4 threads, `-nkvo 1`):

- tbq4_0: 258 -> 276 t/s pp (+7%), 3.9x compression
- tbq3_0: 253 -> 274 t/s pp (+8%), 5.2x compression

Tested with: `cmake -DCMAKE_BUILD_TYPE=Release`, Metal enabled, flash attn on
Code Review
This pull request introduces TurboQuant (TBQ), a new quantization scheme, by adding GGML_TYPE_TBQ3_0 and GGML_TYPE_TBQ4_0 support. The implementation includes block definitions, reference quantization/dequantization logic using rotation matrices, and optimized CPU kernels with ARM NEON SIMD support. The changes also integrate these types into the KV cache, attention graph, and CLI tools, supported by new unit and backend tests. Feedback was provided to correct an inaccurate comment in the architecture fallback header that incorrectly stated a lack of native ARM implementations.
```c
// quants.c — TurboQuant vec_dot (no native ARM impl yet)
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K
```
The comment "// quants.c — TurboQuant vec_dot (no native ARM impl yet)" appears to be inaccurate. The pull request description explicitly states that ARM NEON SIMD kernels for tbq3_0 and tbq4_0 have been added to ggml-cpu/quants.c. Please update or remove this comment to reflect the presence of native ARM implementations.
Summary
Two changes bundled together since the second depends on the first:
1. `arch-fallback.h` ARM64 build fix

   `ggml_vec_dot_tbq3_0_q8_K_generic` and `ggml_vec_dot_tbq4_0_q8_K_generic` were missing from the `__aarch64__` section of `arch-fallback.h`, causing a linker failure on ARM64 builds (M-series Macs, ARM Linux). Added the missing defines to unblock native builds.

2. ARM NEON SIMD kernels for tbq3_0 and tbq4_0
   - `ggml-cpu/quants.c` — NEON path for `ggml_vec_dot_tbq3_0_q8_K_generic` and `ggml_vec_dot_tbq4_0_q8_K_generic`: `vld1_s8`/`vmovl`/`vcvtq_f32_s32`/`vfmaq_f32`
   - `ggml-turboq.c` — NEON path for `matvec_row` and `matvec_t` (`#elif defined(__ARM_NEON)`)

Note on architecture: TBQ's random rotation matrix means vec_dot cannot use int8 accumulation shortcuts, since the inverse rotation requires float32 dequantization. The 2× 128×128 rotation matmuls per block are the irreducible bottleneck; beating q4_0 would require a Walsh-Hadamard/butterfly transform (an algorithmic change, not a kernel-level one).
Benchmarks
Hardware: M4 Mac mini, 16GB unified memory, 4 threads, `-nkvo 1` (Metal incompatibility with TBQ SET_ROWS)

Model: Qwen3.5-4B-Q4_K_M, 8K context
Gap to q4_0 closed from ~50 t/s → ~16 t/s.
Build: `cmake -DCMAKE_BUILD_TYPE=Release`, Metal enabled, flash attn on.