ggml: ARM NEON kernels for tbq3_0 and tbq4_0 + arch-fallback.h fix #1

Open
CuriosityQuantified wants to merge 6 commits into elusznik:master from CuriosityQuantified:neon-arm-optimization

Conversation

@CuriosityQuantified

Summary

Two changes bundled together since the second depends on the first:

1. arch-fallback.h ARM64 build fix

ggml_vec_dot_tbq3_0_q8_K_generic and ggml_vec_dot_tbq4_0_q8_K_generic were missing from the __aarch64__ section of arch-fallback.h, causing a linker failure on ARM64 builds (M-series Macs, ARM Linux). Added the missing defines to unblock native builds.
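A minimal sketch of the added mapping, mirroring the two defines quoted in the review thread below (names are from this PR; the fix is their placement inside the __aarch64__ section):

```c
// arch-fallback.h, __aarch64__ section — sketch of the added defines
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K
```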

2. ARM NEON SIMD kernels for tbq3_0 and tbq4_0

ggml-cpu/quants.c — NEON path for ggml_vec_dot_tbq3_0_q8_K_generic and ggml_vec_dot_tbq4_0_q8_K_generic:

  • 8-wide inner loop using vld1_s8 / vmovl / vcvtq_f32_s32 / vfmaq_f32
  • Dual accumulator pattern to reduce dependency stalls
  • Scalar fallback retained for tail elements

ggml-turboq.c — NEON path for matvec_row and matvec_t:

  • Added after existing AVX2 block, same guard (#elif defined(__ARM_NEON))
  • 8-wide + 4-wide cleanup loops

Note on architecture: TBQ's random rotation matrix means vec_dot cannot use int8 accumulation shortcuts — the inverse rotation requires float32 dequantization. The 2× 128×128 rotation matmuls per block are the irreducible bottleneck; beating q4_0 would require a Walsh-Hadamard/butterfly transform (algorithmic change, not kernel-level).

Benchmarks

Hardware: M4 Mac mini, 16GB unified memory, 4 threads, -nkvo 1 (Metal incompatibility with TBQ SET_ROWS)
Model: Qwen3.5-4B-Q4_K_M, 8K context

| type | pp t/s (before) | pp t/s (after) | gain |
|---|---|---|---|
| tbq4_0 | 258 | 276 | +7% |
| tbq3_0 | 253 | 274 | +8% |
| q4_0 (ref) | – | 291 | – |

Gap to q4_0 closed from ~50 t/s → ~16 t/s.

Build: cmake -DCMAKE_BUILD_TYPE=Release, Metal enabled, flash attn on.

elusznik and others added 5 commits March 27, 2026 22:58
- ggml-cpu/quants.c: NEON 8-wide inner loop for ggml_vec_dot_tbq3_0_q8_K_generic
  and ggml_vec_dot_tbq4_0_q8_K_generic using vld1_s8/vmovl/vcvtq_f32_s32/vfmaq_f32
  dual accumulator pattern; scalar fallback retained for tail elements
- ggml-turboq.c: NEON path for matvec_row and matvec_t, added after existing
  AVX2 block with 8-wide + 4-wide cleanup loops
- arch-fallback.h: add missing ggml_vec_dot_tbq3_0_q8_K_generic and
  ggml_vec_dot_tbq4_0_q8_K_generic defines in __aarch64__ section (linker
  failure without this on ARM64 builds)

Benchmarks (8K ctx, Qwen3.5-4B-Q4_K_M, M4 Mac mini, 4 threads, -nkvo 1):
  tbq4_0: 258 -> 276 t/s pp (+7%), 3.9x compression
  tbq3_0: 253 -> 274 t/s pp (+8%), 5.2x compression

Tested with: cmake -DCMAKE_BUILD_TYPE=Release, Metal enabled, flash attn on

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces TurboQuant (TBQ), a new quantization scheme, by adding GGML_TYPE_TBQ3_0 and GGML_TYPE_TBQ4_0 support. The implementation includes block definitions, reference quantization/dequantization logic using rotation matrices, and optimized CPU kernels with ARM NEON SIMD support. The changes also integrate these types into the KV cache, attention graph, and CLI tools, supported by new unit and backend tests. Feedback was provided to correct an inaccurate comment in the architecture fallback header that incorrectly stated a lack of native ARM implementations.

Comment thread: ggml/src/ggml-cpu/arch-fallback.h (outdated), lines +84 to +86
// quants.c — TurboQuant vec_dot (no native ARM impl yet)
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K

Severity: medium

The comment "// quants.c — TurboQuant vec_dot (no native ARM impl yet)" appears to be inaccurate. The pull request description explicitly states that ARM NEON SIMD kernels for tbq3_0 and tbq4_0 have been added to ggml-cpu/quants.c. Please update or remove this comment to reflect the presence of native ARM implementations.
