
perf: add SIMD-accelerated u8 L2 and cosine distance kernels#6517

Merged
BubbleCal merged 3 commits into lance-format:main from justinrmiller:feat/u8-l2-cosine-simd on Apr 15, 2026

Conversation

@justinrmiller
Contributor

@justinrmiller justinrmiller commented Apr 14, 2026

Summary

  • Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2 distance (Σ(a-b)²) in new l2_u8.rs
  • Add fused single-pass u8 cosine distance kernel in new cosine_u8.rs — computes dot(a,b), ‖a‖², ‖b‖² simultaneously, halving memory traffic vs the previous 2-3 pass approach
  • Wire both into the L2 for u8 and Cosine for u8 trait impls
  • Add benchmarks comparing scalar vs SIMD for both kernels

Algorithmic approach (adapted from NumKong)

L2 (AVX2): Saturating subtraction for |a-b|, zero-extend u8→i16, VPMADDWD(diff, diff) to square and accumulate into i32. 32 elements/iter.

L2 (AVX-512 VNNI): Same abs-diff approach with VPDPWSSD for fused square-accumulate. 64 elements/iter.
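The abs-diff trick described above can be modeled in scalar Rust. This is an illustrative sketch only, not the code in `l2_u8.rs`; the real kernels process 32 or 64 lanes per iteration with AVX intrinsics, and the function names here are made up:

```rust
/// Scalar model of the SIMD kernels' |a - b| trick: two saturating
/// subtractions, mirroring what VPSUBUSB gives the vector path.
/// Exactly one of the two terms is non-zero (both zero when a == b).
fn abs_diff_u8(a: u8, b: u8) -> u8 {
    a.saturating_sub(b) | b.saturating_sub(a)
}

/// Squared L2 over u8 slices: widen |a - b|, square, and accumulate
/// into a 32-bit sum, matching what VPMADDWD does 16 lanes at a time.
fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = abs_diff_u8(x, y) as u32;
            d * d
        })
        .sum()
}

fn main() {
    assert_eq!(abs_diff_u8(3, 200), 197);
    assert_eq!(abs_diff_u8(200, 3), 197);
    // (0-255)² + (255-0)² + (10-7)² = 65025 + 65025 + 9
    assert_eq!(l2_u8_scalar(&[0, 255, 10], &[255, 0, 7]), 130_059);
    println!("ok");
}
```

The saturating form matters because x86 has no unsigned byte subtraction with sign extension; computing |a - b| directly in u8 avoids widening before the subtract.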

Cosine (AVX2): Zero-extend both vectors to i16, triple VPMADDWD per half (a·b, a·a, b·b). 32 elements/iter, single pass.

Cosine (AVX-512 VNNI): Same three-accumulator approach with VPDPWSSD. 64 elements/iter.
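The fused single-pass idea reads naturally in scalar form. Again a sketch, not the contents of `cosine_u8.rs` (names here are illustrative): one traversal of the input maintains the same three accumulators the SIMD path keeps in registers.

```rust
/// Fused single-pass u8 cosine distance, scalar model: compute
/// dot(a, b), ‖a‖², and ‖b‖² in a single traversal of the input.
/// u32 accumulators are safe up to ~66k elements of value 255,
/// comfortably covering 1024-dim vectors.
fn cosine_u8_scalar(a: &[u8], b: &[u8]) -> f32 {
    let (mut dot, mut norm_a, mut norm_b) = (0u32, 0u32, 0u32);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as u32, y as u32);
        dot += x * y;    // a·b
        norm_a += x * x; // ‖a‖²
        norm_b += y * y; // ‖b‖²
    }
    // Cosine distance = 1 - cosine similarity.
    1.0 - dot as f32 / (norm_a as f32 * norm_b as f32).sqrt()
}

fn main() {
    // Identical vectors: distance 0; orthogonal vectors: distance 1.
    assert_eq!(cosine_u8_scalar(&[1, 2, 3], &[1, 2, 3]), 0.0);
    assert_eq!(cosine_u8_scalar(&[1, 0], &[0, 1]), 1.0);
    println!("ok");
}
```

A multi-pass implementation reads each vector two or three times (once per reduction); fusing the three reductions is what halves memory traffic for inputs that exceed cache.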

Both kernels use OnceLock-based runtime CPU dispatch, falling back to portable scalar on non-x86 platforms.
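That dispatch pattern can be sketched as follows. The names are illustrative, and where a real build would return a hand-written AVX2 or AVX-512 kernel, this sketch just reuses the scalar loop:

```rust
use std::sync::OnceLock;

type L2Fn = fn(&[u8], &[u8]) -> u32;

/// Portable scalar fallback, used on non-x86 targets (and standing in
/// for the AVX2/AVX-512 kernels in this sketch).
fn l2_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as u32
        })
        .sum()
}

/// Resolve the best kernel exactly once; later calls reuse the cached
/// function pointer with no re-detection cost.
fn l2_dispatch() -> L2Fn {
    static KERNEL: OnceLock<L2Fn> = OnceLock::new();
    *KERNEL.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx512vnni") || is_x86_feature_detected!("avx2") {
                // A real build would return the SIMD kernel here.
                return l2_scalar;
            }
        }
        l2_scalar // portable fallback on non-x86 platforms
    })
}

fn main() {
    let l2 = l2_dispatch();
    assert_eq!(l2(&[1, 2], &[4, 6]), 9 + 16);
    println!("ok");
}
```

The `#[cfg(target_arch = "x86_64")]` guard means the feature-detection macro is compiled only on x86-64, so the same source builds cleanly on aarch64 and falls straight through to the scalar path.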

Benchmarks

1M × 1024-dim u8 vectors.

x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)

| Kernel     | Scalar   | SIMD    | Speedup |
|------------|----------|---------|---------|
| L2(u8)     | 73.5 ms  | 58.2 ms | 1.26x   |
| Cosine(u8) | 122.2 ms | 82.1 ms | 1.49x   |

L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than that path.

aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)

| Kernel     | Scalar  | SIMD (dispatch) |
|------------|---------|-----------------|
| L2(u8)     | 26.8 ms | 27.3 ms         |
| Cosine(u8) | 90.1 ms | 90.4 ms         |

On aarch64 the SIMD path falls through to scalar (no AVX2), so the timings match within measurement noise, confirming no regression on non-x86 platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+) should see larger gains.

Test plan

  • All 11 new tests pass: SIMD backends verified against scalar reference across 18 vector sizes (0–4097), boundary values (0/255), alternating patterns, random seeds
  • All 63 existing lance-linalg tests pass (no regressions)
  • Clippy clean, fmt clean
  • Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500) — L2 1.26x, Cosine 1.49x faster
  • Verify on AVX-512 VNNI system for additional speedup data
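The verification strategy above can be illustrated by cross-checking two independent scalar formulations of the same distance across many sizes. The sizes and fill patterns below are illustrative, not the PR's exact test vectors:

```rust
/// Straightforward widened-subtract reference.
fn l2_reference(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as u32
        })
        .sum()
}

/// The saturating-subtraction formulation the SIMD kernels rely on.
fn l2_satsub(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = (x.saturating_sub(y) | y.saturating_sub(x)) as u32;
            d * d
        })
        .sum()
}

fn main() {
    // Sizes straddling the 32- and 64-element SIMD strides, plus
    // boundary values 0 and 255 via the wrapping fill patterns.
    for n in [0usize, 1, 31, 32, 33, 63, 64, 65, 1024, 4097] {
        let a: Vec<u8> = (0..n).map(|i| (i * 37 % 256) as u8).collect();
        let b: Vec<u8> = (0..n).map(|i| 255 - (i * 11 % 256) as u8).collect();
        assert_eq!(l2_reference(&a, &b), l2_satsub(&a, &b));
    }
    println!("all sizes agree");
}
```

Sizes that straddle the vector stride (e.g. 33, 65, 4097) exercise the remainder loop, which is where SIMD kernels most often diverge from their scalar reference.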

🤖 Generated with Claude Code

Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2
distance (Σ(a-b)²) and fused single-pass u8 cosine distance.

L2 uses saturating subtraction for absolute difference, then VPMADDWD
to square and accumulate. Cosine maintains three accumulators
(dot_ab, norm_a², norm_b²) in a single pass, halving memory traffic.

Both kernels use runtime CPU detection with OnceLock dispatch,
falling back to portable scalar on non-x86 platforms.

Algorithmic approach adapted from NumKong (github.com/ashvardanian/NumKong).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Newer rustc versions changed the expected pointer type from *const i32
to *const __m512i for AVX-512 load intrinsics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@BubbleCal BubbleCal left a comment


Followup: switch to the new kernels for SQ

@codecov

codecov Bot commented Apr 15, 2026

@BubbleCal BubbleCal merged commit 5c83b84 into lance-format:main Apr 15, 2026
28 checks passed
justinrmiller added a commit to justinrmiller/lance that referenced this pull request Apr 16, 2026
Follow-up from lance-format#6517. `SQDistCalculator::distance()` and
`distance_all()` were still calling `l2_distance_uint_scalar` on the
`L2 | Cosine` path while `Dot` already dispatched through `dot_u8`
(via trait). Switch L2/Cosine to the SIMD `l2_u8` kernel so x86 users
get the 1.26–1.5× AVX2 / AVX-512 VNNI speedup that lance-format#6517 already added.

On aarch64 the kernel falls through to its scalar path, which is the
same loop as `l2_distance_uint_scalar`, so Apple Silicon performance
is unchanged (bench diff within noise, ±2%).

The `L2 | Cosine` grouping preserves existing SQ behaviour — SQ has
treated Cosine as L2 on the quantized codes for as long as this code
has existed; that's orthogonal to the kernel wiring and can be
revisited separately if the semantics need correcting.
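The wiring this follow-up describes can be sketched with local stand-ins. `DistanceType`, `l2_u8`, and `dot_u8` below are hypothetical simplifications for illustration, not Lance's actual types or kernel signatures:

```rust
#[derive(Clone, Copy)]
enum DistanceType {
    L2,
    Cosine,
    Dot,
}

// Stand-ins for the real kernels; the actual l2_u8/dot_u8 dispatch to
// AVX2/AVX-512 backends on x86 and scalar loops elsewhere.
fn l2_u8(a: &[u8], b: &[u8]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as f32
        })
        .sum()
}

fn dot_u8(a: &[u8], b: &[u8]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| (x as u32 * y as u32) as f32).sum()
}

fn sq_distance(dt: DistanceType, query: &[u8], code: &[u8]) -> f32 {
    match dt {
        // SQ treats Cosine as L2 on the quantized codes, so both arms
        // route through the SIMD-backed L2 kernel.
        DistanceType::L2 | DistanceType::Cosine => l2_u8(query, code),
        DistanceType::Dot => dot_u8(query, code),
    }
}

fn main() {
    assert_eq!(sq_distance(DistanceType::L2, &[0, 0], &[1, 1]), 2.0);
    assert_eq!(sq_distance(DistanceType::Cosine, &[1, 2], &[3, 5]), 13.0);
    assert_eq!(sq_distance(DistanceType::Dot, &[1, 2], &[3, 5]), 13.0);
    println!("ok");
}
```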
