
perf: add SIMD-accelerated u8 L2 and cosine distance kernels#6517

Merged
BubbleCal merged 3 commits into lance-format:main from justinrmiller:feat/u8-l2-cosine-simd on Apr 15, 2026

Conversation

@justinrmiller
Contributor

@justinrmiller justinrmiller commented Apr 14, 2026

Summary

  • Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2 distance (Σ(a-b)²) in new l2_u8.rs
  • Add fused single-pass u8 cosine distance kernel in new cosine_u8.rs — computes dot(a,b), ‖a‖², ‖b‖² simultaneously, halving memory traffic vs the previous 2-3 pass approach
  • Wire both into the L2 for u8 and Cosine for u8 trait impls
  • Add benchmarks comparing scalar vs SIMD for both kernels

Algorithmic approach (adapted from NumKong)

L2 (AVX2): Saturating subtraction for |a-b|, zero-extend u8→i16, VPMADDWD(diff, diff) to square and accumulate into i32. 32 elements/iter.

L2 (AVX-512 VNNI): Same abs-diff approach with VPDPWSSD for fused square-accumulate. 64 elements/iter.
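The abs-diff trick described above can be modeled in scalar Rust. This is an illustrative sketch only, not the code in `l2_u8.rs`; the real kernels process 32 or 64 lanes per iteration with AVX intrinsics, and the function names here are made up:

```rust
/// Scalar model of the SIMD kernels' |a - b| trick: two saturating
/// subtractions, mirroring what VPSUBUSB gives the vector path.
/// Exactly one of the two terms is non-zero (both zero when a == b).
fn abs_diff_u8(a: u8, b: u8) -> u8 {
    a.saturating_sub(b) | b.saturating_sub(a)
}

/// Squared L2 over u8 slices: widen |a - b|, square, and accumulate
/// into a 32-bit sum, matching what VPMADDWD does 16 lanes at a time.
fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = abs_diff_u8(x, y) as u32;
            d * d
        })
        .sum()
}

fn main() {
    assert_eq!(abs_diff_u8(3, 200), 197);
    assert_eq!(abs_diff_u8(200, 3), 197);
    // (0-255)² + (255-0)² + (10-7)² = 65025 + 65025 + 9
    assert_eq!(l2_u8_scalar(&[0, 255, 10], &[255, 0, 7]), 130_059);
    println!("ok");
}
```

The saturating form matters because x86 has no unsigned byte subtraction with sign extension; computing |a - b| directly in u8 avoids widening before the subtract.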

Cosine (AVX2): Zero-extend both vectors to i16, triple VPMADDWD per half (a·b, a·a, b·b). 32 elements/iter, single pass.

Cosine (AVX-512 VNNI): Same three-accumulator approach with VPDPWSSD. 64 elements/iter.
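The fused single-pass idea reads naturally in scalar form. Again a sketch, not the contents of `cosine_u8.rs` (names here are illustrative): one traversal of the input maintains the same three accumulators the SIMD path keeps in registers.

```rust
/// Fused single-pass u8 cosine distance, scalar model: compute
/// dot(a, b), ‖a‖², and ‖b‖² in a single traversal of the input.
/// u32 accumulators are safe up to ~66k elements of value 255,
/// comfortably covering 1024-dim vectors.
fn cosine_u8_scalar(a: &[u8], b: &[u8]) -> f32 {
    let (mut dot, mut norm_a, mut norm_b) = (0u32, 0u32, 0u32);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as u32, y as u32);
        dot += x * y;    // a·b
        norm_a += x * x; // ‖a‖²
        norm_b += y * y; // ‖b‖²
    }
    // Cosine distance = 1 - cosine similarity.
    1.0 - dot as f32 / (norm_a as f32 * norm_b as f32).sqrt()
}

fn main() {
    // Identical vectors: distance 0; orthogonal vectors: distance 1.
    assert_eq!(cosine_u8_scalar(&[1, 2, 3], &[1, 2, 3]), 0.0);
    assert_eq!(cosine_u8_scalar(&[1, 0], &[0, 1]), 1.0);
    println!("ok");
}
```

A multi-pass implementation reads each vector two or three times (once per reduction); fusing the three reductions is what halves memory traffic for inputs that exceed cache.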

Both kernels use OnceLock-based runtime CPU dispatch, falling back to portable scalar on non-x86 platforms.
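That dispatch pattern can be sketched as follows. The names are illustrative, and where a real build would return a hand-written AVX2 or AVX-512 kernel, this sketch just reuses the scalar loop:

```rust
use std::sync::OnceLock;

type L2Fn = fn(&[u8], &[u8]) -> u32;

/// Portable scalar fallback, used on non-x86 targets (and standing in
/// for the AVX2/AVX-512 kernels in this sketch).
fn l2_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as u32
        })
        .sum()
}

/// Resolve the best kernel exactly once; later calls reuse the cached
/// function pointer with no re-detection cost.
fn l2_dispatch() -> L2Fn {
    static KERNEL: OnceLock<L2Fn> = OnceLock::new();
    *KERNEL.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx512vnni") || is_x86_feature_detected!("avx2") {
                // A real build would return the SIMD kernel here.
                return l2_scalar;
            }
        }
        l2_scalar // portable fallback on non-x86 platforms
    })
}

fn main() {
    let l2 = l2_dispatch();
    assert_eq!(l2(&[1, 2], &[4, 6]), 9 + 16);
    println!("ok");
}
```

The `#[cfg(target_arch = "x86_64")]` guard means the feature-detection macro is compiled only on x86-64, so the same source builds cleanly on aarch64 and falls straight through to the scalar path.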

Benchmarks

1M × 1024-dim u8 vectors.

x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)

| Kernel     | Scalar   | SIMD    | Speedup |
|------------|----------|---------|---------|
| L2(u8)     | 73.5 ms  | 58.2 ms | 1.26x   |
| Cosine(u8) | 122.2 ms | 82.1 ms | 1.49x   |

L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than that path.

aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)

| Kernel     | Scalar  | SIMD (dispatch) |
|------------|---------|-----------------|
| L2(u8)     | 26.8 ms | 27.3 ms         |
| Cosine(u8) | 90.1 ms | 90.4 ms         |

On aarch64 the SIMD path falls through to scalar (no AVX2), so the timings match within measurement noise, confirming no regression on non-x86 platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+) should see larger gains.

Test plan

  • All 11 new tests pass: SIMD backends verified against scalar reference across 18 vector sizes (0–4097), boundary values (0/255), alternating patterns, random seeds
  • All 63 existing lance-linalg tests pass (no regressions)
  • Clippy clean, fmt clean
  • Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500) — L2 1.26x, Cosine 1.49x faster
  • Verify on AVX-512 VNNI system for additional speedup data
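The verification strategy above can be illustrated by cross-checking two independent scalar formulations of the same distance across many sizes. The sizes and fill patterns below are illustrative, not the PR's exact test vectors:

```rust
/// Straightforward widened-subtract reference.
fn l2_reference(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as u32
        })
        .sum()
}

/// The saturating-subtraction formulation the SIMD kernels rely on.
fn l2_satsub(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = (x.saturating_sub(y) | y.saturating_sub(x)) as u32;
            d * d
        })
        .sum()
}

fn main() {
    // Sizes straddling the 32- and 64-element SIMD strides, plus
    // boundary values 0 and 255 via the wrapping fill patterns.
    for n in [0usize, 1, 31, 32, 33, 63, 64, 65, 1024, 4097] {
        let a: Vec<u8> = (0..n).map(|i| (i * 37 % 256) as u8).collect();
        let b: Vec<u8> = (0..n).map(|i| 255 - (i * 11 % 256) as u8).collect();
        assert_eq!(l2_reference(&a, &b), l2_satsub(&a, &b));
    }
    println!("all sizes agree");
}
```

Sizes that straddle the vector stride (e.g. 33, 65, 4097) exercise the remainder loop, which is where SIMD kernels most often diverge from their scalar reference.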

🤖 Generated with Claude Code

Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2
distance (Σ(a-b)²) and fused single-pass u8 cosine distance.

L2 uses saturating subtraction for absolute difference, then VPMADDWD
to square and accumulate. Cosine maintains three accumulators
(dot_ab, norm_a², norm_b²) in a single pass, halving memory traffic.

Both kernels use runtime CPU detection with OnceLock dispatch,
falling back to portable scalar on non-x86 platforms.

Algorithmic approach adapted from NumKong (github.com/ashvardanian/NumKong).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Newer rustc versions changed the expected pointer type from *const i32
to *const __m512i for AVX-512 load intrinsics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@BubbleCal BubbleCal left a comment


Followup: switch to the new kernels for SQ

@codecov

codecov Bot commented Apr 15, 2026

@BubbleCal BubbleCal merged commit 5c83b84 into lance-format:main Apr 15, 2026
28 checks passed
justinrmiller added a commit to justinrmiller/lance that referenced this pull request Apr 16, 2026
Follow-up from lance-format#6517. `SQDistCalculator::distance()` and
`distance_all()` were still calling `l2_distance_uint_scalar` on the
`L2 | Cosine` path while `Dot` already dispatched through `dot_u8`
(via trait). Switch L2/Cosine to the SIMD `l2_u8` kernel so x86 users
get the 1.26–1.5× AVX2 / AVX-512 VNNI speedup that lance-format#6517 already added.

On aarch64 the kernel falls through to its scalar path, which is the
same loop as `l2_distance_uint_scalar`, so Apple Silicon performance
is unchanged (bench diff within noise, ±2%).

The `L2 | Cosine` grouping preserves existing SQ behaviour — SQ has
treated Cosine as L2 on the quantized codes for as long as this code
has existed; that's orthogonal to the kernel wiring and can be
revisited separately if the semantics need correcting.
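The wiring this follow-up describes can be sketched with local stand-ins. `DistanceType`, `l2_u8`, and `dot_u8` below are hypothetical simplifications for illustration, not Lance's actual types or kernel signatures:

```rust
#[derive(Clone, Copy)]
enum DistanceType {
    L2,
    Cosine,
    Dot,
}

// Stand-ins for the real kernels; the actual l2_u8/dot_u8 dispatch to
// AVX2/AVX-512 backends on x86 and scalar loops elsewhere.
fn l2_u8(a: &[u8], b: &[u8]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as f32
        })
        .sum()
}

fn dot_u8(a: &[u8], b: &[u8]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| (x as u32 * y as u32) as f32).sum()
}

fn sq_distance(dt: DistanceType, query: &[u8], code: &[u8]) -> f32 {
    match dt {
        // SQ treats Cosine as L2 on the quantized codes, so both arms
        // route through the SIMD-backed L2 kernel.
        DistanceType::L2 | DistanceType::Cosine => l2_u8(query, code),
        DistanceType::Dot => dot_u8(query, code),
    }
}

fn main() {
    assert_eq!(sq_distance(DistanceType::L2, &[0, 0], &[1, 1]), 2.0);
    assert_eq!(sq_distance(DistanceType::Cosine, &[1, 2], &[3, 5]), 13.0);
    assert_eq!(sq_distance(DistanceType::Dot, &[1, 2], &[3, 5]), 13.0);
    println!("ok");
}
```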
