perf: add SIMD-accelerated u8 L2 and cosine distance kernels #6517
Merged
BubbleCal merged 3 commits into lance-format:main on Apr 15, 2026
Conversation
Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2 distance (Σ(a-b)²) and fused single-pass u8 cosine distance. L2 uses saturating subtraction for absolute difference, then VPMADDWD to square and accumulate. Cosine maintains three accumulators (dot_ab, norm_a², norm_b²) in a single pass, halving memory traffic. Both kernels use runtime CPU detection with OnceLock dispatch, falling back to portable scalar on non-x86 platforms. Algorithmic approach adapted from NumKong (github.com/ashvardanian/NumKong). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Newer rustc versions changed the expected pointer type from *const i32 to *const __m512i for AVX-512 load intrinsics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
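As a portable reference for the two kernels described above, here is a scalar sketch. Function names are illustrative stand-ins; the real backends use AVX2/AVX-512 intrinsics and only fall back to loops like these off x86.

```rust
// Portable scalar sketch of the two kernels described above. Names are
// illustrative; the real kernels dispatch to AVX2/AVX-512 VNNI backends.

/// Squared L2 distance Σ(a-b)² over u8 slices. The absolute difference is
/// built from two saturating subtractions (one of the two terms is always
/// zero), mirroring the saturating-subtract step in the AVX2 kernel.
fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = (x.saturating_sub(y) + y.saturating_sub(x)) as u32;
            d * d
        })
        .sum()
}

/// Fused single-pass cosine distance: one loop carries all three
/// accumulators (dot_ab, norm_a², norm_b²), halving memory traffic
/// versus separate dot and norm passes. Assumes non-zero vectors.
fn cosine_u8_scalar(a: &[u8], b: &[u8]) -> f32 {
    let (mut dot, mut na, mut nb) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as u64 * y as u64;
        na += x as u64 * x as u64;
        nb += y as u64 * y as u64;
    }
    1.0 - dot as f32 / ((na as f32).sqrt() * (nb as f32).sqrt())
}
```

The saturating-subtraction identity `|a-b| == a.saturating_sub(b) + b.saturating_sub(a)` is what lets the SIMD version compute absolute differences on unsigned bytes without widening first.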
BubbleCal approved these changes Apr 15, 2026
BubbleCal (Contributor) left a comment
Followup: switch to the new kernels for SQ
Codecov Report
justinrmiller added a commit to justinrmiller/lance that referenced this pull request Apr 16, 2026
Follow-up from lance-format#6517. `SQDistCalculator::distance()` and `distance_all()` were still calling `l2_distance_uint_scalar` on the `L2 | Cosine` path while `Dot` already dispatched through `dot_u8` (via trait). Switch L2/Cosine to the SIMD `l2_u8` kernel so x86 users get the 1.26–1.5× AVX2 / AVX-512 VNNI speedup that lance-format#6517 already added. On aarch64 the kernel falls through to its scalar path, which is the same loop as `l2_distance_uint_scalar`, so Apple Silicon performance is unchanged (bench diff within noise, ±2%). The `L2 | Cosine` grouping preserves existing SQ behaviour — SQ has treated Cosine as L2 on the quantized codes for as long as this code has existed; that's orthogonal to the kernel wiring and can be revisited separately if the semantics need correcting.
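The wiring change in that commit can be sketched as follows. The `DistanceType` enum and the kernel bodies here are stand-ins for illustration; only the shape of the match mirrors the commit, and the real `l2_u8`/`dot_u8` select SIMD backends at runtime.

```rust
// Hypothetical sketch of the dispatch change described above.
#[derive(Clone, Copy)]
enum DistanceType {
    L2,
    Cosine,
    Dot,
}

// Stand-in for the SIMD-dispatching squared-L2 kernel (scalar body here).
fn l2_u8(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.abs_diff(y) as u32;
            d * d
        })
        .sum()
}

// Stand-in for the SIMD-dispatching dot-product kernel.
fn dot_u8(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| x as u32 * y as u32).sum()
}

fn sq_distance(metric: DistanceType, a: &[u8], b: &[u8]) -> u32 {
    use DistanceType::*;
    match metric {
        // Previously this arm called the scalar helper
        // (l2_distance_uint_scalar) directly; now it goes through the
        // dispatching kernel. SQ treats Cosine as L2 on the quantized codes.
        L2 | Cosine => l2_u8(a, b),
        Dot => dot_u8(a, b),
    }
}
```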
Summary
- SIMD u8 squared L2 (Σ(a-b)²) in new `l2_u8.rs`
- Fused single-pass u8 cosine in new `cosine_u8.rs`: computes `dot(a,b)`, `‖a‖²`, and `‖b‖²` simultaneously, halving memory traffic vs the previous 2-3 pass approach
- `L2 for u8` and `Cosine for u8` trait impls

Algorithmic approach (adapted from NumKong)

- L2 (AVX2): saturating subtraction for `|a-b|`, zero-extend u8→i16, `VPMADDWD(diff, diff)` to square and accumulate into i32. 32 elements/iter.
- L2 (AVX-512 VNNI): same abs-diff approach with `VPDPWSSD` for fused square-accumulate. 64 elements/iter.
- Cosine (AVX2): zero-extend both vectors to i16, triple `VPMADDWD` per half (`a·b`, `a·a`, `b·b`). 32 elements/iter, single pass.
- Cosine (AVX-512 VNNI): same three-accumulator approach with `VPDPWSSD`. 64 elements/iter.

Both kernels use `OnceLock`-based runtime CPU dispatch, falling back to portable scalar on non-x86 platforms.

Benchmarks
1M × 1024-dim u8 vectors.
x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)
L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than that path.
aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)
On aarch64 the SIMD path falls through to scalar (no AVX2), so times are identical — confirms no regression on non-x86 platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+) should see larger gains.
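The scalar-fallback behaviour seen on aarch64 follows from the dispatch scheme. A minimal sketch of the `OnceLock` pattern, with the scalar kernel standing in for the SIMD backends (names illustrative):

```rust
use std::sync::OnceLock;

// Sketch of OnceLock-based runtime dispatch: detect CPU features once,
// cache a function pointer, and call through it on every distance call.
type L2Fn = fn(&[u8], &[u8]) -> u32;

fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.abs_diff(y) as u32;
            d * d
        })
        .sum()
}

fn l2_u8(a: &[u8], b: &[u8]) -> u32 {
    static KERNEL: OnceLock<L2Fn> = OnceLock::new();
    let f = KERNEL.get_or_init(|| -> L2Fn {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // The real code would return the AVX2 (or AVX-512 VNNI)
                // backend here; the scalar kernel stands in for this sketch.
                return l2_u8_scalar;
            }
        }
        // Non-x86 platforms (and x86 without AVX2) fall through to scalar,
        // which is why aarch64 timings match the pre-SIMD baseline.
        l2_u8_scalar
    });
    f(a, b)
}
```

Feature detection runs only on the first call; every subsequent call is a plain indirect call through the cached pointer.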
Test plan
🤖 Generated with Claude Code