perf: route SQ L2/Cosine distance through SIMD u8 kernel#6550
Open
justinrmiller wants to merge 3 commits intolance-format:mainfrom
Follow-up from lance-format#6517. `SQDistCalculator::distance()` and `distance_all()` were still calling `l2_distance_uint_scalar` on the `L2 | Cosine` path while `Dot` already dispatched through `dot_u8` (via trait). Switch L2/Cosine to the SIMD `l2_u8` kernel so x86 users get the 1.26–1.5× AVX2 / AVX-512 VNNI speedup that lance-format#6517 already added. On aarch64 the kernel falls through to its scalar path, which is the same loop as `l2_distance_uint_scalar`, so Apple Silicon performance is unchanged (bench diff within noise, ±2%). The `L2 | Cosine` grouping preserves existing SQ behaviour — SQ has treated Cosine as L2 on the quantized codes for as long as this code has existed; that's orthogonal to the kernel wiring and can be revisited separately if the semantics need correcting.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
Follow-up from #6517, per BubbleCal's review comment ("Followup: switch to the new kernels for SQ").
`SQDistCalculator::distance()` and `distance_all()` were still calling `l2_distance_uint_scalar` on the `L2 | Cosine` path while `Dot` already dispatched through `dot_u8` via the `Dot` trait. This PR switches L2/Cosine to the `l2_u8` SIMD kernel so x86 users actually get the AVX2 / AVX-512 VNNI backends that #6517 already shipped.
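For readers who have not looked at the SQ code, the shape of the dispatch described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `DistanceType`, `l2_u8_reference`, `dot_u8_reference` and `sq_distance` are hypothetical stand-ins, the kernels are plain scalar loops, and the `Dot` arm returns the raw dot product without any post-processing the real calculator may apply.

```rust
/// Hypothetical stand-in for the distance types discussed in this PR.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DistanceType {
    L2,
    Cosine,
    Dot,
}

/// Scalar reference: squared L2 distance between two u8 code vectors.
/// (Illustrative only; the real `l2_u8` kernel has SIMD backends.)
pub fn l2_u8_reference(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "code vectors must have equal length");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            // Widen before subtracting so the difference cannot wrap.
            let d = i32::from(x) - i32::from(y);
            (d * d) as u32
        })
        .sum()
}

/// Scalar reference: dot product of two u8 code vectors.
pub fn dot_u8_reference(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "code vectors must have equal length");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| u32::from(x) * u32::from(y))
        .sum()
}

/// Hypothetical dispatch: L2 and Cosine share the L2 kernel on the
/// quantized codes, while Dot goes through the dot-product kernel.
pub fn sq_distance(dt: DistanceType, query: &[u8], code: &[u8]) -> f32 {
    match dt {
        DistanceType::L2 | DistanceType::Cosine => l2_u8_reference(query, code) as f32,
        DistanceType::Dot => dot_u8_reference(query, code) as f32,
    }
}
```

In this sketch the change in the PR corresponds to swapping the function called in the `L2 | Cosine` arm; the match structure itself stays the same.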
Benchmark
`cargo bench -p lance-index --bench sq` on Linux x86 (AVX2):
All changes are statistically significant (p < 0.05). The end-to-end gain is smaller than the raw `l2_u8` kernel speedup (1.26× on Ryzen per #6517) because this bench is dominated by chunk-lookup overhead (`binary_search`, `row_id` resolution, RNG, function-pointer dispatch) — distance computation is a minority of per-call cost.
On aarch64 the kernel falls through to its scalar path (same loop as `l2_distance_uint_scalar`), so Apple Silicon shows noise (±2%).
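The fall-through behaviour can be pictured with a minimal runtime-dispatch sketch. It assumes a generic pattern (check for AVX2 at runtime on x86_64, otherwise use the scalar loop) and is not the actual `l2_u8` implementation: all function names below are hypothetical, and the "AVX2" path simply recompiles the same loop with the feature enabled rather than using hand-written intrinsics or VNNI.

```rust
/// Shared loop body: squared L2 distance over u8 codes.
#[inline]
fn l2_u8_loop(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = i32::from(x) - i32::from(y);
            (d * d) as u32
        })
        .sum()
}

/// Scalar path, used on every target that has no specialised backend
/// (for example aarch64 in this sketch).
pub fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "code vectors must have equal length");
    l2_u8_loop(a, b)
}

/// Hypothetical AVX2 path: the same loop, compiled with AVX2 enabled so
/// the compiler is free to vectorise it with wider registers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    l2_u8_loop(a, b)
}

/// Hypothetical dispatcher: pick the AVX2 path when the CPU supports it,
/// otherwise fall through to the scalar loop.
pub fn l2_u8_dispatch(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "code vectors must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was verified at runtime just above.
            return unsafe { l2_u8_avx2(a, b) };
        }
    }
    l2_u8_scalar(a, b)
}
```

On a non-x86_64 target the `cfg` block compiles away entirely, so the dispatcher reduces to the scalar call, which is why no change would be expected there.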
Scope note
This PR preserves the existing `L2 | Cosine` grouping — SQ has computed L2 on the quantized codes for both distance types as long as this code has existed. Changing that behaviour is orthogonal to the kernel wiring and can be revisited separately.
Test plan