perf: route SQ L2/Cosine distance through SIMD u8 kernel #6550

Open
justinrmiller wants to merge 3 commits into lance-format:main from justinrmiller:jm/sq-use-u8-kernels

@justinrmiller
Contributor

Summary

Follow-up from #6517, per BubbleCal's review comment ("Followup: switch to the new kernels for SQ").

`SQDistCalculator::distance()` and `distance_all()` were still calling `l2_distance_uint_scalar` on the `L2 | Cosine` path while `Dot` already dispatched through `dot_u8` via the `Dot` trait. This PR switches L2/Cosine to the `l2_u8` SIMD kernel so x86 users actually get the AVX2 / AVX-512 VNNI backends that #6517 already shipped.
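
As a rough illustration of the dispatch described above, the sketch below groups `L2 | Cosine` onto one u8 kernel and sends `Dot` to another. Everything here is hypothetical: the `Metric` enum, the function names, and the `u32` return type are assumptions made for the example, not Lance's actual types or signatures, and the scalar loops only stand in for the real `l2_u8` / `dot_u8` kernels.

```rust
/// Hypothetical distance-type enum for this sketch (not Lance's own type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Metric {
    L2,
    Cosine,
    Dot,
}

/// Scalar stand-in for an L2 kernel on u8 codes: the sum of squared
/// differences (assumed here to be returned without a square root).
pub fn l2_u8_reference(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "code vectors must have equal length");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            // Widen before subtracting so the difference cannot underflow.
            let d = x as i32 - y as i32;
            (d * d) as u32
        })
        .sum()
}

/// Scalar stand-in for a dot-product kernel on u8 codes.
pub fn dot_u8_reference(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "code vectors must have equal length");
    a.iter().zip(b).map(|(&x, &y)| x as u32 * y as u32).sum()
}

/// Dispatch by metric: L2 and Cosine share the same code-space L2 kernel,
/// mirroring the `L2 | Cosine` grouping; Dot goes to the dot kernel.
pub fn code_distance(metric: Metric, a: &[u8], b: &[u8]) -> u32 {
    match metric {
        Metric::L2 | Metric::Cosine => l2_u8_reference(a, b),
        Metric::Dot => dot_u8_reference(a, b),
    }
}
```

With this shape, swapping the scalar loop for a SIMD kernel only touches the `L2 | Cosine` arm, for example `code_distance(Metric::Cosine, &query_code, &stored_code)` keeps the same call site.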

Benchmark

`cargo bench -p lance-index --bench sq` on Linux x86 (AVX2):

| chunks × 10K | before | after | change |
|---|---|---|---|
| 1 | 315.3 ns | 304.8 ns | −3.4% |
| 32 | 390.1 ns | 363.1 ns | −6.7% |
| 128 | 499.7 ns | 482.2 ns | −3.7% |
| 1024 | 531.3 ns | 515.1 ns | −3.1% |

All changes are statistically significant (p < 0.05). The end-to-end gain is smaller than the raw `l2_u8` kernel speedup (1.26× on Ryzen per #6517) because this bench is dominated by chunk-lookup overhead (`binary_search`, `row_id` resolution, RNG, function-pointer dispatch) — distance computation is a minority of per-call cost.
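
A back-of-the-envelope check of that explanation, using Amdahl's law: if a fraction `f` of per-call time is spent in the kernel and the kernel gets `s` times faster, the end-to-end time drops by `f * (1 - 1/s)`. The sketch below inverts that relation. It assumes the benchmark machine sees the same 1.26× kernel speedup quoted for Ryzen and that the non-kernel cost is unchanged; neither is measured here, and the function name is made up for the example.

```rust
/// Kernel share of per-call time implied by an observed end-to-end
/// reduction, assuming only the kernel got faster (Amdahl's law):
///   reduction = f * (1 - 1/s)   =>   f = reduction / (1 - 1/s)
pub fn implied_kernel_fraction(reduction: f64, kernel_speedup: f64) -> f64 {
    assert!(kernel_speedup > 1.0, "speedup must exceed 1.0");
    reduction / (1.0 - 1.0 / kernel_speedup)
}
```

Under those assumptions, `implied_kernel_fraction(0.031, 1.26)` and `implied_kernel_fraction(0.067, 1.26)` give roughly 0.15 and 0.32, so the distance kernel would account for about 15% to 32% of each call, which is consistent with it being a minority of the cost.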

On aarch64 the kernel falls through to its scalar path (the same loop as `l2_distance_uint_scalar`), so Apple Silicon performance is unchanged; the bench diff is within noise (±2%).
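
The fall-through can be pictured with a minimal runtime-selection sketch. This is not the kernel's actual dispatch code: the names are hypothetical, and the "AVX2" function is a placeholder that reuses the scalar loop so the example stays portable. Only `std::is_x86_feature_detected!` is a real standard-library macro.

```rust
/// Function-pointer type for an L2 kernel over u8 codes (hypothetical).
pub type L2Kernel = fn(&[u8], &[u8]) -> u32;

/// Portable scalar loop: sum of squared differences.
fn l2_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            (d * d) as u32
        })
        .sum()
}

/// Placeholder for a SIMD backend. A real one would use AVX2 intrinsics;
/// this one just calls the scalar loop so the sketch runs anywhere.
#[allow(dead_code)]
fn l2_avx2_placeholder(a: &[u8], b: &[u8]) -> u32 {
    l2_scalar(a, b)
}

/// Pick a backend at runtime. On non-x86 targets (e.g. aarch64) the cfg
/// block is compiled out, so selection always lands on the scalar path.
pub fn select_l2_kernel() -> (&'static str, L2Kernel) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return ("avx2", l2_avx2_placeholder as L2Kernel);
        }
    }
    ("scalar", l2_scalar as L2Kernel)
}
```

Calling `select_l2_kernel()` once and reusing the returned function pointer avoids repeating the feature check on every distance call.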

Scope note

This PR preserves the existing `L2 | Cosine` grouping — SQ has computed L2 on the quantized codes for both distance types as long as this code has existed. Changing that behaviour is orthogonal to the kernel wiring and can be revisited separately.

Test plan

  • `cargo test -p lance-index --lib -- vector::sq` — all 5 tests pass
  • `cargo clippy -p lance-index --tests --benches -- -D warnings` clean
  • Benchmarked on Linux x86 (AVX2) — −3.1% to −6.7% across chunk sizes
  • Benchmarked on aarch64 (Apple Silicon) — within noise, as expected

Follow-up from lance-format#6517. `SQDistCalculator::distance()` and
`distance_all()` were still calling `l2_distance_uint_scalar` on the
`L2 | Cosine` path while `Dot` already dispatched through `dot_u8`
(via trait). Switch L2/Cosine to the SIMD `l2_u8` kernel so x86 users
get the 1.26–1.5× AVX2 / AVX-512 VNNI speedup that lance-format#6517 already added.

On aarch64 the kernel falls through to its scalar path, which is the
same loop as `l2_distance_uint_scalar`, so Apple Silicon performance
is unchanged (bench diff within noise, ±2%).

The `L2 | Cosine` grouping preserves existing SQ behaviour — SQ has
treated Cosine as L2 on the quantized codes for as long as this code
has existed; that's orthogonal to the kernel wiring and can be
revisited separately if the semantics need correcting.
@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
