perf: route SQ L2/Cosine distance through SIMD u8 kernel by justinrmiller · Pull Request #6550 · lance-format/lance

justinrmiller · 2026-04-16T23:11:25Z

Summary

Follow-up from #6517, per BubbleCal's review comment ("Followup: switch to the new kernels for SQ").

`SQDistCalculator::distance()` and `distance_all()` were still calling `l2_distance_uint_scalar` on the `L2 | Cosine` path while `Dot` already dispatched through `dot_u8` via the `Dot` trait. This PR switches L2/Cosine to the `l2_u8` SIMD kernel so x86 users actually get the AVX2 / AVX-512 VNNI backends that #6517 already shipped.
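The routing described above can be sketched as a small dispatch function. This is an illustration only, not the code from this PR: `DistanceType`, `sq_distance`, and the scalar bodies given to `l2_u8` and `dot_u8` here are stand-ins, and the real lance signatures and return types may differ.

```rust
/// Hypothetical stand-in for the distance types the SQ calculator handles.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DistanceType {
    L2,
    Cosine,
    Dot,
}

/// Stand-in for the `l2_u8` kernel: squared L2 distance over quantized codes.
/// A scalar body is used here; the real kernel selects a SIMD backend.
pub fn l2_u8(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = u32::from(x.abs_diff(y));
            d * d
        })
        .sum()
}

/// Stand-in for the `dot_u8` kernel: dot product over quantized codes.
pub fn dot_u8(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| u32::from(x) * u32::from(y))
        .sum()
}

/// After the change, every distance type goes through a u8 kernel.
/// L2 and Cosine share one arm, mirroring the existing `L2 | Cosine` grouping.
pub fn sq_distance(dt: DistanceType, query: &[u8], code: &[u8]) -> u32 {
    match dt {
        DistanceType::L2 | DistanceType::Cosine => l2_u8(query, code),
        DistanceType::Dot => dot_u8(query, code),
    }
}
```

Before this PR, the first arm would have called the scalar helper directly; only the callee changes, not the grouping.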

Benchmark

`cargo bench -p lance-index --bench sq` on Linux x86 (AVX2):

| chunks × 10K | before | after | change |
|---|---|---|---|
| 1 | 315.3 ns | 304.8 ns | −3.4% |
| 32 | 390.1 ns | 363.1 ns | −6.7% |
| 128 | 499.7 ns | 482.2 ns | −3.7% |
| 1024 | 531.3 ns | 515.1 ns | −3.1% |

All changes are statistically significant (p < 0.05). The end-to-end gain is smaller than the raw `l2_u8` kernel speedup (1.26× on Ryzen per #6517) because this bench is dominated by chunk-lookup overhead (`binary_search`, `row_id` resolution, RNG, function-pointer dispatch) — distance computation is a minority of per-call cost.
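The gap between kernel speedup and end-to-end gain can be checked with Amdahl's law. If a fraction `f` of per-call time is spent in the kernel and the kernel gets `s` times faster, the per-call time shrinks by `r = f * (1 - 1/s)`, so `f = r / (1 - 1/s)`. The sketch below assumes the 1.26× figure applies to the machine used for this bench, which the PR does not state; `implied_kernel_fraction` is a hypothetical helper name.

```rust
/// Fraction of per-call time spent in the kernel, implied by Amdahl's law.
///
/// `kernel_speedup` is s (e.g. 1.26), `time_reduction` is r (e.g. 0.034 for -3.4%).
/// Derived from: new_time / old_time = (1 - f) + f / s.
pub fn implied_kernel_fraction(kernel_speedup: f64, time_reduction: f64) -> f64 {
    time_reduction / (1.0 - 1.0 / kernel_speedup)
}
```

Under that assumption, the measured reductions of 3.1% to 6.7% correspond to roughly 15% to 32% of per-call time in the distance kernel, which is consistent with distance computation being a minority of the cost.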

On aarch64 the kernel falls through to its scalar path (same loop as `l2_distance_uint_scalar`), so Apple Silicon results are within noise (±2%).
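The fall-through behaviour can be illustrated with a runtime-dispatched kernel. This is a minimal sketch under stated assumptions, not lance's implementation: `l2_u8_dispatch`, `l2_u8_scalar`, and `l2_u8_avx2` are hypothetical names, and the AVX2 body is one straightforward way to compute a squared u8 L2 distance with intrinsics.

```rust
/// Scalar reference: sum of squared differences over u8 codes.
pub fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = u32::from(x.abs_diff(y));
            d * d
        })
        .sum()
}

/// AVX2 path, 32 codes per iteration. The i32 lane accumulators are ample
/// for typical vector dimensions but could overflow on extremely long inputs.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    use std::arch::x86_64::*;
    unsafe {
        let n = a.len().min(b.len());
        let zero = _mm256_setzero_si256();
        let mut acc = _mm256_setzero_si256();
        let mut i = 0;
        while i + 32 <= n {
            let va = _mm256_loadu_si256(a.as_ptr().add(i) as *const __m256i);
            let vb = _mm256_loadu_si256(b.as_ptr().add(i) as *const __m256i);
            // |a - b| for unsigned bytes: one saturating difference is zero.
            let d = _mm256_or_si256(_mm256_subs_epu8(va, vb), _mm256_subs_epu8(vb, va));
            // Zero-extend bytes to 16 bits, then square and pairwise-add into i32.
            let lo = _mm256_unpacklo_epi8(d, zero);
            let hi = _mm256_unpackhi_epi8(d, zero);
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(lo, lo));
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(hi, hi));
            i += 32;
        }
        let mut lanes = [0i32; 8];
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
        let simd_sum: u32 = lanes.iter().map(|&x| x as u32).sum();
        // Remaining tail (fewer than 32 codes) goes through the scalar loop.
        simd_sum + l2_u8_scalar(&a[i..n], &b[i..n])
    }
}

/// Uses AVX2 when the CPU reports it; every other target, including
/// aarch64, falls through to the scalar loop.
pub fn l2_u8_dispatch(a: &[u8], b: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { l2_u8_avx2(a, b) };
        }
    }
    l2_u8_scalar(a, b)
}
```

On a target without a SIMD branch, `l2_u8_dispatch` compiles down to the scalar call, which is why no change is expected there.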

Scope note

This PR preserves the existing `L2 | Cosine` grouping — SQ has computed L2 on the quantized codes for both distance types as long as this code has existed. Changing that behaviour is orthogonal to the kernel wiring and can be revisited separately.

Test plan

- `cargo test -p lance-index --lib -- vector::sq`: all 5 tests pass
- `cargo clippy -p lance-index --tests --benches -- -D warnings`: clean
- Benchmarked on Linux x86 (AVX2): −3.1% to −6.7% across chunk sizes
- Benchmarked on aarch64 (Apple Silicon): within noise, as expected

Follow-up from lance-format#6517. `SQDistCalculator::distance()` and `distance_all()` were still calling `l2_distance_uint_scalar` on the `L2 | Cosine` path while `Dot` already dispatched through `dot_u8` (via trait). Switch L2/Cosine to the SIMD `l2_u8` kernel so x86 users get the 1.26–1.5× AVX2 / AVX-512 VNNI speedup that lance-format#6517 already added. On aarch64 the kernel falls through to its scalar path, which is the same loop as `l2_distance_uint_scalar`, so Apple Silicon performance is unchanged (bench diff within noise, ±2%). The `L2 | Cosine` grouping preserves existing SQ behaviour — SQ has treated Cosine as L2 on the quantized codes for as long as this code has existed; that's orthogonal to the kernel wiring and can be revisited separately if the semantics need correcting.

codecov · 2026-04-16T23:51:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


github-actions Bot added the performance label Apr 16, 2026

justinrmiller and others added 2 commits April 16, 2026 16:19

Merge branch 'main' into jm/sq-use-u8-kernels

1c1a03b

style: rustfmt

630e8f9

justinrmiller wants to merge 3 commits into lance-format:main from justinrmiller:jm/sq-use-u8-kernels
