perf: in-register lookup table & SIMD for 4bit PQ #3178
BubbleCal merged 30 commits into lance-format:main
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3178      +/-   ##
==========================================
- Coverage   78.62%   78.51%    -0.12%
==========================================
  Files         243      244        +1
  Lines       82889    83213      +324
  Branches    82889    83213      +324
==========================================
+ Hits        65170    65331      +161
- Misses      14933    15099      +166
+ Partials     2786     2783        -3
```
```rust
// let qmax = distance_table
//     .chunks(NUM_CENTROIDS)
//     .tuple_windows()
//     .map(|(a, b)| {

// Quantize the distance table to u8
// returns quantized_distance_table
// used for only 4bit PQ so num_centroids must be 16
```
Can you add a comment about what the returns are?
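For context, here is a hedged scalar sketch of what a `quantize_distance_table` as described in the comments above might return; the function name and exact rounding are assumptions, not the actual lance implementation:

```rust
// Sketch only: quantize an f32 distance table into u8 so 4-bit PQ can use
// byte-sized lookups. Returns (qmin, qmax, quantized_table); qmin and qmax
// are kept so the accumulated u8 distances can be mapped back to f32 later.
fn quantize_distance_table(table: &[f32]) -> (f32, f32, Vec<u8>) {
    let qmin = table.iter().copied().fold(f32::INFINITY, f32::min);
    let qmax = table.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a constant table to avoid dividing by zero.
    let scale = if qmax > qmin { 255.0 / (qmax - qmin) } else { 0.0 };
    let quantized = table
        .iter()
        .map(|&d| ((d - qmin) * scale).round() as u8)
        .collect();
    (qmin, qmax, quantized)
}
```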
```diff
 let pq_codes = UInt8Array::from_iter_values(pq_codes);
 let transposed_codes = transpose(&pq_codes, num_vectors, num_sub_vectors);
-let distances = compute_l2_distance(
+let distances = compute_pq_distance(
```
compute_l2_distance and compute_dot_distance are the same, so keep only one; the difference is in building the distance table.
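The `transpose` call in the hunk above reorders the codes so that each sub-vector's codes become contiguous, which is what lets one SIMD lookup serve many vectors at once. A minimal scalar sketch, assuming a vector-major input layout and sub-vector-major output (not the actual lance implementation, which operates on Arrow arrays):

```rust
// Sketch: transpose PQ codes from [num_vectors x num_sub_vectors]
// (vector-major) to [num_sub_vectors x num_vectors] (sub-vector-major),
// so all codes for one sub-vector are contiguous for SIMD lookups.
fn transpose(codes: &[u8], num_vectors: usize, num_sub_vectors: usize) -> Vec<u8> {
    let mut out = vec![0u8; codes.len()];
    for v in 0..num_vectors {
        for s in 0..num_sub_vectors {
            out[s * num_vectors + v] = codes[v * num_sub_vectors + s];
        }
    }
    out
}
```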
```rust
#[derive(Clone, Copy)]
pub struct u8x16(pub __m128i);

/// 16 of 32-bit `f32` values. Use 512-bit SIMD if possible.
```
```rust
}

#[inline]
pub fn right_shift_4(self) -> Self {
```
Is this API compatible with portable_simd?

I didn't see a bit-shifting operation in portable_simd.
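For illustration, the scalar equivalent of what `right_shift_4` is for: each packed byte holds two 4-bit PQ codes, the low nibble comes out with a `0x0F` mask, and a 4-bit right shift exposes the high nibble. This is a sketch of the nibble semantics, not the SIMD implementation (the function name is hypothetical):

```rust
// Sketch: split each packed byte into its two 4-bit PQ codes.
// Low nibble = b & 0x0F, high nibble = b >> 4; the SIMD version does the
// same lane-wise across 16 bytes at once.
fn split_nibbles(packed: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let low: Vec<u8> = packed.iter().map(|&b| b & 0x0F).collect();
    let high: Vec<u8> = packed.iter().map(|&b| b >> 4).collect();
    (low, high)
}
```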
```rust
}
#[cfg(target_arch = "loongarch64")]
unsafe {
    Self(lasx_xvfrsh_b(self.0, 4))
```
Huh, you figured out how to use loongarch?

lol no, there's no way to test it; let me remove all the loongarch code for u8x16.
```rust
unsafe {
    Self(vandq_u8(self.0, vdupq_n_u8(mask)))
}
#[cfg(target_arch = "loongarch64")]
```
Shall we always have a fallback for the non-SIMD route?
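A scalar fallback keeps the code correct on targets without the cfg-gated intrinsics, and the compiler can often auto-vectorize it anyway. A hedged sketch of what such a fallback for the lane-wise AND could look like (the name and array-based signature are assumptions for illustration):

```rust
// Sketch: scalar fallback with the same lane-wise semantics as
// vandq_u8 / _mm_and_si128, compilable on any target.
fn bitwise_and_fallback(lanes: &[u8; 16], mask: u8) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, l) in out.iter_mut().zip(lanes.iter()) {
        *o = l & mask; // identical result to the SIMD lane-wise AND
    }
    out
}
```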
```rust
fn reduce_min(&self) -> u8 {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        let low = _mm_and_si128(self.0, _mm_set1_epi8(0xFF_u8 as i8));
```
Is this only using SSE? Curious whether there is AVX2-related code that could make this even faster.

I didn't find an AVX2 intrinsic to do this, but reduce_min is not used for now.

Let's just delete reduce_sum and reduce_min if they are not used.
```rust
#[cfg(target_arch = "aarch64")]
unsafe {
    Self(vminq_u8(self.0, rhs.0))
}
```
Let's always have a fallback route.
```rust
#[case(4, DistanceType::L2, 0.9)]
#[case(4, DistanceType::Cosine, 0.9)]
#[case(4, DistanceType::Dot, 0.8)]
#[case(4, DistanceType::L2, 0.75)]
```
You mentioned the new algorithm can have decent recall? Should we bump this up?
```rust
let (qmin, qmax, distance_table) = quantize_distance_table(distance_table);
let num_vectors = code.len() * 2 / num_sub_vectors;
let mut distances = vec![0.0_f32; num_vectors];
// store the distances in u32 to avoid overflow
```
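Given the `(qmin, qmax, distance_table)` tuple above, the u32 accumulator can be mapped back to an f32 distance at the end. A sketch under the assumption that each u8 entry encodes `(d - qmin) * 255 / (qmax - qmin)` (the function name is hypothetical); accumulating in u32 matters because a sum over many sub-vectors of values up to 255 overflows u8 and can overflow u16:

```rust
// Sketch: dequantize a summed u8 distance back to f32. Each u8 entry is
// (d - qmin) * 255 / (qmax - qmin), so summing num_sub_vectors entries
// and inverting the affine map recovers an approximate f32 distance.
fn dequantize_sum(sum: u32, qmin: f32, qmax: f32, num_sub_vectors: usize) -> f32 {
    qmin * num_sub_vectors as f32 + sum as f32 * (qmax - qmin) / 255.0
}
```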
```rust
debug_assert_eq!(dist_table.as_array(), origin_dist_table.as_array());

// compute next distances
let next_indices = vec_indices.right_shift_4();
```
Should we just implement Shr for u8x16? This interface looks weird.
```rust
fn shuffle(&self, indices: u8x16) -> Self {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        Self(_mm_shuffle_epi8(self.0, indices.0))
```
Would it be faster if we used _mm256_shuffle_epi8 (u8x32)? https://doc.rust-lang.org/beta/core/arch/x86_64/fn._mm256_shuffle_epi8.html

Yeah, I believe so. I chose u8x16 because it fits in an ARM register.

We could implement u8x32 as two 128-bit registers on ARM. In general this can speed up older x86 CPUs a lot, similar to https://github.com/lancedb/lance/blob/main/rust/lance-linalg/src/simd/f32.rs#L462

Doable, but let's do this in the next PR? It would require changing the computation logic as well, because there are only 16 centroids in distance_table for each sub-vector.
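To make the in-register lookup concrete: with 16 centroids per sub-vector, the entire u8 distance table for one sub-vector fits in a single 128-bit register, and `_mm_shuffle_epi8` then performs 16 table lookups in one instruction. A scalar model of that one shuffle (illustrative only; the real code does this lane-wise in a register):

```rust
// Sketch: scalar model of _mm_shuffle_epi8 used as a 16-entry table
// lookup. `table` is one sub-vector's quantized distance table (16
// centroids); `codes` are sixteen 4-bit PQ codes, one per vector.
fn lookup16(table: &[u8; 16], codes: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, &c) in out.iter_mut().zip(codes.iter()) {
        // Only the low 4 bits select a table entry, as in 4-bit PQ.
        *o = table[(c & 0x0F) as usize];
    }
    out
}
```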
```rust
.into_iter()
.zip(distances.iter_mut())
.for_each(|(d, sum)| {
    *sum += d as f32;
```
No, `sum` is `distances[i]`.
4-bit PQ is 3x faster than before:

Posting the 8-bit PQ results here for comparison; in short, 4-bit PQ is about 2x faster with the same index params: