feat!: support hamming distance & binary vector#3198
feat!: support hamming distance & binary vector#3198BubbleCal merged 15 commits intolance-format:mainfrom
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| pub fn nearest<T: ArrowPrimitiveType>( | ||
| &mut self, | ||
| column: &str, | ||
| q: &PrimitiveArray<T>, |
There was a problem hiding this comment.
this is a breaking change, code like this can't work since now:
.nearest(column, q.as_primitive(), k)
because the compiler can't infer the type
| } | ||
|
|
||
| #[derive(Debug, Clone, DeepSizeOf)] | ||
| pub struct FlatBinQuantizer { |
There was a problem hiding this comment.
We have to add this rather than reusing FlatQuantizer even flat quantizer does nothing, because we determine the VectorStore type by Quantization
| #[case(4, DistanceType::L2, 1.0)] | ||
| #[case(4, DistanceType::Cosine, 1.0)] | ||
| #[case(4, DistanceType::Dot, 1.0)] | ||
| #[case(4, DistanceType::Hamming, 0.9)] |
There was a problem hiding this comment.
even it's flat search, the recall for hamming can't always hit 100%, because there are many vectors that are at the same distance from the query vector
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3198 +/- ##
==========================================
- Coverage 78.55% 78.54% -0.01%
==========================================
Files 244 244
Lines 83213 84298 +1085
Branches 83213 84298 +1085
==========================================
+ Hits 65370 66216 +846
- Misses 15056 15286 +230
- Partials 2787 2796 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…ector Signed-off-by: BubbleCal <bubble-cal@outlook.com>
westonpace
left a comment
There was a problem hiding this comment.
Might still be good for @chebbyChefNEQ or @eddyxu to look at the indexing code but the rest looks ok to me.
Looks like we needed to duplicate a few structs to have a "Bin" version and a "Float" version?
| // deprecated, use `try_from_batch` instead | ||
| pub fn new(vectors: FixedSizeListArray, distance_type: DistanceType) -> Self { | ||
| let row_ids = Arc::new(UInt64Array::from_iter_values(0..vectors.len() as u64)); | ||
| let vectors = Arc::new(vectors); | ||
|
|
||
| let batch = RecordBatch::try_from_iter_with_nullable(vec![ | ||
| (ROW_ID, row_ids.clone() as ArrayRef, true), | ||
| (FLAT_COLUMN, vectors.clone() as ArrayRef, true), | ||
| ]) | ||
| .unwrap(); | ||
|
|
||
| Self { | ||
| batch, | ||
| distance_type, | ||
| row_ids, | ||
| vectors, | ||
| } | ||
| } |
There was a problem hiding this comment.
Can we avoid this deprecated method in Bin path?
| } else if dt.data_type().is_floating() { | ||
| coerce_float_vector( | ||
| q.as_any().downcast_ref::<Float32Array>().unwrap(), | ||
| FloatType::try_from(dt.data_type())?, | ||
| )? |
There was a problem hiding this comment.
I think the error message isn't quite right here. If dt is not a floating point type and q is a vector, but does not exactly match dt (e.g. UInt16 or something) then the error the user will get is Column {} is not a vector column. However, it would be nicer to give them an error like Column {} has type UInt8 and the query vector is UInt16.
There was a problem hiding this comment.
good point, fixed!
| .try_collect::<Vec<_>>() | ||
| .await?; | ||
| concat_batches(&Arc::new(ArrowSchema::from(&projection)), &batches)? | ||
| Box::pin(scanner.try_into_batch()).await? |
There was a problem hiding this comment.
I wonder if it would be more thorough for try_into_batch to return a boxed future instead of being async?
it's avoidable, by making |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| resolver = "2" | ||
|
|
||
| [workspace.package] | ||
| version = "0.20.1" |
There was a problem hiding this comment.
this PR introduces a breaking change so bump the version
| let data = generate_random_array_with_seed::<Float32Type>(TOTAL * DIMENSION, SEED); | ||
| let fsl = FixedSizeListArray::try_new_from_values(data, DIMENSION as i32).unwrap(); | ||
| let vectors = Arc::new(FlatStorage::new(fsl.clone(), DistanceType::L2)); | ||
| let vectors = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2)); |
There was a problem hiding this comment.
Ah, do we need FlatStorage to be typed? This is sad
There was a problem hiding this comment.
yeah... unfortunately now we have to do this.
there are 2 ways to avoid this in my mind but they would introduce some new issues as well:
- storing
ArrayRefin dist calculator then cast it to&PrimitiveArray<T>according distance type / column type. This is what we did before, it's slow - just storing everything in bytes ref
&[u8]then do unsafe cast
No description provided.