fix: filter out vectors that could produce infinite l2 distances #4890
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4890 +/- ##
=======================================
Coverage 80.85% 80.86%
=======================================
Files 333 333
Lines 131944 131980 +36
Branches 131944 131980 +36
=======================================
+ Hits 106687 106723 +36
+ Misses 21508 21507 -1
- Partials 3749 3750 +1
BubbleCal left a comment
Thanks for finding this!
I think we'd still run into this issue with cosine/dot, because normalization needs to compute the length of each vector, which is essentially the same as computing its L2 distance to the zero vector.
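To illustrate the point about normalization, here is a minimal sketch (not Lance code; the function names `l2_norm_f32` and `l2_norm_f64` are made up for this example) showing that the norm of an `f32` vector can overflow even though every component is representable:

```rust
/// Euclidean norm accumulated in f32, as a naive normalization step might do.
/// Hypothetical helper, not taken from the Lance code base.
fn l2_norm_f32(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Same norm, but accumulated in f64, which has a much wider range.
fn l2_norm_f64(v: &[f32]) -> f64 {
    v.iter()
        .map(|&x| (x as f64) * (x as f64))
        .sum::<f64>()
        .sqrt()
}

fn main() {
    // 1e20 fits in f32, but its square (1e40) exceeds f32::MAX (about 3.4e38),
    // so the f32 accumulation overflows to infinity.
    let v = vec![1.0e20_f32; 4];
    println!("f32 norm is infinite: {}", l2_norm_f32(&v).is_infinite());
    println!("f64 norm is finite:   {}", l2_norm_f64(&v).is_finite());
}
```

The norm is the L2 distance to the zero vector, so any normalization step that accumulates in `f32` would hit the same overflow.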
PQ is trained on sample data. If a user runs into this consistently, it means most of their vectors are too large in magnitude, and the solution should be to switch the vector data type to f64.
And when all vectors are large but close to each other, the L2 distance won't be INF. We should keep these vectors, so I don't think we should filter out vectors by absolute value.
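A small sketch of that case, assuming a plain squared-L2 helper (`l2_squared` is a name invented here, not the Lance kernel): two large vectors that are close to each other have a finite distance, while the distance from either one to the zero vector overflows.

```rust
/// Squared L2 distance accumulated in f32.
/// Hypothetical helper for illustration only.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn main() {
    let a = vec![1.0e20_f32; 4];
    let b = vec![1.0001e20_f32; 4];
    let zero = vec![0.0_f32; 4];

    // The per-component difference is about 1e16, so its square (about 1e32)
    // still fits in f32.
    println!("a vs b is finite:      {}", l2_squared(&a, &b).is_finite());
    // The per-component difference is 1e20, so its square overflows f32.
    println!("a vs zero is infinite: {}", l2_squared(&a, &zero).is_infinite());
}
```

A filter based only on the absolute values of the components would drop both `a` and `b`, even though the distance between them is perfectly usable.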
This issue causes an unwrap on None because the argmin implementation sets the initial min value to INF, then tries to update the min value and get an index. Maybe we can just set the initial value to the first value to avoid the all-INF case?
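The failure mode and the suggested change can be sketched as follows. This is a simplified stand-in, not the actual argmin implementation; both function names are hypothetical.

```rust
/// Argmin seeded with INFINITY. With a strict `<` comparison, no element of an
/// all-INF input ever beats the seed, so no index is recorded.
fn argmin_inf_seed(values: &[f32]) -> Option<usize> {
    let mut min = f32::INFINITY;
    let mut idx = None;
    for (i, &v) in values.iter().enumerate() {
        if v < min {
            min = v;
            idx = Some(i);
        }
    }
    idx
}

/// Argmin seeded with the first element. It returns None only for empty input,
/// so an all-INF input yields index 0.
fn argmin_first_seed(values: &[f32]) -> Option<usize> {
    let (&first, rest) = values.split_first()?;
    let mut min = first;
    let mut idx = 0;
    for (i, &v) in rest.iter().enumerate() {
        if v < min {
            min = v;
            idx = i + 1; // `rest` starts at original position 1
        }
    }
    Some(idx)
}

fn main() {
    let all_inf = [f32::INFINITY, f32::INFINITY];
    // Calling `.unwrap()` on the first result would panic.
    println!("{:?}", argmin_inf_seed(&all_inf));
    println!("{:?}", argmin_first_seed(&all_inf));
}
```

With the first-element seed, callers that unwrap the result only need a non-empty input, not at least one finite distance.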
related #4842

In #4890, we filtered out the large vectors. This PR reverts that and makes a best effort to retrieve the large vectors:
- if two vectors are both large but close to each other, the distance can be finite, so we can retrieve them
- if two vectors are both large and the distance is infinite, we just return them in random order

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This is a quick fix for #4842.
I think a long-term fix will need to make this aware of the distance type. Or maybe we filter when actually computing distances later, so we don't have false positives. I know this mostly applies to L2. Cosine (IIRC) normalizes the vectors and doesn't have this problem.
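One way the "filter when actually computing distances" idea could look is sketched below. This is an assumption about a possible shape, not what this PR implements; `l2_squared` and `finite_l2_candidates` are hypothetical names.

```rust
/// Squared L2 distance accumulated in f32 (hypothetical helper).
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Returns (row index, distance) pairs sorted by ascending distance, dropping
/// only the rows whose distance to the query is not finite.
fn finite_l2_candidates(query: &[f32], rows: &[Vec<f32>]) -> Vec<(usize, f32)> {
    let mut out: Vec<(usize, f32)> = rows
        .iter()
        .enumerate()
        .map(|(i, row)| (i, l2_squared(query, row)))
        .filter(|(_, d)| d.is_finite())
        .collect();
    // total_cmp gives a total order on f32, so the sort cannot panic.
    out.sort_by(|a, b| a.1.total_cmp(&b.1));
    out
}

fn main() {
    let query = vec![1.0e20_f32; 2];
    let rows = vec![
        vec![0.0_f32; 2],      // distance overflows to INF, so it is dropped
        vec![1.0001e20_f32; 2], // large but close to the query, so it is kept
        vec![1.0e20_f32; 2],   // identical to the query, distance 0
    ];
    for (i, d) in finite_l2_candidates(&query, &rows) {
        println!("row {i}: {d}");
    }
}
```

Because the decision is made on the computed distance rather than on the magnitude of the components, large vectors that are close to the query are not rejected as false positives.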