fix: filter out vectors that could produce infinite l2 distances by wjones127 · Pull Request #4890 · lance-format/lance

wjones127 · 2025-10-02T22:51:54Z

This is a quick fix for #4842.

I think a long-term fix will need to make this aware of the distance type. Or maybe we filter later, when actually computing distances, so we don't have false positives. I know this mostly applies to L2. Cosine (IIRC) normalizes the vectors and doesn't have this problem.
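To make the idea concrete, here is a minimal sketch of a magnitude-based filter for f32 vectors. It is not the code from this PR: the function names `max_safe_abs`, `may_overflow_l2`, and `filter_safe` are hypothetical, and the bound assumes the worst case where two vectors differ by twice the bound in every component. The actual check in `rust/lance-index/src/vector/transform.rs` may differ.

```rust
/// Hypothetical bound: approximate largest per-component magnitude for which
/// the squared L2 distance between two `dim`-dimensional f32 vectors is
/// expected to stay finite.
fn max_safe_abs(dim: usize) -> f32 {
    // Worst case: every component differs by 2 * bound, so the sum of squares
    // is dim * (2 * bound)^2. Solve dim * 4 * bound^2 <= f32::MAX for bound.
    (f32::MAX / (4.0 * dim as f32)).sqrt()
}

/// True if the vector holds a non-finite value or a component above the bound.
fn may_overflow_l2(v: &[f32]) -> bool {
    let bound = max_safe_abs(v.len());
    v.iter().any(|x| !x.is_finite() || x.abs() > bound)
}

/// Keep only the vectors that pass the magnitude check.
fn filter_safe(vectors: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    vectors.into_iter().filter(|v| !may_overflow_l2(v)).collect()
}
```

A check like this is conservative: it looks at each vector in isolation, so it can reject vectors whose actual pairwise distances would have been finite (the false positives mentioned above).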

codecov-commenter · 2025-10-03T00:52:28Z

Codecov Report

❌ Patch coverage is 97.72727% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.86%. Comparing base (e464c0a) to head (cff2da8).

Files with missing lines	Patch %	Lines
rust/lance-index/src/vector/transform.rs	97.50%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4890   +/-   ##
=======================================
  Coverage   80.85%   80.86%           
=======================================
  Files         333      333           
  Lines      131944   131980   +36     
  Branches   131944   131980   +36     
=======================================
+ Hits       106687   106723   +36     
+ Misses      21508    21507    -1     
- Partials     3749     3750    +1

Flag	Coverage Δ
unittests	`80.86% <97.72%> (+<0.01%)`	⬆️


BubbleCal

Thanks for finding this!

I think we'd still run into this issue with cosine/dot, because normalization needs to calculate the length of the vectors, which is essentially the same as calculating the distance to the zero vector.
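A small sketch of that point, using hypothetical helper names (`norm_f32`, `norm_f64`, `normalize`) rather than anything from Lance: the f32 norm squares each component, so it overflows for the same inputs that overflow an L2 distance to the origin, and a naive normalization then divides by infinity.

```rust
/// Euclidean length of a vector, accumulated in f32.
fn norm_f32(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Same length, but accumulated in f64, which has a larger exponent range.
fn norm_f64(v: &[f32]) -> f64 {
    v.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt()
}

/// Naive normalization, as a cosine pre-processing step might do it.
/// If the f32 norm overflows to infinity, every component becomes 0.0.
fn normalize(v: &[f32]) -> Vec<f32> {
    let n = norm_f32(v);
    v.iter().map(|x| x / n).collect()
}
```

With `[1e20, 1e20]` as input, each square is about 1e40, which exceeds `f32::MAX`, so the f32 norm is infinite while the f64 norm is still finite.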

PQ is trained on sample data. If the user was running into this consistently, that means most of their vectors are too long (in norm), and the solution should be to switch the data type to an f64 vector.

And in the case where all vectors are large but close to each other, the L2 distance won't be INF. We should keep these vectors, so I don't think we should filter out vectors by absolute value.
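As an illustration (a sketch with a hypothetical `l2_squared` helper, not Lance's distance kernel), two f32 vectors with components around 1e19 can still have a finite squared L2 distance when they are close, while a distant pair of the same magnitude overflows:

```rust
/// Squared L2 distance accumulated in f32.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Example inputs: `a` and `b` are large but close, `c` is large and far away.
fn example_vectors() -> ([f32; 2], [f32; 2], [f32; 2]) {
    let a = [1.0e19_f32, 1.0e19];
    let b = [1.0001e19_f32, 1.0e19]; // differs from `a` by roughly 1e15
    let c = [-1.0e19_f32, -1.0e19]; // differs from `a` by roughly 2e19 per component
    (a, b, c)
}
```

Here `l2_squared(&a, &b)` is on the order of 1e30, well within f32 range, whereas `l2_squared(&a, &c)` squares a difference of about 2e19 and overflows to infinity. The overflow depends on the pair, not on either vector alone.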

This issue causes an unwrap on None, because the argmin implementation sets the initial min value to INF, then tries to update the min value and get an index. Maybe we can just set the initial value to the first value to avoid the all-INF case?
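The failure mode and the suggested change can be sketched as follows. Both functions are hypothetical stand-ins written for this comment, not Lance's actual argmin, and neither gives NaN inputs special treatment.

```rust
/// Argmin that starts from +INF and only updates on a strict improvement.
/// Returns None when no value is strictly below +INF (e.g. all-INF input),
/// so a caller that unwraps the result would panic.
fn argmin_inf_init(values: &[f32]) -> Option<usize> {
    let mut min = f32::INFINITY;
    let mut idx = None;
    for (i, &v) in values.iter().enumerate() {
        if v < min {
            min = v;
            idx = Some(i);
        }
    }
    idx
}

/// Variant seeded with the first element, so any non-empty input yields an
/// index, even when every value is +INF.
fn argmin_first_init(values: &[f32]) -> Option<usize> {
    let (&first, rest) = values.split_first()?;
    let mut min = first;
    let mut idx = 0;
    for (i, &v) in rest.iter().enumerate() {
        if v < min {
            min = v;
            idx = i + 1; // `rest` starts at position 1 of `values`
        }
    }
    Some(idx)
}
```

With all-INF distances the first variant returns `None`, while the second returns `Some(0)`; the second returns `None` only for an empty slice.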

westonpace

Nice find

related #4842

In #4890, we filter out the large vectors. This PR reverts that and tries its best to retrieve the large vectors:
- if 2 vectors are both large but close to each other, the distance can be finite, so we can retrieve them
- if 2 vectors are both large and the distance is infinite, we just return them in random order

Signed-off-by: BubbleCal <bubble-cal@outlook.com>


fix: filter out vectors that could produce infinite distances

6f9e3e9

github-actions Bot added the bug Something isn't working label Oct 2, 2025

fix inequality

cff2da8

wjones127 marked this pull request as ready for review October 3, 2025 01:09

BubbleCal reviewed Oct 3, 2025

westonpace approved these changes Oct 3, 2025

wjones127 merged commit 9994f69 into lance-format:main Oct 3, 2025
29 checks passed

BubbleCal mentioned this pull request Oct 9, 2025

chore: keep the large vectors #4914

Merged

wjones127 merged 2 commits into lance-format:main from wjones127:fix/finite-l2-distance-fitler
