fix: filter out vectors that could produce infinite l2 distances #4890

Merged
wjones127 merged 2 commits into lance-format:main from wjones127:fix/finite-l2-distance-fitler
Oct 3, 2025
@wjones127
Contributor

This is a quick fix for #4842.

I think a long-term fix will need to make this filter aware of the distance type. Alternatively, we could filter later, when actually computing distances, so we don't get false positives. I know this mostly applies to L2; cosine (IIRC) normalizes the vectors and doesn't have this problem.
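To illustrate the failure mode, here is a minimal sketch of how a squared L2 distance accumulated in `f32` overflows to infinity, together with a hypothetical absolute-value pre-filter. The names `l2_squared_f32` and `is_safe_for_l2` and the bound used are assumptions for illustration only, not the code in `transform.rs`.

```rust
/// Squared L2 distance accumulated in f32, as a naive kernel might do.
fn l2_squared_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical pre-filter (an assumption, not Lance's actual check):
/// accept a vector only if every component is finite and small enough
/// that the squared distance between two accepted vectors stays at or
/// below roughly f32::MAX.
fn is_safe_for_l2(v: &[f32]) -> bool {
    // For accepted x and y, |x - y| <= 2 * bound, so the sum over `dim`
    // components is at most dim * (2 * bound)^2, which is about f32::MAX.
    let bound = (f32::MAX / v.len() as f32).sqrt() / 2.0;
    v.iter().all(|x| x.is_finite() && x.abs() <= bound)
}
```

A filter like this is conservative by construction: it only looks at one vector at a time, so it rejects vectors whose pairwise distances might still be finite, which is the false-positive concern raised above.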

github-actions bot added the bug (Something isn't working) label on Oct 2, 2025
@codecov-commenter

Codecov Report

❌ Patch coverage is 97.72727% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.86%. Comparing base (e464c0a) to head (cff2da8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/vector/transform.rs | 97.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4890   +/-   ##
=======================================
  Coverage   80.85%   80.86%           
=======================================
  Files         333      333           
  Lines      131944   131980   +36     
  Branches   131944   131980   +36     
=======================================
+ Hits       106687   106723   +36     
+ Misses      21508    21507    -1     
- Partials     3749     3750    +1     
| Flag | Coverage Δ |
|---|---|
| unittests | 80.86% <97.72%> (+<0.01%) ⬆️ |


wjones127 marked this pull request as ready for review on October 3, 2025, 01:09
Contributor

@BubbleCal BubbleCal left a comment

Thanks for finding this!

I think we'd still run into this issue with cosine/dot, because normalization needs to calculate the length of each vector, which is essentially the same as calculating its distance to the zero vector.
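A small sketch of that point, assuming a naive `f32` norm: the sum of squares overflows just as the L2 distance does, so normalization divides by infinity and collapses the vector to zeros. The helper names `norm_f32` and `normalize_f32` are hypothetical.

```rust
/// Euclidean norm accumulated in f32 (the distance to the zero vector).
fn norm_f32(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Naive normalization: divides each component by the norm.
/// If the norm overflowed to +INF, every finite component becomes 0.0,
/// so the direction of the vector is lost.
fn normalize_f32(v: &[f32]) -> Vec<f32> {
    let n = norm_f32(v);
    v.iter().map(|x| x / n).collect()
}
```

For a vector such as `[1.0e20; 4]`, each square already exceeds `f32::MAX`, so the norm is infinite before the square root is taken.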

PQ is trained on sample data, so if the user was running into this consistently, most of their vectors are probably too large in magnitude. In that case, the solution should be to switch the vector data type to f64.
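As a rough sketch of why f64 helps, the same inputs that overflow an `f32` accumulator give a finite squared distance once stored as `f64`. The helpers `widen` and `l2_squared_f64` are hypothetical names used only for this illustration.

```rust
/// Converts f32 components to f64 losslessly.
fn widen(v: &[f32]) -> Vec<f64> {
    v.iter().copied().map(f64::from).collect()
}

/// Squared L2 distance accumulated in f64.
/// f64::MAX is about 1.8e308, so magnitudes near 1e20 are far from overflow.
fn l2_squared_f64(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```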

Also, when all vectors are large but close to each other, the L2 distance won't be INF. We should keep those vectors, so I don't think we should filter vectors out by absolute value.
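A minimal sketch of the large-but-close case: both vectors have components near 1e20, yet their squared distance in `f32` is finite, while a hypothetical absolute-value check would still reject them. The function names and the threshold parameter are assumptions for illustration.

```rust
/// Squared L2 distance accumulated in f32.
fn l2_squared_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical absolute-value filter: true if any component's magnitude
/// exceeds `threshold`, regardless of how close other vectors are.
fn exceeds_abs_threshold(v: &[f32], threshold: f32) -> bool {
    v.iter().any(|x| x.abs() > threshold)
}
```

With `a = [1.0e20; 4]` and `b = [1.0001e20; 4]`, each component difference is around 1e16, so the squared distance is on the order of 1e32, well inside the `f32` range.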

This issue causes an unwrap on None because the argmin implementation sets the initial min value to INF, then tries to update the min value and get an index. If every distance is INF, nothing is ever smaller than the initial value, so no index is recorded. Maybe we can just set the initial value to the first value to avoid the all-INF case?
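The two seeding strategies can be sketched as follows. This is an illustration of the behavior described in the comment, not Lance's actual argmin implementation; both function names are hypothetical.

```rust
/// Argmin seeded with +INF. A candidate only wins through a strict `<`,
/// so an input where every value is +INF never records an index.
fn argmin_inf_seed(values: &[f32]) -> Option<usize> {
    let mut min = f32::INFINITY;
    let mut idx = None;
    for (i, &v) in values.iter().enumerate() {
        if v < min {
            min = v;
            idx = Some(i);
        }
    }
    idx
}

/// Argmin seeded with the first element. Any non-empty input yields an
/// index, even when all values are +INF; only an empty input gives None.
fn argmin_first_seed(values: &[f32]) -> Option<usize> {
    let (&first, rest) = values.split_first()?;
    let mut min = first;
    let mut idx = 0;
    for (i, &v) in rest.iter().enumerate() {
        if v < min {
            min = v;
            idx = i + 1; // `rest` starts at original index 1
        }
    }
    Some(idx)
}
```

With the first-element seed, an all-INF input resolves to index 0 instead of `None`, so a caller that unwraps the result no longer panics. Which index is picked among equally infinite distances is arbitrary in this sketch.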

Member

@westonpace westonpace left a comment

Nice find

@wjones127 wjones127 merged commit 9994f69 into lance-format:main Oct 3, 2025
29 checks passed
BubbleCal added a commit that referenced this pull request Oct 9, 2025
Related: #4842
In #4890, we filtered out the large
vectors. This PR reverts that and tries its best to retrieve the large
vectors:
- if 2 vectors are both large but close to each other, the distance can
be finite, so we can retrieve them
- if 2 vectors are both large and the distance is infinite, we just
return them in random order

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…ce-format#4890)

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026