fix: fts flat search drops rows when avg_doc_length < 1.0 by wjones127 · Pull Request #5897 · lance-format/lance

wjones127 · 2026-02-05T19:21:09Z

Summary

- `MemBM25Scorer::avg_doc_length()` used integer division (`total_tokens / num_docs`), truncating values < 1.0 to 0. Changed to float division.
- `flat_bm25_search_stream` reconstructed `total_tokens` by casting the float avg back to u64 (`avg_doc_length() as u64 * num_docs`), losing precision. Added a `total_tokens()` accessor to `IndexBM25Scorer` to pass the exact value through.
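
The two arithmetic problems above can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: the free functions `avg_truncating`, `avg_float`, and `reconstruct_total` are hypothetical stand-ins for the scorer methods, and they only mirror the expressions quoted in the summary.

```rust
/// Old behaviour (sketch): the division is done on integers, so any
/// average below 1.0 is truncated to 0 before the cast to f32.
fn avg_truncating(total_tokens: u64, num_docs: u64) -> f32 {
    (total_tokens / num_docs) as f32
}

/// Fixed behaviour (sketch): cast both operands first, then divide as floats.
fn avg_float(total_tokens: u64, num_docs: u64) -> f32 {
    total_tokens as f32 / num_docs as f32
}

/// Lossy round trip (sketch): `as` binds tighter than `*`, so the average
/// is truncated to an integer before it is multiplied by `num_docs`.
fn reconstruct_total(avg: f32, num_docs: u64) -> u64 {
    avg as u64 * num_docs
}
```

With 1 token across 4 documents, `avg_truncating(1, 4)` is `0.0` while `avg_float(1, 4)` is `0.25`. The round trip loses information even when the average itself is correct: `reconstruct_total(0.25, 4)` is `0`, and for 10 tokens across 4 documents `reconstruct_total(2.5, 4)` is `8` rather than `10`. Passing the exact token count through, as the new accessor does, sidesteps the round trip entirely.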

Fixes #5871

Test plan

- New test `test_fts_unindexed_data_with_stop_words`: indexes 4 single-word rows (3 of them stop words) so that `avg_doc_length` = 0.25, appends 10 unindexed rows, and asserts that all 10 are returned by the FTS query. Verified that it fails without the fix (returns 7) and passes with it.
- Existing `test_fts_unindexed_data` still passes.
- `lance-index` tests pass.

🤖 Generated with Claude Code

github-actions · 2026-02-05T19:21:24Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

github-actions · 2026-02-05T19:22:06Z

Code Review

Summary: Clean bug fix for FTS flat search dropping rows when avg_doc_length < 1.0.

No P0/P1 issues identified. The fix is correct:

- Integer division bug (scorer.rs:59): Changing `(total_tokens / num_docs) as f32` to `total_tokens as f32 / num_docs as f32` correctly preserves fractional averages.
- Precision loss bug (index.rs:2541): Adding the `total_tokens()` accessor avoids the lossy round-trip `avg_doc_length() as u64 * num_docs`.
- Test coverage: The new test directly exercises the edge case (`avg_doc_length` = 0.25) and validates the fix.

LGTM ✓

Two integer arithmetic bugs in BM25 scoring caused scores to be 0 for unindexed rows when indexed data has fractional average document length (e.g. single-word stop words). Rows with score 0 are filtered out, silently dropping results.

1. `MemBM25Scorer::avg_doc_length()` used integer division, truncating values < 1.0 to 0.
2. `flat_bm25_search_stream` reconstructed `total_tokens` by casting the float avg back to u64, losing precision.

Fixes lance-format#5871

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
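
One plausible way a zero average turns into a zero score can be sketched with the textbook BM25 term-frequency component. This is an illustration under stated assumptions, not Lance's scorer: the function `bm25_tf` is hypothetical, and the constants `K1 = 1.2` and `B = 0.75` are the conventional defaults, assumed here rather than taken from the PR.

```rust
/// Textbook BM25 term-frequency component (sketch, not Lance's implementation).
/// `freq` is the term frequency in the document, `doc_len` its token count.
fn bm25_tf(freq: f32, doc_len: f32, avg_doc_length: f32) -> f32 {
    const K1: f32 = 1.2; // assumed conventional default
    const B: f32 = 0.75; // assumed conventional default
    // Length normalisation: with avg_doc_length == 0.0 and doc_len > 0.0,
    // the division yields +inf under IEEE 754 float semantics.
    let norm = 1.0 - B + B * doc_len / avg_doc_length;
    // An infinite denominator drives the whole component to 0.0.
    freq * (K1 + 1.0) / (freq + K1 * norm)
}
```

Under these assumptions, `bm25_tf(1.0, 3.0, 0.0)` evaluates to exactly `0.0`, whereas `bm25_tf(1.0, 3.0, 0.25)` is small but strictly positive, which matches the commit message's description of rows being filtered out only when the average is truncated to 0.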

codecov · 2026-02-05T20:02:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions Bot added the bug Something isn't working label Feb 5, 2026

wjones127 force-pushed the fix/fts-flat-search-fractional-avg-doc-length branch from d618829 to 0274ffd Compare February 5, 2026 19:31

wjones127 changed the title ~~fix: FTS flat search drops rows when avg_doc_length < 1.0~~ fix: fts flat search drops rows when avg_doc_length < 1.0 Feb 5, 2026

wjones127 marked this pull request as ready for review February 5, 2026 21:29

wjones127 added the critical-fix Bugs that cause crashes, security vulnerabilities, or incorrect data. label Feb 5, 2026

hamersaw approved these changes Feb 5, 2026

jackye1995 approved these changes Feb 5, 2026

wjones127 merged commit c432eb1 into lance-format:main Feb 5, 2026
38 of 40 checks passed

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed
