
fix: fts flat search drops rows when avg_doc_length < 1.0 #5897

Merged
wjones127 merged 1 commit into lance-format:main from wjones127:fix/fts-flat-search-fractional-avg-doc-length
Feb 5, 2026

@wjones127

Summary

  • `MemBM25Scorer::avg_doc_length()` used integer division (`total_tokens / num_docs`), truncating values < 1.0 to 0. Changed to float division.
  • `flat_bm25_search_stream` reconstructed `total_tokens` by casting the float average back to `u64` (`avg_doc_length() as u64 * num_docs`), losing precision. Added a `total_tokens()` accessor to `IndexBM25Scorer` to pass the exact value through.
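
The first bullet can be illustrated with a minimal standalone sketch. The two free functions below are hypothetical stand-ins for the before and after bodies of `avg_doc_length()`, not the actual Lance code; the zero-document guard is an assumption added so the sketch never divides by zero.

```rust
/// Hypothetical "before" version: the division is done on integers,
/// so any average below 1.0 is truncated to 0 before the cast to f32.
fn avg_doc_length_truncating(total_tokens: u64, num_docs: u64) -> f32 {
    if num_docs == 0 {
        return 0.0; // assumed guard, not taken from the PR
    }
    (total_tokens / num_docs) as f32
}

/// Hypothetical "after" version: both operands are converted to f32
/// first, so the fractional part of the average survives.
fn avg_doc_length_float(total_tokens: u64, num_docs: u64) -> f32 {
    if num_docs == 0 {
        return 0.0; // assumed guard, not taken from the PR
    }
    total_tokens as f32 / num_docs as f32
}
```

With 1 token spread over 4 documents, `avg_doc_length_truncating(1, 4)` evaluates to `0.0`, while `avg_doc_length_float(1, 4)` evaluates to `0.25`.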

Fixes #5871

Test plan

  • New test `test_fts_unindexed_data_with_stop_words`: indexes 4 single-word rows, 3 of which are stop words, so avg_doc_length = 0.25; appends 10 unindexed rows; and asserts that the FTS query returns all 10. Verified it fails without the fix (returns 7) and passes with it.
  • Existing `test_fts_unindexed_data` still passes
  • lance-index tests pass

🤖 Generated with Claude Code

github-actions Bot commented Feb 5, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

github-actions Bot added the bug label (Something isn't working) Feb 5, 2026
github-actions Bot commented Feb 5, 2026

Code Review

Summary: Clean bug fix for FTS flat search dropping rows when avg_doc_length < 1.0.

No P0/P1 issues identified. The fix is correct:

  1. Integer division bug (scorer.rs:59): Changing `(total_tokens / num_docs) as f32` to `total_tokens as f32 / num_docs as f32` correctly preserves fractional averages.

  2. Precision loss bug (index.rs:2541): Adding a `total_tokens()` accessor avoids the lossy round-trip `avg_doc_length() as u64 * num_docs`.

  3. Test coverage: The new test directly exercises the edge case (avg_doc_length = 0.25) and validates the fix.
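
The lossy round-trip in item 2 can be reproduced in isolation. This is a minimal sketch under the assumption that the reconstruction was exactly the expression quoted above; `total_tokens_round_trip` is a hypothetical helper name, not a function in Lance.

```rust
/// Hypothetical reconstruction of the token count from the average.
/// `as` binds tighter than `*`, so the average is truncated to a whole
/// number of tokens per document before the multiplication.
fn total_tokens_round_trip(avg_doc_length: f32, num_docs: u64) -> u64 {
    avg_doc_length as u64 * num_docs
}
```

For the test's case of 1 token over 4 documents, `total_tokens_round_trip(0.25, 4)` gives `0` instead of `1`. The loss is not limited to averages below 1.0: 5 tokens over 2 documents has an average of 2.5, and `total_tokens_round_trip(2.5, 2)` gives `4` instead of `5`. Passing the exact count through an accessor sidesteps the cast entirely.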

LGTM ✓

Two integer arithmetic bugs in BM25 scoring caused scores to be 0 for
unindexed rows when indexed data has fractional average document length
(e.g. single-word stop words). Rows with score 0 are filtered out,
silently dropping results.

1. `MemBM25Scorer::avg_doc_length()` used integer division, truncating
   values < 1.0 to 0.
2. `flat_bm25_search_stream` reconstructed `total_tokens` by casting
   the float avg back to u64, losing precision.
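
As a rough illustration of why a zero average can zero out scores, here is the textbook BM25 term score as a sketch. It is not Lance's scorer: the function name, the `K1` and `B` values, and passing `idf` in as a precomputed argument are all assumptions made for the example.

```rust
/// Textbook BM25 contribution of one term to one document's score.
/// K1 and B are common defaults, assumed here for illustration.
fn bm25_term_score(idf: f32, tf: f32, doc_len: f32, avg_doc_length: f32) -> f32 {
    const K1: f32 = 1.2;
    const B: f32 = 0.75;
    // Length normalization: with avg_doc_length == 0.0 and doc_len > 0.0
    // this float division yields +infinity rather than panicking.
    let norm = 1.0 - B + B * doc_len / avg_doc_length;
    // An infinite denominator drives the whole term score to 0.0.
    idf * tf * (K1 + 1.0) / (tf + K1 * norm)
}
```

Under these assumptions, `bm25_term_score(1.0, 1.0, 1.0, 0.0)` is exactly `0.0`, whereas the same call with an average of `0.25` gives a positive score (about 0.45).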

Fixes lance-format#5871

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
wjones127 force-pushed the fix/fts-flat-search-fractional-avg-doc-length branch from d618829 to 0274ffd February 5, 2026 19:31
wjones127 changed the title from "fix: FTS flat search drops rows when avg_doc_length < 1.0" to "fix: fts flat search drops rows when avg_doc_length < 1.0" Feb 5, 2026
codecov Bot commented Feb 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

wjones127 marked this pull request as ready for review February 5, 2026 21:29
wjones127 added the critical-fix label (Bugs that cause crashes, security vulnerabilities, or incorrect data.) Feb 5, 2026
wjones127 merged commit c432eb1 into lance-format:main Feb 5, 2026
38 of 40 checks passed
Labels

bug: Something isn't working
critical-fix: Bugs that cause crashes, security vulnerabilities, or incorrect data.

Development

Successfully merging this pull request may close these issues.

FTS search failing to union unindexed data correctly

3 participants