fix: fts flat search drops rows when avg_doc_length < 1.0 #5897
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Code Review

Summary: Clean bug fix for FTS flat search dropping rows when `avg_doc_length` < 1.0. No P0/P1 issues identified. The fix is correct.

LGTM ✓
Two integer arithmetic bugs in BM25 scoring caused scores to be 0 for unindexed rows when indexed data has fractional average document length (e.g. single-word stop words). Rows with score 0 are filtered out, silently dropping results.

1. `MemBM25Scorer::avg_doc_length()` used integer division, truncating values < 1.0 to 0.
2. `flat_bm25_search_stream` reconstructed `total_tokens` by casting the float avg back to u64, losing precision.

Fixes lance-format#5871

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
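To illustrate the first bug, here is a minimal standalone sketch of integer versus float division for an average document length. The function names, the `f32` return type, and the zero-document guard are assumptions made for illustration; this is not the actual `MemBM25Scorer` code.

```rust
/// Hypothetical stand-in for the pre-fix behavior: the division happens on
/// integers, so any average below 1.0 is truncated to 0 before the cast.
fn avg_doc_length_truncating(total_tokens: u64, num_docs: u64) -> f32 {
    if num_docs == 0 {
        return 0.0; // assumed guard, not taken from the PR
    }
    (total_tokens / num_docs) as f32
}

/// Hypothetical stand-in for the post-fix behavior: both operands are
/// converted to float first, so fractional averages survive.
fn avg_doc_length_float(total_tokens: u64, num_docs: u64) -> f32 {
    if num_docs == 0 {
        return 0.0; // assumed guard, not taken from the PR
    }
    total_tokens as f32 / num_docs as f32
}

fn main() {
    // One surviving token spread over four indexed documents.
    let (total_tokens, num_docs) = (1u64, 4u64);
    println!(
        "integer division: {}",
        avg_doc_length_truncating(total_tokens, num_docs)
    );
    println!(
        "float division:   {}",
        avg_doc_length_float(total_tokens, num_docs)
    );
}
```

With one token across four documents, the integer version yields zero while the float version keeps the fractional average.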
d618829 to 0274ffd
Codecov Report

✅ All modified and coverable lines are covered by tests.
Summary
1. `MemBM25Scorer::avg_doc_length()` used integer division (`total_tokens / num_docs`), truncating values < 1.0 to 0. Changed to float division.
2. `flat_bm25_search_stream` reconstructed `total_tokens` by casting the float avg back to `u64` (`avg_doc_length() as u64 * num_docs`), losing precision. Added `total_tokens()` accessor to `IndexBM25Scorer` to pass the exact value through.

Fixes #5871
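The precision loss in the second item can be reproduced in isolation. The sketch below reuses the cast expression quoted above inside a hypothetical helper (`total_tokens_via_cast` is an illustrative name, and the `f32` type is an assumption); it does not reproduce `flat_bm25_search_stream` itself.

```rust
/// Hypothetical helper mirroring the quoted expression
/// `avg_doc_length() as u64 * num_docs`.
/// In Rust, `as` binds tighter than `*`, so the average is truncated to an
/// integer before it is multiplied by the document count.
fn total_tokens_via_cast(avg_doc_length: f32, num_docs: u64) -> u64 {
    avg_doc_length as u64 * num_docs
}

fn main() {
    // (exact token total, number of documents): illustrative values only.
    let cases: [(u64, u64); 3] = [(1, 4), (10, 4), (12, 4)];
    for (exact_total, num_docs) in cases {
        let avg = exact_total as f32 / num_docs as f32;
        let rebuilt = total_tokens_via_cast(avg, num_docs);
        println!(
            "exact = {exact_total}, avg = {avg}, rebuilt via cast = {rebuilt}"
        );
    }
}
```

Only totals that divide evenly by the document count survive the round trip, which is why passing the exact `total_tokens` value through an accessor avoids the problem.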
Test plan
- `test_fts_unindexed_data_with_stop_words`: indexes 4 single-word rows (3 stop words) so `avg_doc_length = 0.25`, appends 10 unindexed rows, and asserts all 10 are returned by FTS query. Verified it fails without the fix (returns 7) and passes with it.
- `test_fts_unindexed_data` still passes
- `lance-index` tests pass

🤖 Generated with Claude Code
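For readers wondering how four single-word rows produce an average of 0.25, here is a rough sketch of the arithmetic. The row contents, the stop-word list, and the whitespace tokenizer are all hypothetical; the real test uses Lance's own tokenizer and data.

```rust
use std::collections::HashSet;

/// Count the tokens in `text` that remain after dropping stop words.
/// Whitespace splitting and lowercasing are simplifying assumptions.
fn count_tokens(text: &str, stop_words: &HashSet<&str>) -> u64 {
    text.split_whitespace()
        .filter(|w| !stop_words.contains(w.to_lowercase().as_str()))
        .count() as u64
}

/// Average number of surviving tokens per row, computed in floating point.
fn avg_doc_length(rows: &[&str], stop_words: &HashSet<&str>) -> f32 {
    if rows.is_empty() {
        return 0.0;
    }
    let total: u64 = rows.iter().map(|r| count_tokens(r, stop_words)).sum();
    total as f32 / rows.len() as f32
}

fn main() {
    // Hypothetical data: three rows are stop words, one row is a real term.
    let stop_words: HashSet<&str> = HashSet::from(["the", "a", "of"]);
    let rows = ["the", "a", "of", "lance"];
    println!("avg_doc_length = {}", avg_doc_length(&rows, &stop_words));
}
```

Three of the four rows contribute no tokens, so a single token is averaged over four documents.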