
perf: cache global BM25 idf per query#5727

Merged
BubbleCal merged 2 commits into main from yang/cache-bm25-idf
Jan 16, 2026

Conversation

@BubbleCal
Contributor

This can avoid scanning all partitions each time a token is scored, which helps when there are many partitions and the query is long.

@github-actions
Contributor

Code Review

Summary: This PR caches the BM25 IDF (query weight) per token during search, avoiding redundant computation across partitions. The optimization is straightforward and correct.

Assessment: ✅ LGTM

The change correctly separates the BM25 score computation into its two parts:

  1. Query weight (IDF): Now cached per unique token in idf_cache
  2. Doc weight: Computed per document as before
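The two-part split can be sketched as follows. This is a minimal illustration using the common Okapi BM25 formulation, not the code from this PR: the constants `K1` and `B`, the function signatures, and the exact IDF formula are assumptions chosen for the example.

```rust
/// Illustrative BM25 parameters (common defaults; assumed, not taken from the PR).
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Query weight (IDF) for one token. It depends only on corpus-wide counts,
/// so it is the same for every document and every partition.
fn query_weight(total_docs: u64, docs_with_token: u64) -> f32 {
    let n = total_docs as f32;
    let df = docs_with_token as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Doc weight for one (token, document) pair. It depends on the term
/// frequency and on the document length relative to the average length.
fn doc_weight(freq: u32, doc_length: u32, avg_doc_length: f32) -> f32 {
    let f = freq as f32;
    let norm = K1 * (1.0 - B + B * doc_length as f32 / avg_doc_length);
    f * (K1 + 1.0) / (f + norm)
}

/// Full score for one token in one document: the product of the two parts.
fn score(idf: f32, freq: u32, doc_length: u32, avg_doc_length: f32) -> f32 {
    idf * doc_weight(freq, doc_length, avg_doc_length)
}
```

Because `query_weight` takes no per-document input, its result can be computed once per token and passed into `score` for every document.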

What's good:

  • The math is preserved: score = query_weight(token) * doc_weight(freq, doc_length) matches the existing Scorer::score implementation
  • The test validates correctness by checking that scores match the expected IDF across partitions
  • Using HashMap<String, f32> for the cache is appropriate given query tokens are typically few

Minor observations (not blocking):

  • The cache is local to the search call, so no memory concerns across requests
  • For queries with many unique tokens, the saving grows with the partition count, since each token's IDF is computed once per query rather than once per partition
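A rough sketch of the caching idea, assuming a hypothetical `Partition` type that exposes per-partition document counts and document frequencies. The type, the helper names, and the IDF formula are illustrative assumptions; only the `HashMap<String, f32>` cache named `idf_cache` comes from the review above.

```rust
use std::collections::HashMap;

/// Hypothetical per-partition statistics (not the actual Lance types).
struct Partition {
    num_docs: u64,
    doc_freq: HashMap<String, u64>,
}

/// Global IDF for a token: this is the step that scans every partition.
fn global_idf(token: &str, partitions: &[Partition]) -> f32 {
    let total: u64 = partitions.iter().map(|p| p.num_docs).sum();
    let df: u64 = partitions
        .iter()
        .map(|p| p.doc_freq.get(token).copied().unwrap_or(0))
        .sum();
    let (n, df) = (total as f32, df as f32);
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Build the per-query cache: each unique token triggers one scan over
/// the partitions, and repeated tokens reuse the stored value.
fn build_idf_cache(query_tokens: &[&str], partitions: &[Partition]) -> HashMap<String, f32> {
    let mut idf_cache = HashMap::new();
    for &token in query_tokens {
        idf_cache
            .entry(token.to_string())
            .or_insert_with(|| global_idf(token, partitions));
    }
    idf_cache
}
```

The cache is created inside the search call and dropped when the call returns, so it holds at most one entry per unique query token and nothing persists across requests.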

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov

codecov Bot commented Jan 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@BubbleCal BubbleCal merged commit 7428ed4 into main Jan 16, 2026
45 of 48 checks passed
@BubbleCal BubbleCal deleted the yang/cache-bm25-idf branch January 16, 2026 11:04
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
this could avoid scanning all partitions for scoring each token, helpful
when we have many partitions and long query

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
vivek-bharathan pushed a commit to vivek-bharathan/lance that referenced this pull request Feb 2, 2026

2 participants