perf: cache global BM25 idf per query by BubbleCal · Pull Request #5727 · lance-format/lance

BubbleCal · 2026-01-16T05:49:58Z

This avoids scanning all partitions to score each token, which helps when there are many partitions and the query is long.

github-actions · 2026-01-16T05:52:08Z

Code Review

Summary: This PR caches the BM25 IDF (query weight) per token during search, avoiding redundant computation across partitions. The optimization is straightforward and correct.

Assessment: ✅ LGTM

The change correctly separates the BM25 score computation into its two parts:

Query weight (IDF): Now cached per unique token in idf_cache
Doc weight: Computed per document as before
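
For reference, the split can be written out using the textbook BM25 form. This is an assumption for illustration: the PR does not spell out Lance's exact formula or constants. Here $N$ is the total document count across all partitions, $n_t$ the number of documents containing token $t$, $f_{t,d}$ the frequency of $t$ in document $d$, and $k_1$, $b$ the usual BM25 parameters:

$$
\text{score}(t, d) =
\underbrace{\ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)}_{\text{query weight (IDF)}}
\cdot
\underbrace{\frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}}_{\text{doc weight}}
$$

The first factor depends only on the token and corpus-wide counts, not on the document being scored, which is why it can be computed once per token per query and reused.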

What's good:

The math is preserved: score = query_weight(token) * doc_weight(freq, doc_length) matches the existing Scorer::score implementation
The test validates correctness by checking that scores match the expected IDF across partitions
Using HashMap<String, f32> for the cache is appropriate given query tokens are typically few

Minor observations (not blocking):

The cache is local to the search call, so no memory concerns across requests
For queries with many unique tokens, the cache provides linear speedup proportional to partition count
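
A minimal sketch of the caching idea, not the PR's actual code. The `Partition` struct, the function names, the Lucene-style IDF formula, and the `K1`/`B` constants are all assumptions made for illustration; only the `HashMap<String, f32>` cache keyed by token and the `query_weight * doc_weight` split come from the review above.

```rust
use std::collections::HashMap;

/// Hypothetical per-partition statistics (not Lance's real types).
struct Partition {
    num_docs: usize,
    /// token -> number of documents in this partition that contain it
    doc_freq: HashMap<String, usize>,
}

/// Lucene-style BM25 IDF from corpus-wide counts (assumed formula).
fn idf(total_docs: usize, docs_with_token: usize) -> f32 {
    let n = total_docs as f32;
    let df = docs_with_token as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Build the per-query cache: each unique token triggers exactly one
/// pass over all partitions, no matter how many documents are scored later.
fn build_idf_cache(partitions: &[Partition], query_tokens: &[&str]) -> HashMap<String, f32> {
    let total_docs: usize = partitions.iter().map(|p| p.num_docs).sum();
    let mut cache: HashMap<String, f32> = HashMap::new();
    for &token in query_tokens {
        if cache.contains_key(token) {
            continue; // repeated query token: already computed
        }
        let df: usize = partitions
            .iter()
            .map(|p| p.doc_freq.get(token).copied().unwrap_or(0))
            .sum();
        cache.insert(token.to_string(), idf(total_docs, df));
    }
    cache
}

/// Per-document part of BM25; K1 and B are common defaults, assumed here.
fn doc_weight(freq: f32, doc_len: f32, avg_doc_len: f32) -> f32 {
    const K1: f32 = 1.2;
    const B: f32 = 0.75;
    freq * (K1 + 1.0) / (freq + K1 * (1.0 - B + B * doc_len / avg_doc_len))
}

/// score = cached query weight * per-document weight.
/// Tokens missing from the cache contribute 0.
fn score(
    cache: &HashMap<String, f32>,
    token: &str,
    freq: f32,
    doc_len: f32,
    avg_doc_len: f32,
) -> f32 {
    let query_weight = cache.get(token).copied().unwrap_or(0.0);
    query_weight * doc_weight(freq, doc_len, avg_doc_len)
}
```

In this sketch, `build_idf_cache` would be called once at the start of a search and the resulting map passed to `score` for every candidate document in every partition, so the cross-partition scan happens once per unique token rather than once per scoring call.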

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

codecov · 2026-01-16T10:07:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.



Cache global BM25 idf per query

dc10fa3

github-actions Bot added the performance label Jan 16, 2026

fmt

99a9384

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Xuanwo approved these changes Jan 16, 2026


BubbleCal merged commit 7428ed4 into main Jan 16, 2026
45 of 48 checks passed

BubbleCal deleted the yang/cache-bm25-idf branch January 16, 2026 11:04

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed

BubbleCal merged 2 commits into main from yang/cache-bm25-idf

2 participants
