perf: materialize the tokens after WAND done #5572
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Code Review - This PR optimizes WAND search by deferring token materialization, reducing peak memory from 497 to 222 MiB.
- P1: The bounds check in index.rs silently skips invalid indices; consider making this a hard error.
- P1: Add a unit test for the token-to-index-to-token roundtrip.
- Note: The benchmark scripts are useful, but consider gitignoring them if they are not intended for CI.
Code Review

Summary: This PR optimizes memory usage in WAND FTS search by deferring token string materialization until after candidate selection. Instead of copying token strings for each candidate during the search, it stores term indices and builds a token lookup table per partition.

Observations

P1 - Potential out-of-bounds access in the flat_search path: the flat_search method in wand.rs was not updated to use the new iter_term_freqs pattern. It still uses iter_token_freqs and collects (token.to_owned(), freq) tuples (lines 397-400 and 525-528). This creates a mismatch: the DocCandidate::freqs type has changed to Vec<(u32, u32)>, but flat_search tries to store (String, u32). This should cause a compilation error; please verify that the code compiles and that flat_search was correctly updated. If I am reading the diff incorrectly and the current code compiles, please ignore this comment.

Minor Observations (non-blocking)
Overall the approach is sound: using term indices instead of strings during the hot path and materializing tokens once per partition is a clean optimization with significant memory savings (497 MiB to 222 MiB as reported).
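A minimal sketch of the deferred-materialization idea the review describes, assuming illustrative names (`DocCandidate`, `token_table`, `materialize_tokens` are placeholders, not the actual lance API): candidates store numeric term indices during the hot path, and token strings are looked up once per partition after candidates are selected.

```rust
// Hypothetical sketch: freqs holds (term_index, frequency) pairs instead of
// (String, frequency), so no token string is copied per candidate.
struct DocCandidate {
    doc_id: u64,
    freqs: Vec<(u32, u32)>,
}

// After candidate selection, map term indices back to token strings using a
// lookup table built once per partition. Borrowing &str from the table avoids
// allocating new strings here as well.
fn materialize_tokens<'a>(
    candidate: &DocCandidate,
    token_table: &'a [String],
) -> Vec<(&'a str, u32)> {
    candidate
        .freqs
        .iter()
        .map(|&(term_idx, freq)| (token_table[term_idx as usize].as_str(), freq))
        .collect()
}

fn main() {
    let token_table = vec!["apple".to_string(), "banana".to_string()];
    let cand = DocCandidate {
        doc_id: 7,
        freqs: vec![(1, 3), (0, 1)],
    };
    let toks = materialize_tokens(&cand, &token_table);
    println!("doc {} -> {:?}", cand.doc_id, toks);
}
```

The hot loop only touches 8-byte index pairs; string work is paid once per partition rather than once per candidate.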
```rust
let mut candidates = if limit <= BLOCK_SIZE * 10 {
    BinaryHeap::with_capacity(limit)
} else {
    BinaryHeap::new()
};
```
Even if the limit is quite large, starting with BinaryHeap::with_capacity(BLOCK_SIZE * 10) would still be a good approach.
Codecov Report: ❌ Patch coverage is
This avoids copying tokens for each candidate during WAND search; these small strings could lead to high memory usage when there are many partitions and the queries have a large limit.
Verified on a 1M dataset:
before:
Peak RSS delta: 497.09 MiB
after:
Peak RSS delta: 222.34 MiB
This also improves FTS performance by ~10% for such queries.
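The memory saving follows from the per-entry footprint: each `(String, u32)` pair carries a string header plus a separate heap allocation per token, while a `(u32, u32)` index pair is 8 bytes inline with no allocation. A quick check of the inline sizes (on a typical 64-bit target):

```rust
use std::mem::size_of;

fn main() {
    // A String is (ptr, capacity, len) = 3 usizes on 64-bit targets, and its
    // character data lives in a separate heap allocation on top of that.
    println!("String header: {} bytes", size_of::<String>());
    // An index pair is stored entirely inline.
    println!("(u32, u32): {} bytes", size_of::<(u32, u32)>());
}
```

Multiplied across many partitions and large per-query candidate sets, dropping both the per-token header and the per-token heap allocation accounts for the kind of peak-RSS reduction reported above.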