Skip to content

perf: materialize the tokens after WAND done#5572

Merged
BubbleCal merged 11 commits intomainfrom
fts-oom-investigation
Dec 24, 2025
Merged

perf: materialize the tokens after WAND done#5572
BubbleCal merged 11 commits intomainfrom
fts-oom-investigation

Conversation

@BubbleCal
Copy link
Copy Markdown
Contributor

@BubbleCal BubbleCal commented Dec 24, 2025

this avoids copying tokens for each candidate during WAND search, these small strings could lead high memory usage if we have many partitions and the queries are with large limit.

verified on a 1M dataset
before:
Peak RSS delta: 497.09 MiB

after:
Peak RSS delta: 222.34 MiB

this also improves the FTS performance by ~10% for such queries

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions
Copy link
Copy Markdown
Contributor

Code Review - PR optimizes WAND search by deferring token materialization, reducing memory from 497 to 222 MiB. P1: Bounds check in index.rs silently skips invalid indices - consider making this a hard error. P1: Add unit test for token-to-index-to-token roundtrip. Note: Benchmark scripts useful but consider gitignore if not for CI.

@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Summary: This PR optimizes memory usage in WAND FTS search by deferring token string materialization until after candidate selection. Instead of copying token strings for each candidate during the search, it stores term indices and builds a token lookup table per partition.

Observations

P1 - Potential out-of-bounds access in flat_search path

The flat_search method in wand.rs was not updated to use the new iter_term_freqs pattern. It still uses iter_token_freqs and collects (token.to_owned(), freq) tuples (lines 397-400 and 525-528). This creates a mismatch: the DocCandidate::freqs type has changed to Vec(u32, u32) but flat_search tries to store (String, u32).

This should cause a compilation error - please verify the code compiles and that flat_search was correctly updated. If I am reading the diff incorrectly and the current code compiles, please ignore this comment.

Minor Observations (non-blocking)

  1. The BinaryHeap::with_capacity optimization (line 374-378) is a nice touch for reducing allocations when limit is small.

  2. The debug_assert at line 310 in index.rs is good for catching out-of-bounds issues during development.

Overall the approach is sound - using term indices instead of strings during the hot path and materializing tokens once per partition is a clean optimization with significant memory savings (497MB to 222MB as reported).

@BubbleCal BubbleCal requested a review from Xuanwo December 24, 2025 08:05
Comment thread rust/lance-index/src/scalar/inverted/index.rs Outdated
let mut candidates = if limit <= BLOCK_SIZE * 10 {
BinaryHeap::with_capacity(limit)
} else {
BinaryHeap::new()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the limit is quite large, starting with BinaryHeap::with_capacity(BLOCK_SIZE * 10) is still a good approach.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal requested a review from Xuanwo December 24, 2025 08:28
@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 24, 2025

Codecov Report

❌ Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/index.rs 94.44% 0 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/inverted/wand.rs 93.75% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@BubbleCal BubbleCal merged commit 34e5a19 into main Dec 24, 2025
30 of 32 checks passed
@BubbleCal BubbleCal deleted the fts-oom-investigation branch December 24, 2025 09:19
BubbleCal added a commit that referenced this pull request Dec 24, 2025
this avoids copying tokens for each candidate during WAND search, these
small strings could lead high memory usage if we have many partitions
and the queries are with large limit.

verified on a 1M dataset
before:
Peak RSS delta: 497.09 MiB

after:
Peak RSS delta: 222.34 MiB

this also improves the FTS performance by ~10% for such queries

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
this avoids copying tokens for each candidate during WAND search, these
small strings could lead high memory usage if we have many partitions
and the queries are with large limit.

verified on a 1M dataset
before:
Peak RSS delta: 497.09 MiB

after:
Peak RSS delta: 222.34 MiB

this also improves the FTS performance by ~10% for such queries

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants