Skip to content

perf: use binary search to skip documents#5636

Merged
BubbleCal merged 8 commits intomainfrom
yang/inverted-query-opt
Jan 15, 2026
Merged

perf: use binary search to skip documents#5636
BubbleCal merged 8 commits intomainfrom
yang/inverted-query-opt

Conversation

@BubbleCal
Copy link
Copy Markdown
Contributor

@BubbleCal BubbleCal commented Jan 6, 2026

searching gains ~7% faster

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 6, 2026

Code Review

Summary

This PR optimizes the PostingIterator::next() method for compressed posting lists by replacing a linear scan with binary search (partition_point), yielding ~7% performance improvement for indexing/searching.

Issues

P1: Redundant block skipping logic

The new code at line 239-244 still performs the block-skipping loop before entering the new binary search loop. However, the new while-loop (lines 245-269) re-calculates block_idx = self.index / BLOCK_SIZE on each iteration and independently handles block advancement. This means the initial block-skipping loop's work may be immediately undone or duplicated. Consider whether the initial block-skipping loop (lines 238-244) is still necessary or if it can be removed to simplify the logic.

P1: Potential infinite loop if decompression produces empty doc_ids

In the new loop at line 261, if compressed.doc_ids is empty after decompression (which shouldn't happen normally but could occur with corrupted data), offset_in_block would be 0 and new_offset would equal block_offset. If new_offset >= compressed.doc_ids.len() (0 >= 0 is false), it would break out correctly. However, if there's ever a case where the block contains fewer docs than expected, the code should handle it gracefully. The current bounds check at line 263 mitigates this, but adding a defensive check or comment would improve robustness.

Minor Observation

The test coverage is good - it tests cross-block seeking and seeking past the end. Consider adding an edge case test for seeking within the same block (e.g., seek from index 5 to 7 within block 0) to ensure the partition_point works correctly within a single block without block transitions.

Overall, this is a solid performance optimization with appropriate test coverage.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

while self.index < length {
let block_idx = self.index / BLOCK_SIZE;
let block_offset = self.index % BLOCK_SIZE;
let compressed = unsafe {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the posintg lists are compressed in memory and we do decompress the needed blocks while searching

}
self.index = self.index.max(block_idx * BLOCK_SIZE);
let length = self.list.len();
while self.index < length && (self.doc().unwrap().doc_id() as u32) < least_id {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is our new PR faster? Is self.doc() a heavy operation? Would it be better to provide an API like compressed_doc(doc_index) instead of maintaining complex logic inside next?

Copy link
Copy Markdown
Contributor Author

@BubbleCal BubbleCal Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to use binary search to avoid scanning the entire block.

doc() is the most costly operation during FTS search because it decompresses the block if needed

@BubbleCal BubbleCal requested a review from Xuanwo January 8, 2026 09:21
@BubbleCal BubbleCal merged commit e82da34 into main Jan 15, 2026
29 checks passed
@BubbleCal BubbleCal deleted the yang/inverted-query-opt branch January 15, 2026 09:41
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
searching gains ~7% faster

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants