perf: use binary search to skip documents #5636
Conversation
**Code Review Summary**

This PR optimizes the […]

**Issues**

- **P1: Redundant block skipping logic.** The new code at lines 239-244 still performs the block-skipping loop before entering the new binary search loop. However, the new while-loop (lines 245-269) re-calculates […]
- **P1: Potential infinite loop if decompression produces empty doc_ids.** In the new loop at line 261, if […]

**Minor Observation**

The test coverage is good: it tests cross-block seeking and seeking past the end. Consider adding an edge-case test for seeking within the same block (e.g., seek from index 5 to 7 within block 0) to ensure `partition_point` works correctly within a single block without block transitions.

Overall, this is a solid performance optimization with appropriate test coverage.
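The within-block step the review refers to can be sketched as follows. This is a minimal illustration, not the PR's actual code: `seek_in_block`, `doc_ids`, and the `BLOCK_SIZE` value are hypothetical names, and the slice stands in for an already-decompressed block. It shows how `partition_point` binary-searches a single block for the first doc id >= the target.

```rust
// Hypothetical sketch of the within-block seek discussed in the review.
const BLOCK_SIZE: usize = 128;

// Find the absolute index of the first doc id >= `least_id`, searching
// only inside the block that contains `start_index`. If every id in the
// block is smaller, this returns the index one past the block's end.
fn seek_in_block(doc_ids: &[u32], start_index: usize, least_id: u32) -> usize {
    let block_idx = start_index / BLOCK_SIZE;
    let block_start = block_idx * BLOCK_SIZE;
    let block_end = (block_start + BLOCK_SIZE).min(doc_ids.len());
    let block = &doc_ids[block_start..block_end];
    // partition_point returns the first position where the predicate
    // flips from true to false, i.e. a binary search over the block.
    block_start + block.partition_point(|&id| id < least_id)
}
```

Seeking from index 5 to the first id >= 14 inside block 0 (the edge case the review suggests testing) never leaves the block, so no block transition occurs.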
Codecov Report: ✅ All modified and coverable lines are covered by tests.
```rust
while self.index < length {
    let block_idx = self.index / BLOCK_SIZE;
    let block_offset = self.index % BLOCK_SIZE;
    let compressed = unsafe {
```
The posting lists are compressed in memory, and we decompress the needed blocks while searching.
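A minimal sketch of that decompress-on-demand pattern, with hypothetical names (`PostingList`, `doc`, `decompress_count`) and plain `Vec`s standing in for real compressed bytes: the list keeps blocks compressed and only decompresses the block a lookup actually touches, caching the most recent one.

```rust
// Sketch of a posting list that decompresses blocks lazily. Names and
// representation are illustrative, not the crate's actual types.
const BLOCK_SIZE: usize = 4;

struct PostingList {
    // Each inner Vec stands in for one compressed block; a real index
    // would store bit-packed / delta-encoded bytes instead.
    compressed_blocks: Vec<Vec<u32>>,
    cached_block_idx: Option<usize>,
    cached_block: Vec<u32>,
    decompress_count: usize, // counts how often we pay the expensive step
}

impl PostingList {
    fn new(doc_ids: &[u32]) -> Self {
        PostingList {
            compressed_blocks: doc_ids.chunks(BLOCK_SIZE).map(|c| c.to_vec()).collect(),
            cached_block_idx: None,
            cached_block: Vec::new(),
            decompress_count: 0,
        }
    }

    // Return the doc id at `index`, decompressing its block only when it
    // is not the one already cached. Repeated lookups within one block
    // are cheap; crossing into a new block pays the decompression cost.
    fn doc(&mut self, index: usize) -> u32 {
        let block_idx = index / BLOCK_SIZE;
        if self.cached_block_idx != Some(block_idx) {
            // Stand-in for the real decompression step.
            self.cached_block = self.compressed_blocks[block_idx].clone();
            self.cached_block_idx = Some(block_idx);
            self.decompress_count += 1;
        }
        self.cached_block[index % BLOCK_SIZE]
    }
}
```

Under this model, two lookups in the same block trigger one decompression, while a lookup in a new block triggers another, which is why reducing per-document `doc()` calls matters.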
```rust
}
self.index = self.index.max(block_idx * BLOCK_SIZE);
let length = self.list.len();
while self.index < length && (self.doc().unwrap().doc_id() as u32) < least_id {
```
Why is the new PR faster? Is `self.doc()` a heavy operation? Would it be better to provide an API like `compressed_doc(doc_index)` instead of maintaining complex logic inside `next`?
The idea is to use binary search to avoid scanning the entire block.
`doc()` is the most costly operation during FTS search because it decompresses the block if needed.
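The combined strategy the author describes can be sketched roughly as: skip whole blocks by peeking only at each block's last (maximum) doc id, then binary-search inside the one block that can contain the target, instead of calling the costly per-document `doc()` once per posting. The function and names below are hypothetical, and the slice stands in for decompressed data.

```rust
// Illustrative seek: block-level skip, then binary search within the
// candidate block. Not the PR's actual implementation.
const BLOCK_SIZE: usize = 4;

fn seek(doc_ids: &[u32], least_id: u32) -> Option<usize> {
    let mut block_idx = 0;
    // Block-level skip: only the last id of each full block is examined,
    // so whole blocks are passed over without touching their contents.
    while (block_idx + 1) * BLOCK_SIZE <= doc_ids.len()
        && doc_ids[(block_idx + 1) * BLOCK_SIZE - 1] < least_id
    {
        block_idx += 1;
    }
    let start = block_idx * BLOCK_SIZE;
    if start >= doc_ids.len() {
        return None; // sought past the end of the list
    }
    let end = (start + BLOCK_SIZE).min(doc_ids.len());
    // Binary search within the single candidate block.
    let within = doc_ids[start..end].partition_point(|&id| id < least_id);
    let index = start + within;
    (index < doc_ids.len()).then_some(index)
}
```

Compared with the old linear `while … < least_id` scan, the inner loop here examines O(log BLOCK_SIZE) entries of the candidate block rather than every entry before the target.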
Searching becomes ~7% faster.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>