feat: add LSM scanner with point lookup and vector search support by touch-of-grey · Pull Request #5850 · lance-format/lance

touch-of-grey · 2026-01-29T07:16:26Z

Summary

LSM scanner for unified reads across base table and MemWAL regions
Point lookup planner with bloom filter guards and short-circuit evaluation
Vector search planner with staleness detection via bloom filters
Benchmarks for scan, point lookup, and vector search operations

Test plan

Unit tests for all planners and exec nodes (51 tests)
Clippy clean

🤖 Generated with Claude Code

This introduces an LSM (Log-Structured Merge) scanner that enables consistent reads across multiple data sources:
- Base table (merged data, generation=0)
- Flushed MemTables (persisted, generation=1,2,...)
- Active MemTable (in-memory, highest generation)

Key components:
- LsmScanner: High-level API for LSM reads with deduplication
- LsmDataSourceCollector: Collects data sources from base table and regions
- LsmScanPlanner: Builds execution plan with Union + Dedup
- DeduplicateExec: Deduplicates by PK, keeping highest generation
- GenerationTagExec: Adds _gen and _rowaddr columns for dedup ordering

Also includes:
- mem_wal_read benchmark with DATASET_PREFIX support for S3 testing
- active_memtable_ref() method on RegionWriter for LSM integration
- Documentation fixes for generation numbering (unsigned, base=0)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
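To make the Union + Dedup step concrete, here is a minimal sketch of deduplicating by primary key while keeping the row from the highest generation. It is not the `DeduplicateExec` implementation: `TaggedRow`, `dedup_latest`, and the tie-break on row address are assumptions made for illustration, and the real node operates on Arrow batches inside a DataFusion plan.

```rust
use std::collections::HashMap;

/// One row as seen after the union of all sources. `generation` and
/// `row_addr` mirror the `_gen` and `_rowaddr` tag columns; the struct
/// itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct TaggedRow {
    pk: u64,
    generation: u64, // 0 = base table, larger = newer
    row_addr: u64,
    value: String,
}

/// Keep one row per primary key: the one with the highest generation.
/// Ties within a generation are broken by the highest row address,
/// which is an assumption of this sketch.
fn dedup_latest(rows: Vec<TaggedRow>) -> Vec<TaggedRow> {
    let mut latest: HashMap<u64, TaggedRow> = HashMap::new();
    for row in rows {
        let is_newer = match latest.get(&row.pk) {
            Some(cur) => (row.generation, row.row_addr) > (cur.generation, cur.row_addr),
            None => true,
        };
        if is_newer {
            latest.insert(row.pk, row);
        }
    }
    // Sort by primary key so the output order is deterministic.
    let mut out: Vec<TaggedRow> = latest.into_values().collect();
    out.sort_by_key(|r| r.pk);
    out
}
```

Calling `dedup_latest` on rows with the same `pk` in generations 0, 1, and 2 returns only the generation 2 row for that key.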

jackye1995 · 2026-01-29T07:19:43Z

I have been thinking about this for the past 2 days. Having to read each MemTable and then reverse the result feels very inefficient to me. I think I have a good way to solve it now:

When we scan MemTable, everything is in memory, reverse scan is fine. So when we flush MemTable, we should read the whole BatchStore in reverse order. This means the indexes also need to reverse the row position mapping, so the new row position is length_of_batch_store - current_position - 1. By doing so, all the flushed MemTables are ordered from newest to oldest, not oldest to newest, so we can do the K-way merge much more efficiently.

What do you think?

touch-of-grey · 2026-01-29T07:20:28Z

Makes sense! Let me try updating based on the current draft.

When flushing MemTable to disk, write data in reverse order (newest to oldest) so flushed generations are pre-sorted for K-way merge during LSM scan. This eliminates the need to reverse data during reads.

Key changes:
- BatchStore: add to_vec_reversed() that reverses batch order and rows
- MemTable: add scan_batches_reversed() returning (batches, total_rows)
- Flush: use reversed batches and pass total_rows to index creation
- BTree index: add to_training_batches_reversed() with mapped positions
- IVF-PQ index: add to_partition_batches_reversed() with mapped positions

Row position mapping formula: flushed_pos = total_rows - original_pos - 1

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
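The reversal and the position mapping can be checked with a small sketch. It models each batch as a plain `Vec<T>` rather than an Arrow `RecordBatch`, and the function names `flushed_pos` and `reverse_batches` are hypothetical; only the formula itself comes from the commit message.

```rust
/// Map a row position in the in-memory store to its position after a
/// reversed flush: flushed_pos = total_rows - original_pos - 1.
/// Returns None for an out-of-range position instead of underflowing.
fn flushed_pos(total_rows: usize, original_pos: usize) -> Option<usize> {
    if original_pos < total_rows {
        Some(total_rows - original_pos - 1)
    } else {
        None
    }
}

/// Reverse both the batch order and the rows inside each batch, and
/// report the total row count needed by the position mapping.
fn reverse_batches<T: Clone>(batches: &[Vec<T>]) -> (Vec<Vec<T>>, usize) {
    let total_rows: usize = batches.iter().map(|b| b.len()).sum();
    let reversed: Vec<Vec<T>> = batches
        .iter()
        .rev()
        .map(|b| b.iter().rev().cloned().collect::<Vec<T>>())
        .collect();
    (reversed, total_rows)
}
```

For batches `[[10, 11, 12], [13, 14]]` the reversed output is `[[14, 13], [12, 11, 10]]`, and the row originally at position 0 lands at position 4. Any index that stores row positions has to apply the same mapping, which is why `total_rows` is passed to index creation.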

When flushing MemTable to disk, write FTS index files directly from the in-memory FTS index without re-tokenizing the documents. This avoids duplicate tokenization work during flush.

Key changes:
- FtsMemIndex: add to_index_builder_reversed() that exports index data with reversed row positions for proper LSM ordering
- InnerBuilder: add set_tokens/set_docs/set_posting_lists setters
- InvertedIndexParams: add has_positions() getter
- Flush: create_fts_indexes() now uses direct flush from in-memory data and properly commits index metadata to dataset manifest

Row position mapping formula: flushed_pos = total_rows - original_pos - 1

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codecov · 2026-01-30T02:24:27Z

Codecov Report

❌ Patch coverage is 78.89391% with 748 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...lance/src/dataset/mem_wal/scanner/vector_search.rs	58.29%	95 Missing and 3 partials ⚠️
...e/src/dataset/mem_wal/scanner/exec/filter_stale.rs	74.92%	80 Missing and 8 partials ⚠️
...ce/src/dataset/mem_wal/scanner/exec/deduplicate.rs	81.73%	72 Missing and 10 partials ⚠️
...ce/src/dataset/mem_wal/scanner/exec/bloom_guard.rs	63.33%	75 Missing and 2 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/planner.rs	89.63%	54 Missing and 22 partials ⚠️
.../lance/src/dataset/mem_wal/scanner/point_lookup.rs	74.28%	51 Missing and 12 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/builder.rs	69.87%	43 Missing and 4 partials ⚠️
...ust/lance/src/dataset/mem_wal/scanner/collector.rs	64.92%	47 Missing ⚠️
...src/dataset/mem_wal/scanner/exec/coalesce_first.rs	85.38%	31 Missing and 1 partial ⚠️
...src/dataset/mem_wal/scanner/exec/generation_tag.rs	81.37%	25 Missing and 2 partials ⚠️
... and 10 more


- Change `to_vec_reversed()` to return `Result` instead of panicking on Arrow take kernel or RecordBatch creation errors
- Replace `expect()` calls in `to_index_builder_reversed()` with proper `Error::io` returns for defensive error handling
- Update callers to propagate errors appropriately

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add specialized query planners for efficient point lookups and vector search across LSM levels:
- LsmPointLookupPlanner: Primary key-based lookups with bloom filter guards and short-circuit evaluation (newest-first ordering)
- LsmVectorSearchPlanner: KNN search with staleness detection using bloom filters, fast_search for indexed data only

New DataFusion ExecutionPlan nodes:
- BloomFilterGuardExec: Skip generations that don't contain the key
- CoalesceFirstExec: Return first non-empty result with short-circuit
- FilterStaleExec: Filter stale results using bloom filters

Also adds benchmarks for point lookup and vector search operations.

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
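The point lookup flow (bloom filter guard, newest-first ordering, short-circuit on the first hit) can be sketched as below. This is an illustration under stated assumptions, not the planner's code: `Level` and `point_lookup` are hypothetical names, and a `HashSet` stands in for the bloom filter, so it may contain keys that are absent from `rows` (false positives) but never omits a key that is present.

```rust
use std::collections::{HashMap, HashSet};

/// One LSM level. `maybe_keys` plays the role of a bloom filter:
/// "not contained" means the key is definitely absent from `rows`.
struct Level {
    generation: u64, // 0 = base table, larger = newer
    maybe_keys: HashSet<u64>,
    rows: HashMap<u64, String>,
}

/// Probe levels newest-first. A level is skipped when its filter rules
/// the key out, and the search stops at the first level holding the key.
/// Returns the hit (generation, value) and how many levels were read.
fn point_lookup(levels: &mut [Level], pk: u64) -> (Option<(u64, String)>, usize) {
    // Highest generation first, so the first hit is the latest version.
    levels.sort_by(|a, b| b.generation.cmp(&a.generation));
    let mut levels_read = 0;
    for level in levels.iter() {
        if !level.maybe_keys.contains(&pk) {
            continue; // guard: skip without reading the level
        }
        levels_read += 1;
        if let Some(value) = level.rows.get(&pk) {
            return (Some((level.generation, value.clone())), levels_read);
        }
        // Filter false positive: fall through to the next older level.
    }
    (None, levels_read)
}
```

With a newest-first probe, a key updated in a recent generation is answered without touching the base table, and a filter false positive costs only one extra level read.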

jackye1995 · 2026-02-06T08:07:02Z

Sorry for the late review, I made some minor edits, and I think this is good to go now!

jackye1995

as discussed, let's work on FTS separately since it requires changing the BM25 to be supplied externally.

touch-of-grey and others added 2 commits January 26, 2026 19:03

feat: add LSM scanner to merge read MemWAL regions

c8b5db6

github-actions Bot added the enhancement New feature or request label Jan 29, 2026

touch-of-grey and others added 2 commits January 29, 2026 17:18

touch-of-grey and others added 2 commits February 1, 2026 22:40

touch-of-grey changed the title ~~feat: add LSM scanner to merge read MemWAL regions~~ feat: add LSM scanner with point lookup and vector search support Feb 6, 2026

jackye1995 approved these changes Feb 6, 2026

jackye1995 merged commit 37dfddc into lance-format:main Feb 6, 2026
32 checks passed

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed

jackye1995 merged 6 commits into lance-format:main from touch-of-grey:LsmScanQueryPlan
