
feat: support multiple index segments for FTS indices #6285

Draft
wjones127 wants to merge 3 commits into lance-format:main from wjones127:feat/fts-index-segments


Conversation

@wjones127 (Contributor)

Summary

  • FTS indices now support multiple segments via optimize_indices, matching IVF_PQ vector index behavior
  • optimize_indices(append()) creates a new segment for unindexed data without rebuilding existing segments
  • optimize_indices(merge(N)) merges the N most recent segments into one
  • All FTS query types (Match, Phrase, Boolean, Boost) search across all segments with unified cross-segment BM25 scoring (IDF reweighting), so multi-segment results are identical to single-segment

Previously, the scalar branch of merge_indices hardcoded indices_merged = 1, always merging everything into one index. Now it respects OptimizeOptions::num_indices_to_merge.
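As a rough, self-contained sketch of the `merge(N)` semantics described above (`Segment` and `merge_last_n` are made-up names for illustration, not the actual lance types):

```rust
// Simplified sketch: given segments ordered oldest-to-newest, merge the
// N most recent into one. `Segment` is a stand-in for real index
// metadata; only the row counts matter here.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    rows: u64,
}

fn merge_last_n(mut segments: Vec<Segment>, n: usize) -> Vec<Segment> {
    let n = n.min(segments.len());
    if n <= 1 {
        return segments; // nothing to merge
    }
    // Split off the N most recent segments and replace them with one.
    let tail: Vec<Segment> = segments.split_off(segments.len() - n);
    let merged = Segment {
        rows: tail.iter().map(|s| s.rows).sum(),
    };
    segments.push(merged);
    segments
}

fn main() {
    // 3 segments -> merge(2) -> 2 segments, as in the partial-merge test.
    let segs = vec![
        Segment { rows: 100 },
        Segment { rows: 50 },
        Segment { rows: 25 },
    ];
    let out = merge_last_n(segs, 2);
    assert_eq!(out.len(), 2);
    assert_eq!(out[1].rows, 75);
    println!("{:?}", out);
}
```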

Test plan

  • test_fts_append_creates_separate_segment — append creates 2 disjoint segments
  • test_fts_multi_segment_score_equivalence — identical BM25 scores for 6 queries across single vs multi-segment
  • test_fts_merge_combines_segments — merge(2) reduces 2→1 segment, results unchanged
  • test_fts_three_segments_partial_merge — 3 segments → merge(2) → 2 segments
  • test_fts_multi_segment_with_unindexed_data — 2 segments + unindexed flat scan
  • test_fts_multi_segment_phrase_query — phrase queries across segments
  • All existing inverted index tests pass

🤖 Generated with Claude Code

Previously, FTS (inverted) indices always operated as a single segment.
`optimize_indices` would always merge everything into one index, and
queries loaded exactly one segment. This mirrored how scalar indices
worked, but differed from IVF_PQ vector indices, which already supported
multiple segments.

This adds multi-segment support for FTS indices:

- **Creating segments**: `optimize_indices(append())` creates a new FTS
  segment covering only unindexed data, without rebuilding existing
  segments. The scalar branch of `merge_indices` now respects
  `num_indices_to_merge` instead of hardcoding `indices_merged = 1`.

- **Merging segments**: `optimize_indices(merge(N))` merges the N most
  recent FTS segments into one, matching the vector index behavior.

- **Querying segments**: All FTS query types (Match, Phrase, Boolean,
  Boost) search across all segments with unified BM25 scoring.
  `multi_index_bm25_search` collects partitions from all segments into
  a single `IndexBM25Scorer` for cross-segment IDF reweighting, ensuring
  results are identical to a single-segment index. Each segment's
  `deleted_fragments` bitmap is unioned into the prefilter.
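Cross-segment IDF reweighting works because document counts are additive across segments, so a unified scorer can sum them before computing IDF. A minimal sketch of the arithmetic (using the common BM25 "+1" idf form; `IndexBM25Scorer` internals are not reproduced here):

```rust
// Each segment reports (total docs, docs containing the term). A unified
// scorer sums the counts across segments first, so the resulting IDF is
// exactly what a single index over all the data would compute.
fn idf(num_docs: u64, docs_with_term: u64) -> f32 {
    let n = num_docs as f32;
    let df = docs_with_term as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

fn main() {
    // Two segments: (1000 docs, term in 40) and (500 docs, term in 10).
    let segs = [(1000u64, 40u64), (500, 10)];
    let (n, df) = segs
        .iter()
        .fold((0u64, 0u64), |(n, df), (a, b)| (n + a, df + b));
    let unified = idf(n, df);
    // Equals the IDF a single index over the same 1500 docs would use:
    assert!((unified - idf(1500, 50)).abs() < 1e-6);
    // ...and differs from a per-segment value, which a naive merge of
    // independently-scored results would have used.
    assert!((unified - idf(1000, 40)).abs() > 1e-3);
    println!("unified idf = {unified}");
}
```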

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement (New feature or request) label on Mar 25, 2026
@github-actions (bot)

PR Review: feat: support multiple index segments for FTS indices

P0: Prefilter built from incomplete metadata (correctness bug)

In both MatchQueryExec::execute and PhraseQueryExec::execute, after calling load_all_fts_segments() (which internally loads all segment metadata), the code calls load_scalar_index() again and passes the result to build_prefilter(). But load_scalar_index returns Option<IndexMetadata> -- at most ONE segment's metadata. The .into_iter().collect() produces a Vec of 0 or 1 elements, so the prefilter is built only from the first segment's fragment bitmap, not from all segments.

DatasetPreFilter::new() uses the metadata's fragment_bitmap to determine which fragments need deletion masking. With multi-segment indices, fragments covered by segments other than the first will not have their deletion vectors loaded, so deleted rows from those fragments could appear in query results.

Fix: Have load_all_fts_segments return the Vec of IndexMetadata alongside the indices (it already loads this internally), and pass the full metadata list to build_prefilter. This also eliminates the redundant load_scalar_index call (P1 below).
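A minimal sketch of the fix's effect, with `FragBitmap` and `SegmentMeta` as stand-ins for the real roaring-bitmap-backed metadata types:

```rust
use std::collections::BTreeSet;

// Sketch of the proposed fix: build the prefilter's fragment coverage
// from every segment's metadata, not just the first. `FragBitmap` and
// `SegmentMeta` are illustrative stand-ins, not lance types.
type FragBitmap = BTreeSet<u32>;

struct SegmentMeta {
    fragment_bitmap: FragBitmap,
}

fn covered_fragments(segments: &[SegmentMeta]) -> FragBitmap {
    segments
        .iter()
        .flat_map(|s| s.fragment_bitmap.iter().copied())
        .collect()
}

fn main() {
    let segs = vec![
        SegmentMeta { fragment_bitmap: FragBitmap::from([0, 1]) },
        SegmentMeta { fragment_bitmap: FragBitmap::from([2, 3]) },
    ];
    // Taking only segs[0] (the old behavior) would miss fragments 2 and 3,
    // so deletions in those fragments would leak into query results.
    let all = covered_fragments(&segs);
    assert_eq!(all, FragBitmap::from([0, 1, 2, 3]));
}
```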

P1: Redundant index metadata loading

load_all_fts_segments internally calls load_scalar_index + load_indices_by_name, loading all segment metadata. Then both MatchQueryExec and PhraseQueryExec call load_scalar_index again, doubling the metadata load cost. Returning the metadata from load_all_fts_segments fixes both this and the P0 above.

Minor: Redundant frag_bitmap OR in merge path

In append.rs, the merge branch (num_to_merge > 0) does frag_bitmap |= effective_old_frags inside the match arm. Then, after the match, the generic code at lines 302-306 ORs the same fragments again via removed_indices. Since bitmap OR is idempotent this is correct, but the duplication could confuse future readers.

Tests

Good coverage of append, merge, partial merge, unindexed data, and phrase queries across segments. The score equivalence test is a nice touch. However, none of the tests exercise the deletion path across multi-segment indices (delete rows, then query), which is exactly where the P0 bug above would manifest. Consider adding a test that deletes rows from data covered by a non-first segment and verifies they do not appear in results.

wjones127 and others added 2 commits March 24, 2026 20:43
- append.rs multi-merge: filter by effective_old_frags (not frag_bitmap)
  when collecting fragments to rebuild from. frag_bitmap already includes
  unindexed fragment IDs, so filtering by it and then extend_from_slice(unindexed)
  produced duplicate fragments.

- fts.rs: load_all_fts_segments now returns Vec<IndexMetadata> alongside the
  opened indices. Callers (MatchQueryExec, PhraseQueryExec) use it directly
  for build_prefilter instead of calling load_scalar_index again, which was
  a redundant RPC and only passed one segment's metadata to build_prefilter
  instead of all segments'.
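The duplicate-fragment bug in the first bullet can be illustrated with a small sketch (`rebuild_list` and the variable names are invented for this example; the real code filters fragment metadata in append.rs):

```rust
use std::collections::BTreeSet;

// `frag_bitmap` already contains the unindexed fragment ids, so
// filtering by it and then appending `unindexed` duplicated those ids;
// filtering by the old segments' coverage (`effective_old_frags`) does not.
fn rebuild_list(
    all_fragments: &[u32],
    filter: &BTreeSet<u32>,
    unindexed: &[u32],
) -> Vec<u32> {
    let mut out: Vec<u32> = all_fragments
        .iter()
        .copied()
        .filter(|f| filter.contains(f))
        .collect();
    out.extend_from_slice(unindexed); // mirrors the extend_from_slice call
    out
}

fn main() {
    let all = [0u32, 1, 2, 3];
    let unindexed = [2u32, 3];
    let effective_old_frags = BTreeSet::from([0u32, 1]);
    let frag_bitmap: FragSet =
        effective_old_frags.iter().copied().chain(unindexed).collect();

    let buggy = rebuild_list(&all, &frag_bitmap, &unindexed);
    let fixed = rebuild_list(&all, &effective_old_frags, &unindexed);
    assert_eq!(buggy, vec![0, 1, 2, 3, 2, 3]); // fragments 2 and 3 twice
    assert_eq!(fixed, vec![0, 1, 2, 3]); // each fragment exactly once
}

type FragSet = BTreeSet<u32>;
```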

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, multi_index_bm25_search built a HashMap<String, f32> idf_cache
and a Vec<f32> idf_by_position per partition result in the streaming loop.
Since all partitions share the same query Tokens with fixed term positions,
IDF weights can be precomputed once by position before the loop.
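A sketch of that precomputation (hypothetical names and toy counts; the real code operates on the query Tokens and per-partition candidates):

```rust
use std::collections::HashMap;

// Instead of building a HashMap<String, f32> idf cache per partition,
// compute idf once per query-token position, since every partition
// shares the same query tokens in the same positions.
fn idf(num_docs: u64, docs_with_term: u64) -> f32 {
    ((num_docs as f32 - docs_with_term as f32 + 0.5)
        / (docs_with_term as f32 + 0.5)
        + 1.0)
        .ln()
}

fn main() {
    let num_docs = 1000u64;
    let doc_freq: HashMap<&str, u64> =
        HashMap::from([("lance", 40), ("index", 200)]);
    let query_tokens = ["lance", "index", "lance"]; // fixed positions

    // Computed once, outside the per-partition merge loop:
    let idf_by_position: Vec<f32> = query_tokens
        .iter()
        .copied()
        .map(|t| idf(num_docs, doc_freq[t]))
        .collect();

    assert_eq!(idf_by_position.len(), 3);
    // The same token at positions 0 and 2 gets the same weight.
    assert_eq!(idf_by_position[0], idf_by_position[2]);
    println!("{idf_by_position:?}");
}
```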

This eliminates:
- idf_cache HashMap allocation and per-partition lookups/clones
- per-partition idf_by_position Vec allocation
- PartitionCandidates::tokens_by_position (no longer needed to carry
  token strings from the CPU task back to the merge loop)
- PostingIterator::token field and getter (only caller was
  tokens_by_position construction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>