feat: support multiple index segments for FTS indices#6285
feat: support multiple index segments for FTS indices#6285wjones127 wants to merge 3 commits intolance-format:mainfrom
Conversation
Previously, FTS (inverted) indices always operated as a single segment. `optimize_indices` would always merge everything into one index, and queries loaded exactly one segment. This mirrored how scalar indices worked but differed from IVF_PQ vector indices which already supported multiple segments. This adds multi-segment support for FTS indices: - **Creating segments**: `optimize_indices(append())` creates a new FTS segment covering only unindexed data, without rebuilding existing segments. The scalar branch of `merge_indices` now respects `num_indices_to_merge` instead of hardcoding `indices_merged = 1`. - **Merging segments**: `optimize_indices(merge(N))` merges the N most recent FTS segments into one, matching the vector index behavior. - **Querying segments**: All FTS query types (Match, Phrase, Boolean, Boost) search across all segments with unified BM25 scoring. `multi_index_bm25_search` collects partitions from all segments into a single `IndexBM25Scorer` for cross-segment IDF reweighting, ensuring results are identical to a single-segment index. Each segment's `deleted_fragments` bitmap is unioned into the prefilter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
PR Review: feat: support multiple index segments for FTS indices P0: Prefilter built from incomplete metadata (correctness bug) In both MatchQueryExec::execute and PhraseQueryExec::execute, after calling load_all_fts_segments() (which internally loads all segment metadata), the code calls load_scalar_index() again and passes the result to build_prefilter(). But load_scalar_index returns Option of IndexMetadata -- at most ONE segment metadata. The .into_iter().collect() produces a Vec of 0 or 1 elements. So the prefilter is only built from the first segment fragment bitmap, not all segments. DatasetPreFilter::new() uses the metadata fragment_bitmap to determine which fragments need deletion masking. With multi-segment indices, fragments covered by non-first segments will not have their deletion vectors loaded, so deleted rows from those fragments could appear in query results. Fix: Have load_all_fts_segments return the Vec of IndexMetadata alongside the indices (it already loads this internally), and pass the full metadata list to build_prefilter. This also eliminates the redundant load_scalar_index call (P1 below). P1: Redundant index metadata loading load_all_fts_segments internally calls load_scalar_index + load_indices_by_name, loading all metadata. Then both MatchQueryExec and PhraseQueryExec call load_scalar_index again. This doubles the metadata load cost. Returning the metadata from load_all_fts_segments fixes both the P0 and this. Minor: Redundant frag_bitmap OR in merge path In append.rs, the merge branch (num_to_merge > 0) does frag_bitmap |= effective_old_frags inside the match arm. Then after the match, the generic code at line 302-306 ORs the same fragments again via removed_indices. Since bitmap OR is idempotent this is correct, but the duplication could cause confusion for future readers. Tests Good coverage of append, merge, partial merge, unindexed data, and phrase queries across segments. The score equivalence test is a nice touch. However, none of the tests exercise the deletion path across multi-segment indices (delete rows, then query), which is exactly where the P0 bug above would manifest. Consider adding a test that deletes rows from data covered by a non-first segment and verifies they do not appear in results. |
- append.rs multi-merge: filter by effective_old_frags (not frag_bitmap) when collecting fragments to rebuild from. frag_bitmap already includes unindexed fragment IDs, so filtering by it and then extend_from_slice(unindexed) produced duplicate fragments. - fts.rs: load_all_fts_segments now returns Vec<IndexMetadata> alongside the opened indices. Callers (MatchQueryExec, PhraseQueryExec) use it directly for build_prefilter instead of calling load_scalar_index again, which was a redundant RPC and only passed one segment's metadata to build_prefilter instead of all segments'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, multi_index_bm25_search built a HashMap<String, f32> idf_cache and a Vec<f32> idf_by_position per partition result in the streaming loop. Since all partitions share the same query Tokens with fixed term positions, IDF weights can be precomputed once by position before the loop. This eliminates: - idf_cache HashMap allocation and per-partition lookups/clones - per-partition idf_by_position Vec allocation - PartitionCandidates::tokens_by_position (no longer needed to carry token strings from the CPU task back to the merge loop) - PostingIterator::token field and getter (only caller was tokens_by_position construction) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
optimize_indices, matching IVF_PQ vector index behavioroptimize_indices(append())creates a new segment for unindexed data without rebuilding existing segmentsoptimize_indices(merge(N))merges the N most recent segments into onePreviously, the scalar branch of
merge_indiceshardcodedindices_merged = 1, always merging everything into one index. Now it respectsOptimizeOptions::num_indices_to_merge.Test plan
test_fts_append_creates_separate_segment— append creates 2 disjoint segmentstest_fts_multi_segment_score_equivalence— identical BM25 scores for 6 queries across single vs multi-segmenttest_fts_merge_combines_segments— merge(2) reduces 2→1 segment, results unchangedtest_fts_three_segments_partial_merge— 3 segments → merge(2) → 2 segmentstest_fts_multi_segment_with_unindexed_data— 2 segments + unindexed flat scantest_fts_multi_segment_phrase_query— phrase queries across segments🤖 Generated with Claude Code