
feat: support multiple index segments for FTS indices #6285

Draft
wjones127 wants to merge 3 commits into lance-format:main from wjones127:feat/fts-index-segments


Conversation

@wjones127 (Contributor)

Summary

  • FTS indices now support multiple segments via optimize_indices, matching IVF_PQ vector index behavior
  • optimize_indices(append()) creates a new segment for unindexed data without rebuilding existing segments
  • optimize_indices(merge(N)) merges the N most recent segments into one
  • All FTS query types (Match, Phrase, Boolean, Boost) search across all segments with unified cross-segment BM25 scoring (IDF reweighting), so multi-segment results are identical to single-segment

Previously, the scalar branch of merge_indices hardcoded indices_merged = 1, always merging everything into one index. Now it respects OptimizeOptions::num_indices_to_merge.
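As a rough, self-contained sketch of the `merge(N)` semantics described above (`Segment` and `merge_last_n` are made-up names for illustration, not the actual lance types):

```rust
// Simplified sketch: given segments ordered oldest-to-newest, merge the
// N most recent into one. `Segment` is a stand-in for real index
// metadata; only the row counts matter here.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    rows: u64,
}

fn merge_last_n(mut segments: Vec<Segment>, n: usize) -> Vec<Segment> {
    let n = n.min(segments.len());
    if n <= 1 {
        return segments; // nothing to merge
    }
    // Split off the N most recent segments and replace them with one.
    let tail: Vec<Segment> = segments.split_off(segments.len() - n);
    let merged = Segment {
        rows: tail.iter().map(|s| s.rows).sum(),
    };
    segments.push(merged);
    segments
}

fn main() {
    // 3 segments -> merge(2) -> 2 segments, as in the partial-merge test.
    let segs = vec![
        Segment { rows: 100 },
        Segment { rows: 50 },
        Segment { rows: 25 },
    ];
    let out = merge_last_n(segs, 2);
    assert_eq!(out.len(), 2);
    assert_eq!(out[1].rows, 75);
    println!("{:?}", out);
}
```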

Test plan

  • test_fts_append_creates_separate_segment — append creates 2 disjoint segments
  • test_fts_multi_segment_score_equivalence — identical BM25 scores for 6 queries across single vs multi-segment
  • test_fts_merge_combines_segments — merge(2) reduces 2→1 segment, results unchanged
  • test_fts_three_segments_partial_merge — 3 segments → merge(2) → 2 segments
  • test_fts_multi_segment_with_unindexed_data — 2 segments + unindexed flat scan
  • test_fts_multi_segment_phrase_query — phrase queries across segments
  • All existing inverted index tests pass

🤖 Generated with Claude Code

Previously, FTS (inverted) indices always operated as a single segment.
`optimize_indices` would always merge everything into one index, and
queries loaded exactly one segment. This mirrored how scalar indices
worked, but differed from IVF_PQ vector indices, which already supported
multiple segments.

This adds multi-segment support for FTS indices:

- **Creating segments**: `optimize_indices(append())` creates a new FTS
  segment covering only unindexed data, without rebuilding existing
  segments. The scalar branch of `merge_indices` now respects
  `num_indices_to_merge` instead of hardcoding `indices_merged = 1`.

- **Merging segments**: `optimize_indices(merge(N))` merges the N most
  recent FTS segments into one, matching the vector index behavior.

- **Querying segments**: All FTS query types (Match, Phrase, Boolean,
  Boost) search across all segments with unified BM25 scoring.
  `multi_index_bm25_search` collects partitions from all segments into
  a single `IndexBM25Scorer` for cross-segment IDF reweighting, ensuring
  results are identical to a single-segment index. Each segment's
  `deleted_fragments` bitmap is unioned into the prefilter.
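Cross-segment IDF reweighting works because document counts are additive across segments, so a unified scorer can sum them before computing IDF. A minimal sketch of the arithmetic (using the common BM25 "+1" idf form; `IndexBM25Scorer` internals are not reproduced here):

```rust
// Each segment reports (total docs, docs containing the term). A unified
// scorer sums the counts across segments first, so the resulting IDF is
// exactly what a single index over all the data would compute.
fn idf(num_docs: u64, docs_with_term: u64) -> f32 {
    let n = num_docs as f32;
    let df = docs_with_term as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

fn main() {
    // Two segments: (1000 docs, term in 40) and (500 docs, term in 10).
    let segs = [(1000u64, 40u64), (500, 10)];
    let (n, df) = segs
        .iter()
        .fold((0u64, 0u64), |(n, df), (a, b)| (n + a, df + b));
    let unified = idf(n, df);
    // Equals the IDF a single index over the same 1500 docs would use:
    assert!((unified - idf(1500, 50)).abs() < 1e-6);
    // ...and differs from a per-segment value, which a naive merge of
    // independently-scored results would have used.
    assert!((unified - idf(1000, 40)).abs() > 1e-3);
    println!("unified idf = {unified}");
}
```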

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement (New feature or request) label on Mar 25, 2026
@github-actions (bot)

PR Review: feat: support multiple index segments for FTS indices

P0: Prefilter built from incomplete metadata (correctness bug)

In both MatchQueryExec::execute and PhraseQueryExec::execute, after calling load_all_fts_segments() (which internally loads all segment metadata), the code calls load_scalar_index() again and passes the result to build_prefilter(). But load_scalar_index returns Option<IndexMetadata> -- at most ONE segment's metadata. The .into_iter().collect() produces a Vec of 0 or 1 elements, so the prefilter is built only from the first segment's fragment bitmap, not from all segments.

DatasetPreFilter::new() uses the metadata's fragment_bitmap to determine which fragments need deletion masking. With multi-segment indices, fragments covered by segments other than the first will not have their deletion vectors loaded, so deleted rows from those fragments could appear in query results.

Fix: Have load_all_fts_segments return the Vec of IndexMetadata alongside the indices (it already loads this internally), and pass the full metadata list to build_prefilter. This also eliminates the redundant load_scalar_index call (P1 below).
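A minimal sketch of the fix's effect, with `FragBitmap` and `SegmentMeta` as stand-ins for the real roaring-bitmap-backed metadata types:

```rust
use std::collections::BTreeSet;

// Sketch of the proposed fix: build the prefilter's fragment coverage
// from every segment's metadata, not just the first. `FragBitmap` and
// `SegmentMeta` are illustrative stand-ins, not lance types.
type FragBitmap = BTreeSet<u32>;

struct SegmentMeta {
    fragment_bitmap: FragBitmap,
}

fn covered_fragments(segments: &[SegmentMeta]) -> FragBitmap {
    segments
        .iter()
        .flat_map(|s| s.fragment_bitmap.iter().copied())
        .collect()
}

fn main() {
    let segs = vec![
        SegmentMeta { fragment_bitmap: FragBitmap::from([0, 1]) },
        SegmentMeta { fragment_bitmap: FragBitmap::from([2, 3]) },
    ];
    // Taking only segs[0] (the old behavior) would miss fragments 2 and 3,
    // so deletions in those fragments would leak into query results.
    let all = covered_fragments(&segs);
    assert_eq!(all, FragBitmap::from([0, 1, 2, 3]));
}
```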

P1: Redundant index metadata loading

load_all_fts_segments internally calls load_scalar_index + load_indices_by_name, loading all segment metadata. Then both MatchQueryExec and PhraseQueryExec call load_scalar_index again, doubling the metadata load cost. Returning the metadata from load_all_fts_segments fixes both this and the P0 above.

Minor: Redundant frag_bitmap OR in merge path

In append.rs, the merge branch (num_to_merge > 0) does frag_bitmap |= effective_old_frags inside the match arm. Then, after the match, the generic code at lines 302-306 ORs the same fragments again via removed_indices. Since bitmap OR is idempotent this is correct, but the duplication could confuse future readers.

Tests

Good coverage of append, merge, partial merge, unindexed data, and phrase queries across segments. The score equivalence test is a nice touch. However, none of the tests exercise the deletion path across multi-segment indices (delete rows, then query), which is exactly where the P0 bug above would manifest. Consider adding a test that deletes rows from data covered by a non-first segment and verifies they do not appear in results.

wjones127 and others added 2 commits March 24, 2026 20:43
- append.rs multi-merge: filter by effective_old_frags (not frag_bitmap)
  when collecting fragments to rebuild from. frag_bitmap already includes
  unindexed fragment IDs, so filtering by it and then extend_from_slice(unindexed)
  produced duplicate fragments.

- fts.rs: load_all_fts_segments now returns Vec<IndexMetadata> alongside the
  opened indices. Callers (MatchQueryExec, PhraseQueryExec) use it directly
  for build_prefilter instead of calling load_scalar_index again, which was
  a redundant RPC and only passed one segment's metadata to build_prefilter
  instead of all segments'.
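The duplicate-fragment bug in the first bullet can be illustrated with a small sketch (`rebuild_list` and the variable names are invented for this example; the real code filters fragment metadata in append.rs):

```rust
use std::collections::BTreeSet;

// `frag_bitmap` already contains the unindexed fragment ids, so
// filtering by it and then appending `unindexed` duplicated those ids;
// filtering by the old segments' coverage (`effective_old_frags`) does not.
fn rebuild_list(
    all_fragments: &[u32],
    filter: &BTreeSet<u32>,
    unindexed: &[u32],
) -> Vec<u32> {
    let mut out: Vec<u32> = all_fragments
        .iter()
        .copied()
        .filter(|f| filter.contains(f))
        .collect();
    out.extend_from_slice(unindexed); // mirrors the extend_from_slice call
    out
}

fn main() {
    let all = [0u32, 1, 2, 3];
    let unindexed = [2u32, 3];
    let effective_old_frags = BTreeSet::from([0u32, 1]);
    let frag_bitmap: FragSet =
        effective_old_frags.iter().copied().chain(unindexed).collect();

    let buggy = rebuild_list(&all, &frag_bitmap, &unindexed);
    let fixed = rebuild_list(&all, &effective_old_frags, &unindexed);
    assert_eq!(buggy, vec![0, 1, 2, 3, 2, 3]); // fragments 2 and 3 twice
    assert_eq!(fixed, vec![0, 1, 2, 3]); // each fragment exactly once
}

type FragSet = BTreeSet<u32>;
```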

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, multi_index_bm25_search built a HashMap<String, f32> idf_cache
and a Vec<f32> idf_by_position per partition result in the streaming loop.
Since all partitions share the same query Tokens with fixed term positions,
IDF weights can be precomputed once by position before the loop.
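A sketch of that precomputation (hypothetical names and toy counts; the real code operates on the query Tokens and per-partition candidates):

```rust
use std::collections::HashMap;

// Instead of building a HashMap<String, f32> idf cache per partition,
// compute idf once per query-token position, since every partition
// shares the same query tokens in the same positions.
fn idf(num_docs: u64, docs_with_term: u64) -> f32 {
    ((num_docs as f32 - docs_with_term as f32 + 0.5)
        / (docs_with_term as f32 + 0.5)
        + 1.0)
        .ln()
}

fn main() {
    let num_docs = 1000u64;
    let doc_freq: HashMap<&str, u64> =
        HashMap::from([("lance", 40), ("index", 200)]);
    let query_tokens = ["lance", "index", "lance"]; // fixed positions

    // Computed once, outside the per-partition merge loop:
    let idf_by_position: Vec<f32> = query_tokens
        .iter()
        .copied()
        .map(|t| idf(num_docs, doc_freq[t]))
        .collect();

    assert_eq!(idf_by_position.len(), 3);
    // The same token at positions 0 and 2 gets the same weight.
    assert_eq!(idf_by_position[0], idf_by_position[2]);
    println!("{idf_by_position:?}");
}
```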

This eliminates:
- idf_cache HashMap allocation and per-partition lookups/clones
- per-partition idf_by_position Vec allocation
- PartitionCandidates::tokens_by_position (no longer needed to carry
  token strings from the CPU task back to the merge loop)
- PostingIterator::token field and getter (only caller was
  tokens_by_position construction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>