Skip to content

fix: avoid fts corruption when a fragment is deleted#6171

Closed
westonpace wants to merge 1 commit intolance-format:mainfrom
westonpace:fix/avoid-fts-corruption-deleted-fragment
Closed

fix: avoid fts corruption when a fragment is deleted#6171
westonpace wants to merge 1 commit intolance-format:mainfrom
westonpace:fix/avoid-fts-corruption-deleted-fragment

Conversation

@westonpace
Copy link
Copy Markdown
Member

@westonpace westonpace commented Mar 11, 2026

This is basically the same problem encountered in #5929

If an entire fragment is deleted then, after an FTS update, the rows in that deleted fragment will still be indexed but shouldn't be. This fix uses the old_data_filter to properly prune the old data during the update.


This PR was created with the help of Claude Opus 4.6. I take responsibility for all code.

@github-actions github-actions Bot added the bug Something isn't working label Mar 11, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review

Clean and well-structured fix. The approach of threading old_data_filter through the inverted index merge pipeline is the right one, consistent with how other scalar indices handle this.

One minor observation

build_empty_batch hardcodes schema assumptions (index.rs:~1963-1987): The new build_empty_batchmanually constructs columns to match the posting list schema, duplicating the column layout knowledge that lives ininverted_list_schema()/build_batch(). If the schema ever changes (e.g., a new column is added), this function would silently produce a mismatched batch. Consider whether you could instead reuse build_batchwith an empty compressed array, or at least add a debug assertion that the constructed batch schema matchesinverted_list_schema(self.has_positions())`.

Looks good

  • The is_old tagging on PartitionSource is clean and correctly distinguishes old vs new partitions.
  • The old_doc_id_map approach in the merger correctly filters docs and remaps posting list entries in a single pass.
  • The single-partition shortcut bypass (len() <= 1 && old_data_filter.is_none()) is correct — can't just copy files when filtering is needed.
  • Filtering out empty postings before creating PostingIterator is a good defensive measure.
  • The regression test is thorough: covers merge_insert → compact → optimize → search, and validates both stale data removal and new data correctness.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 11, 2026

@westonpace
Copy link
Copy Markdown
Member Author

Superseded by #6216

@westonpace westonpace closed this Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant