
fix: respect the old data filter on inverted index#6216

Merged
westonpace merged 3 commits into lance-format:main from westonpace:fix/fts-respect-old-data-filter
Mar 24, 2026

Conversation

@westonpace
Member

This is basically the same problem encountered in #5929

However, the solution there (and in other indices fixed since then) has been to prune indices during the update phase to remove old fragments. Unfortunately, FTS indices may be too large for this pruning to make sense (it would require a complete scan of the old index), though I haven't verified how bad the performance would be.

This PR instead keeps track of which fragments have been invalidated. It then uses this as a block-list at search time.
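The block-list idea can be sketched as follows. This is a minimal illustration, not the PR's code: `HashSet<u32>` stands in for the `RoaringBitmap` the PR uses, and row addresses are assumed to pack the fragment id into their upper 32 bits.

```rust
use std::collections::HashSet;

// A row address is assumed to carry the fragment id in its upper 32 bits,
// so membership in the block-list is a shift plus a set lookup.
fn fragment_id(row_addr: u64) -> u32 {
    (row_addr >> 32) as u32
}

// Drop any search result whose fragment has been invalidated.
fn filter_results(results: &[u64], invalidated: &HashSet<u32>) -> Vec<u64> {
    results
        .iter()
        .copied()
        .filter(|addr| !invalidated.contains(&fragment_id(*addr)))
        .collect()
}
```

The point of the design is that invalidation is O(1) at update time (insert one fragment id into the bitmap) and the cost is paid only on the search results actually returned.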

@github-actions github-actions Bot added the bug Something isn't working label Mar 17, 2026
#[derive(Debug, Clone)]
pub enum OldIndexDataFilter {
    /// Keep old rows whose row-address fragment is in this bitmap.
    /// Keeps track of which fragments are still valid and which are no longer valid.
Collaborator

Does this fix work for tables with stable row IDs?

Member Author

Probably not but I'm a little vague on how indices work with stable row ids in general. Do we store the stable row id in the index? Or are we still storing row addresses?

Can we just filter search results against the list of valid row ids? If a dataset has an ordered list of valid row ids in memory then it should just be one O(N) pass against the index search results?
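The single-pass idea can be sketched like this (a hypothetical illustration, assuming both the search results and the valid-id list are sorted ascending; it runs in O(N + M) with two cursors):

```rust
// Keep only search results whose row id appears in the sorted valid list.
// One forward pass over each slice; the `valid` cursor never moves backward.
fn filter_sorted(results: &[u64], valid: &[u64]) -> Vec<u64> {
    let mut kept = Vec::new();
    let mut v = 0;
    for &r in results {
        // Advance the valid cursor past ids smaller than this result.
        while v < valid.len() && valid[v] < r {
            v += 1;
        }
        if v < valid.len() && valid[v] == r {
            kept.push(r);
        }
    }
    kept
}
```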

I'll add this as a follow-up task.

@codecov

codecov Bot commented Mar 18, 2026

Commits:

- Use invalidated fragments in FTS filter
- fix test that doesn't write frag bitmap
- Fix bench. Address clippy
Contributor

@wjones127 wjones127 left a comment

Thanks for working on this. It's really good that we're fixing this.

I have two questions I want to address before moving forward. The rest is non-blocking feedback.

Comment on lines +1621 to +1623
let part_invalid_fragments =
    RoaringBitmap::deserialize_from(invalid_fragments_arr.value(0))?;
invalidated_fragments.extend(part_invalid_fragments);
Contributor

question: when might parts have different invalid fragments? Do we correctly handle where we invalidate fragment 1 in index 1, then index fragment 1 in index 2, and then try to merge index 1 and 2?

Member Author

I'm not sure I understand how this would happen. Fragments are invalidated during compaction because they've been deleted. How would we then go and index fragment 1 in index 2?

In any case, once a fragment is invalidated it never becomes "uninvalidated" (we can't restore a fragment with its original ID), so I think the current behavior (union of all invalidated fragments) should be safe.

Contributor

To clarify, I'm not talking about invalidation due to deletion. I'm talking about invalidation due to data update. For example, like in this test:

#[tokio::test]
async fn test_data_replacement_invalidates_index_bitmap() {

Sequence can go like:

  1. Write fragments 1 and 2.
  2. Create an FTS index covering fragments 1 and 2.
  3. Update fragment 1 in-place using either merge_insert or DataReplacement. This removes fragment 1 from the fragment_bitmap.
  4. Call optimize index to create a new segment; it will see that fragment 1 is unindexed and create an index segment covering it.
  5. Call optimize index to merge both segments. The first index segment contains the old, invalidated values for fragment 1. The second index segment contains the new, valid values for fragment 1. We should keep the ones from the second.

Basically, when rows are updated in-place, we need to invalidate them in the index. But then we also need to allow re-indexing them incrementally.
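The hazard in the sequence above can be sketched as follows. This is a minimal illustration with `HashSet<u32>` standing in for the fragment bitmaps; `merge_dropping_reindexed` is one hypothetical resolution, not what this PR implements.

```rust
use std::collections::HashSet;

// Naive merge: take the union of both segments' invalidated sets. If an
// older segment invalidated fragment 1 but a newer segment has since
// re-indexed it, the union still blocks fragment 1's valid new rows.
fn union_invalidated(a: &HashSet<u32>, b: &HashSet<u32>) -> HashSet<u32> {
    a.union(b).copied().collect()
}

// A possible fix (an assumption, not the PR's behavior): when merging,
// drop from the older segment's invalidated set any fragment the newer
// segment has re-indexed, since the newer segment's values supersede it.
fn merge_dropping_reindexed(
    older_invalid: &HashSet<u32>,
    newer_covered: &HashSet<u32>,
) -> HashSet<u32> {
    older_invalid.difference(newer_covered).copied().collect()
}
```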

Member Author

Ok, I think you are correct that this would be a problem. The merged segment would still contain 1 in its deleted_fragments. If we keep it, we will miss results from the new data. If we drop it, we will get results from the old data.

Member Author

Ok. This is a problem with IVF/PQ indexes as well. For now I'd like to move forward with this PR, since it doesn't actually make things worse and DataReplacement operations are currently less common. But I think we have more problems to solve. I have created a follow-up at #6283

Comment on lines +2652 to +2655
// Remap the index via the ScalarIndex trait method (identity mapping)
use crate::scalar::ScalarIndex;
let mapping = HashMap::from([(0u64, Some(0u64))]);
index.remap(&mapping, dest_store.as_ref()).await.unwrap();
Contributor

question: do we do identity remapping today? Do we test the common case of remapping to new fragments?

Member Author

Fair point, Claude generated this test. We shouldn't ever do identity remapping. I'll fix it.

Member Author

Fixed. That being said, remapping is pretty trivial here. A deleted fragment should never appear in the remapping. All we need to do here is make sure we don't lose the deleted fragments list.
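The invariant described above can be sketched like this (a minimal sketch, not the actual `remap` implementation; the function and field names are illustrative):

```rust
use std::collections::{HashMap, HashSet};

// Remapping rewrites row ids (None means the row was deleted), but the
// deleted-fragments set must be carried through unchanged. A deleted
// fragment should never appear as a remap target in `mapping`.
fn remap_rows(
    rows: &[u64],
    mapping: &HashMap<u64, Option<u64>>,
    deleted_fragments: &HashSet<u32>,
) -> (Vec<u64>, HashSet<u32>) {
    let remapped: Vec<u64> = rows
        .iter()
        .filter_map(|r| mapping.get(r).copied().flatten())
        .collect();
    // The deleted-fragments list survives remapping verbatim.
    (remapped, deleted_fragments.clone())
}
```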

Comment on lines +2848 to +2851
// Use RowIds filter instead of Fragments — should not affect invalidated_fragments
let mut valid_ids = lance_core::utils::mask::RowAddrTreeMap::new();
valid_ids.insert(0);
let old_data_filter = Some(crate::scalar::OldIndexDataFilter::RowIds(valid_ids));
Contributor

question (non-blocking): how do we handle invalidated stable row_ids? I'm fine with handling this in a follow up issue if it's not solved yet.

Member Author

We do not. I already created a follow-up (#6262) but I'm not entirely sure we need to handle the stable row id case at all since the result of a stable row id search is filtered by an allow-list of valid stable row ids at the moment. Though that behavior may change someday.

Comment on lines +346 to +348
// Fragments which are contained in the index, but no longer in the dataset.
// These should be pruned at search time since we don't prune them at update time.
invalidated_fragments: RoaringBitmap,
Contributor

nitpick: when I say "invalidated fragments" elsewhere, I mean something different from deleted fragments. I mean that those fragments have had the column value updated in-place since this index segment was created. It might be nice to keep this terminology consistent.

Member Author

I renamed all invalidated_fragments to deleted_fragments.

Comment thread rust/lance-table/src/format/index.rs Outdated
.map(|files| files.iter().map(|f| f.size_bytes).sum())
}

pub fn ineffective_fragment_bitmap(
Contributor

nitpick: "ineffective" seems like a weird way of putting it, but I'm blanking on a better name right now. I'm guessing it's in reference to the effective_fragment_bitmap?

Member Author

Yes. I suppose deleted_fragment_bitmap would probably be fine?

@westonpace westonpace requested a review from wjones127 March 23, 2026 21:12
@westonpace westonpace merged commit d3803a9 into lance-format:main Mar 24, 2026
33 checks passed
westonpace added a commit that referenced this pull request Mar 24, 2026
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026