fix: respect the old data filter on inverted index#6216
fix: respect the old data filter on inverted index#6216westonpace merged 3 commits intolance-format:mainfrom
Conversation
| #[derive(Debug, Clone)] | ||
| pub enum OldIndexDataFilter { | ||
| /// Keep old rows whose row-address fragment is in this bitmap. | ||
| /// Keeps track of which fragments are still valid and which are no longer valid. |
There was a problem hiding this comment.
Does this fix work for tables with stable row IDs?
There was a problem hiding this comment.
Probably not but I'm a little vague on how indices work with stable row ids in general. Do we store the stable row id in the index? Or are we still storing row addresses?
Can we just filter search results against the list of valid row ids? If a dataset has an ordered list of valid row ids in memory then it should just be one O(N) pass against the index search results?
I'll add this as a follow-up task.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Use invalidated fragments in FTS filter fix test that doesn't write frag bitmap Fix bench. Address clippy
3eb26e9 to
2e7c3b9
Compare
wjones127
left a comment
There was a problem hiding this comment.
Thanks for working on this. Really good we are fixing this.
I have two questions I want to address before moving forward. The rest are non-blocking feedback.
| let part_invalid_fragments = | ||
| RoaringBitmap::deserialize_from(invalid_fragments_arr.value(0))?; | ||
| invalidated_fragments.extend(part_invalid_fragments); |
There was a problem hiding this comment.
question: when might parts have different invalid fragments? Do we correctly handle where we invalidate fragment 1 in index 1, then index fragment 1 in index 2, and then try to merge index 1 and 2?
There was a problem hiding this comment.
I'm not sure I understand how this would happen. Fragments are invalidated during compaction because they've been deleted. How would we then go and index fragment 1 in index 2?
In any case, once a fragment is invalidated it never becomes "uninvalidated" (can't restore a fragment with it's original ID) so I think the current behavior (union of all invalidated fragments) should be safe.
There was a problem hiding this comment.
To clarify, I'm not talking about invalidation due to deletion. I'm talking about invalidation to data update. For example, like in this test:
lance/rust/lance/src/dataset/tests/dataset_merge_update.rs
Lines 1611 to 1612 in 244c721
Sequence can go like:
- Write fragments 1 and 2.
- Create FTS index covering fragment 1 and 2
- Update fragment 1 in-place using either merge_insert or DataReplacement. This removes fragment 1 from the
fragment_bitmap. - Call optimize index to create a new segment, will see fragment 1 is unindexed and create an index segment covering fragment 1
- Call optimize index to merge both segments. The first index segment contains the old, invalidated values for fragment 1. The second index segment contains the new, valid values for fragment 1. We should keep the ones from the second.
Basically, when rows are updated in-place, we need to invalidate them in the index. But then we also need to allow re-indexing them incrementally.
There was a problem hiding this comment.
Ok, I think you are correct that would be a problem. The merged segment would still contain 1 in its deleted_fragments. If we keep it, we will miss results from the new data. If we drop it we will get results from the old data.
There was a problem hiding this comment.
Ok. This is a problem with IVF/PQ indexes as well. At the moment I kind of want to move forward with this PR as it doesn't actually make things worse and DataReplacement operations are less common at the moment. But...I think we have more problems to solve. I have created a follow-up at #6283
| // Remap the index via the ScalarIndex trait method (identity mapping) | ||
| use crate::scalar::ScalarIndex; | ||
| let mapping = HashMap::from([(0u64, Some(0u64))]); | ||
| index.remap(&mapping, dest_store.as_ref()).await.unwrap(); |
There was a problem hiding this comment.
question: do we do identity remapping today? Do we test the common case of remapping to new fragments?
There was a problem hiding this comment.
Fair point, Claude generated this test. We shouldn't ever do identity remapping. I'll fix it.
There was a problem hiding this comment.
Fixed. That being said, remapping is pretty trivial here. A deleted fragment should never appear in the remapping. All we need to do here is make sure we don't lose the deleted fragments list.
| // Use RowIds filter instead of Fragments — should not affect invalidated_fragments | ||
| let mut valid_ids = lance_core::utils::mask::RowAddrTreeMap::new(); | ||
| valid_ids.insert(0); | ||
| let old_data_filter = Some(crate::scalar::OldIndexDataFilter::RowIds(valid_ids)); |
There was a problem hiding this comment.
question (non-blocking): how do we handle invalidated stable row_ids? I'm fine with handling this in a follow up issue if it's not solved yet.
There was a problem hiding this comment.
We do not. I already created a follow-up (#6262) but I'm not entirely sure we need to handle the stable row id case at all since the result of a stable row id search is filtered by an allow-list of valid stable row ids at the moment. Though that behavior may change someday.
| // Fragments which are contained in the index, but no longer in the dataset. | ||
| // These should be pruned at search time since we don't prune them at update time. | ||
| invalidated_fragments: RoaringBitmap, |
There was a problem hiding this comment.
nitpick: when I say "invalidated fragments" elsewhere, I mean something different that deleted fragments. I mean that those fragments have had the column value updated in-place since this index segment was created. It might be nice to keep consistency on this terminology.
There was a problem hiding this comment.
I renamed all invalidated_fragments to deleted_fragments.
| .map(|files| files.iter().map(|f| f.size_bytes).sum()) | ||
| } | ||
|
|
||
| pub fn ineffective_fragment_bitmap( |
There was a problem hiding this comment.
nitpick: "ineffective" seems like a weird way of putting it, but I'm blanking on a better name right now. I'm guessing it's in reference to the effective_fragment_bitmap?
There was a problem hiding this comment.
Yes. I suppose deleted_fragment_bitmap would probably be fine.?
This is basically the same problem encountered in #5929 However, the solution there (and in other indices fixed since then) has been to prune indices during the update phase to remove old fragments. Unfortunately, FTS indices are maybe too large for this pruning to make sense (it would require a complete scan through of the old index) though I haven't verified how bad the performance would be. This PR instead keeps track of which fragments have been invalidated. It then uses this as a block-list at search time.
This is basically the same problem encountered in lance-format#5929 However, the solution there (and in other indices fixed since then) has been to prune indices during the update phase to remove old fragments. Unfortunately, FTS indices are maybe too large for this pruning to make sense (it would require a complete scan through of the old index) though I haven't verified how bad the performance would be. This PR instead keeps track of which fragments have been invalidated. It then uses this as a block-list at search time.
This is basically the same problem encountered in #5929
However, the solution there (and in other indices fixed since then) has been to prune indices during the update phase to remove old fragments. Unfortunately, FTS indices are maybe too large for this pruning to make sense (it would require a complete scan through of the old index) though I haven't verified how bad the performance would be.
This PR instead keeps track of which fragments have been invalidated. It then uses this as a block-list at search time.