fix: handle logical rows deletion properly for zonemap and bloomfilter #5140
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

@@ Coverage Diff @@
## main #5140 +/- ##
==========================================
+ Coverage 81.77% 82.06% +0.28%
==========================================
Files 340 342 +2
Lines 140102 141835 +1733
Branches 140102 141835 +1733
==========================================
+ Hits 114568 116396 +1828
+ Misses 21729 21594 -135
- Partials 3805 3845 +40
- // Public test utilities module - only available during testing
- #[cfg(test)]
- pub(crate) mod test;
+ pub mod test;
so it can be used by other modules
What modules are using this?
dead code. Will remove.
pub dataset: Dataset,
}

+ impl Default for NoContextTestFixture {
To suppress clippy warnings
let after_index: Vec<arrow_array::RecordBatch> = ds
    .scan()
    .filter("value")
    .unwrap()
    .try_into_stream()
    .await
    .unwrap()
    .try_collect()
    .await
    .unwrap();

let after_ids: Vec<u64> = after_index[0]
    .column_by_name("id")
    .unwrap()
    .as_any()
    .downcast_ref::<arrow_array::UInt64Array>()
    .unwrap()
    .values()
    .iter()
    .copied()
    .collect();
You can simplify this a lot:
- Use `try_into_batch()` to just get one `RecordBatch` from the scan
- Since we don't mind panicking in tests, you can directly index into the column with `after_index["id"]`
- Instead of collecting the values to a vec, just grab the `ScalarBuffer`. This implements `AsRef<[T]>`, so you can compare it to a slice in the `assert_eq!()` call.
Suggested change:
- let after_index: Vec<arrow_array::RecordBatch> = ds
-     .scan()
-     .filter("value")
-     .unwrap()
-     .try_into_stream()
-     .await
-     .unwrap()
-     .try_collect()
-     .await
-     .unwrap();
- let after_ids: Vec<u64> = after_index[0]
-     .column_by_name("id")
-     .unwrap()
-     .as_any()
-     .downcast_ref::<arrow_array::UInt64Array>()
-     .unwrap()
-     .values()
-     .iter()
-     .copied()
-     .collect();
+ let after_index = ds
+     .scan()
+     .filter("value")
+     .unwrap()
+     .try_into_batch()
+     .await
+     .unwrap();
+ let after_ids = after_index["id"]
+     .as_any()
+     .downcast_ref::<arrow_array::UInt64Array>()
+     .unwrap()
+     .values();
It's cool... will rewrite
);
assert_eq!(
    after_ids,
    vec![0, 2, 4, 6, 8],
Suggested change:
- vec![0, 2, 4, 6, 8],
+ &[0, 2, 4, 6, 8],
fragment_id: u64,
- // zone_start is the start row of the zone in the fragment, also known
- // as local row offset
+ // zone_start is the actual first row address (local offset within fragment)
Local offset and row address are not the same thing.
Suggested change:
- // zone_start is the actual first row address (local offset within fragment)
+ // zone_start is the start row of the zone in the fragment, also known
+ // as local row offset. To get the first row address, you can do
+ // `(fragment_id << 32) + zone_start`.
true, will fix the wrong wording
// zone_length is the address span: (last_row_addr - first_row_addr + 1)
// AKA offset in the fragment, which allows handling non-contiguous addresses after deletions
I don't understand this comment. What does it mean? How are deletions handled with respect to this?
Rewrote the whole part and added two examples
fragment_id: u64,
- // zone_start is the start row of the zone in the fragment, also known
- // as local row offset
+ // zone_start is the actual first row address (local offset within fragment)
Same comment applies here
self.update_stats(&data_array.slice(array_offset, remaining))?;

// Track first and last row addresses (local offsets within fragment)
let first_addr = row_addrs_array.value(array_offset) & 0xFFFFFFFF;
What does this do?
For clarity, could you use the RowAddress struct to manipulate row ids?
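The masking in the hunk above extracts the low 32 bits of a packed row address. A minimal sketch of the idea behind the reviewer's suggestion, assuming the 32/32-bit address split described elsewhere in this thread (the struct and method names here are hypothetical, not the actual Lance `RowAddress` API):

```rust
// Hypothetical wrapper (loosely modeled on Lance's RowAddress) so call
// sites don't hand-roll `& 0xFFFFFFFF`. A row address packs the fragment
// id in the upper 32 bits and the local row offset in the lower 32 bits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowAddr(u64);

impl RowAddr {
    pub fn new(fragment_id: u32, row_offset: u32) -> Self {
        RowAddr(((fragment_id as u64) << 32) | row_offset as u64)
    }

    /// Upper 32 bits: which fragment the row lives in.
    pub fn fragment_id(self) -> u32 {
        (self.0 >> 32) as u32
    }

    /// Lower 32 bits: local offset within the fragment
    /// (equivalent to masking the raw address with 0xFFFFFFFF).
    pub fn row_offset(self) -> u32 {
        self.0 as u32
    }
}

fn main() {
    let addr = RowAddr::new(1, 3);
    assert_eq!(addr.fragment_id(), 1);
    assert_eq!(addr.row_offset(), 3);
}
```

With a wrapper like this, the intent of `value & 0xFFFFFFFF` at each call site becomes self-documenting.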
let mut ds = lance_datagen::gen_batch()
    .col("id", array::step::<UInt64Type>())
    .col("value", array::cycle_bool(vec![true, false]))
    .into_ram_dataset(FragmentCount::from(1), FragmentRowCount::from(10))
Whenever you are testing something involving row addresses, it's really important you test with multiple fragments. You'll miss a lot of bugs if you don't.
Suggested change:
- .into_ram_dataset(FragmentCount::from(1), FragmentRowCount::from(10))
+ .into_ram_dataset(FragmentCount::from(2), FragmentRowCount::from(10))
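The advice about using multiple fragments can be made concrete with a tiny sketch (the helper function is hypothetical, assuming the `fragment_id << 32 | offset` layout discussed in this thread): with a single fragment, `fragment_id` is 0, so a packed row address is numerically identical to the local offset, and code that confuses the two still passes.

```rust
// Hypothetical helper, assuming the 32/32-bit row address layout:
// upper 32 bits = fragment id, lower 32 bits = local row offset.
fn row_address(fragment_id: u64, offset: u64) -> u64 {
    (fragment_id << 32) | offset
}

fn main() {
    // Fragment 0: the packed address collapses to the plain offset, so a
    // test with one fragment cannot tell row addresses and offsets apart.
    assert_eq!(row_address(0, 5), 5);
    // Fragment 1: the two diverge, exposing code that mixed them up.
    assert_ne!(row_address(1, 5), 5);
    assert_eq!(row_address(1, 5), (1u64 << 32) + 5);
}
```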
westonpace
left a comment
This seems correct but there is a lot of similarity between zone map and bloom filter still. Do you want to create an issue to reduce the code by creating some common abstractions? That way we can remember to tackle this at some point.
//
// Example: Suppose we have two fragments, each with 4 rows.
// Fragment 0: zone_start = 0, zone_length = 4 // covers rows 0, 1, 2, 3 in fragment 0
// The row addresses for fragment 0 are: 0, 1, 2, 3
// Fragment 1: zone_start = 0, zone_length = 4 // covers rows 0, 1, 2, 3 in fragment 1
// The row addresses for fragment 1 are: 1 << 32, (1 << 32) + 1, (1 << 32) + 2, (1 << 32) + 3
//
// Deletion is 0-index based. We delete the 0th and 1st rows in fragment 0,
// and the 1st and 2nd rows in fragment 1:
// Fragment 0: zone_start = 2, zone_length = 2 // covers rows 2, 3 in fragment 0
// The row addresses for fragment 0 are: 2, 3
// Fragment 1: zone_start = 0, zone_length = 4 // covers rows 0, 3 in fragment 1
// The row addresses for fragment 1 are: 1 << 32, (1 << 32) + 3
Is this example describing BloomFilterStatistics?
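Whichever statistics struct the example describes, its arithmetic can be sketched directly (a hedged illustration; the helper name is not from the codebase): `zone_length` is an address span over possibly non-contiguous local offsets, so after deletions it can exceed the number of surviving rows.

```rust
// Sketch: zone_start/zone_length as an address span. After deletions the
// span (last_offset - first_offset + 1) can be larger than the count of
// surviving rows in the zone.
fn zone_bounds(local_offsets: &[u64]) -> (u64, u64) {
    let first = *local_offsets.first().expect("zone must be non-empty");
    let last = *local_offsets.last().expect("zone must be non-empty");
    (first, last - first + 1) // (zone_start, zone_length)
}

fn main() {
    // Fragment 1 from the example: rows 1 and 2 deleted, rows 0 and 3 remain.
    // Only 2 rows survive, but the zone spans 4 addresses.
    assert_eq!(zone_bounds(&[0, 3]), (0, 4));
    // Fragment 0 from the example: rows 0 and 1 deleted, rows 2 and 3 remain.
    assert_eq!(zone_bounds(&[2, 3]), (2, 2));
}
```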
@wjones127 gentle ping. I will be out starting next Tuesday for a trip; if possible, I would like to merge the fix before I'm gone.
wjones127
left a comment
I've added another test. I think this is good to go.
Test failures unrelated.
fix: handle logical rows deletion properly for zonemap and bloomfilter (lance-format#5140)

The old zonemap and bloomfilter used the logical row concept to define zones. This breaks when rows are not contiguous, e.g. after a deletion. This PR ensures that the fragment offset is taken into consideration so that the new "physical row address" handles the deletion scenario.

Both Rust and Python tests are added.

Closes lance-format#4758

Co-authored-by: Will Jones <willjones127@gmail.com>