
fix: handle logical rows deletion properly for zonemap and bloomfilter #5140

Merged
wjones127 merged 7 commits into lance-format:main from HaochengLIU:4758-fix-zonemap-deletion-handle on Nov 19, 2025

Conversation

@HaochengLIU
Member

@HaochengLIU HaochengLIU commented Nov 4, 2025

The old zonemap and bloom filter indexes define zones in terms of logical rows. This breaks when the rows are not contiguous, e.g. after deletions. This PR takes the fragment offset into account, so that zones are tracked by "physical row address" and deletions are handled correctly.

Both Rust and Python tests are added.

close #4758
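
To make the mismatch concrete, here is a minimal sketch (not Lance's implementation) of how a logical row position maps to a physical offset within one fragment once some rows have been deleted. The helper name `physical_offset` and the plain-slice representation of deletions are assumptions made for illustration.

```rust
/// Hypothetical helper: map the `logical`-th surviving row of a fragment
/// to its physical offset, given the physical offsets that were deleted.
/// Returns `None` if fewer than `logical + 1` rows survive.
fn physical_offset(num_physical_rows: u32, deleted: &[u32], logical: u32) -> Option<u32> {
    (0..num_physical_rows)
        // Keep only offsets that have not been deleted.
        .filter(|off| !deleted.contains(off))
        // The n-th survivor is the row a logical index refers to.
        .nth(logical as usize)
}
```

Without deletions the two numberings coincide. Once offsets 0 and 1 are deleted, logical row 0 lives at physical offset 2, so a zone boundary expressed in logical rows no longer points at the rows it was built from.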

@HaochengLIU HaochengLIU marked this pull request as draft November 4, 2025 16:21
@github-actions
Contributor

github-actions Bot commented Nov 4, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch 2 times, most recently from b5250e9 to 6645773 Compare November 4, 2025 16:59
@codecov-commenter

codecov-commenter commented Nov 4, 2025

Codecov Report

❌ Patch coverage is 91.06145% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.06%. Comparing base (9ed9ee2) to head (edf263f).
⚠️ Report is 50 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/scalar.rs | 91.01% | 18 Missing and 6 partials ⚠️ |
| rust/lance-index/src/scalar/bloomfilter.rs | 91.83% | 1 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/zonemap.rs | 90.24% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5140      +/-   ##
==========================================
+ Coverage   81.77%   82.06%   +0.28%     
==========================================
  Files         340      342       +2     
  Lines      140102   141835    +1733     
  Branches   140102   141835    +1733     
==========================================
+ Hits       114568   116396    +1828     
+ Misses      21729    21594     -135     
- Partials     3805     3845      +40     
Flag Coverage Δ
unittests 82.06% <91.06%> (+0.28%) ⬆️

@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch 5 times, most recently from 31fb346 to 1bcd634 Compare November 5, 2025 16:03
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch from 1bcd634 to a28f159 Compare November 5, 2025 16:08
@HaochengLIU HaochengLIU changed the title from "WIP fix: handle deletion properly with zonemap and bloomfilter" to "fix: handle log rows deletion properly for zonemap and bloomfilter" Nov 5, 2025
@github-actions github-actions Bot added the bug Something isn't working label Nov 5, 2025
@github-actions github-actions Bot added the python label Nov 5, 2025
@HaochengLIU HaochengLIU marked this pull request as ready for review November 5, 2025 16:53
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch from a76eda5 to 9973797 Compare November 5, 2025 16:54
@HaochengLIU HaochengLIU changed the title from "fix: handle log rows deletion properly for zonemap and bloomfilter" to "fix: handle logical rows deletion properly for zonemap and bloomfilter" Nov 5, 2025
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch 3 times, most recently from 339d914 to d5b9892 Compare November 5, 2025 20:51
Comment thread rust/lance/src/utils.rs Outdated
  // Public test utilities module - only available during testing
  #[cfg(test)]
- pub(crate) mod test;
+ pub mod test;
Member Author

so it can be used by other modules

Contributor

What modules are using this?

Member Author

dead code. Will remove.

@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch from d5b9892 to 8403a51 Compare November 6, 2025 15:50
pub dataset: Dataset,
}

impl Default for NoContextTestFixture {
Member Author

To suppress clippy warnings

Comment thread rust/lance/src/index/scalar.rs Outdated
Comment on lines +1243 to +1263
let after_index: Vec<arrow_array::RecordBatch> = ds
    .scan()
    .filter("value")
    .unwrap()
    .try_into_stream()
    .await
    .unwrap()
    .try_collect()
    .await
    .unwrap();

let after_ids: Vec<u64> = after_index[0]
    .column_by_name("id")
    .unwrap()
    .as_any()
    .downcast_ref::<arrow_array::UInt64Array>()
    .unwrap()
    .values()
    .iter()
    .copied()
    .collect();
Contributor

You can simplify this a lot:

  1. Use try_into_batch() to just get one RecordBatch from the scan
  2. Since we don't mind panicking in tests, you can directly index into the column with after_index["id"]
  3. Instead of collecting the values to a vec, just grab the ScalarBuffer. This implements AsRef<[T]>, so you can compare it to a slice in the assert_eq!() call.
Suggested change
let after_index = ds
    .scan()
    .filter("value")
    .unwrap()
    .try_into_batch()
    .await
    .unwrap();
let after_ids = after_index["id"]
    .as_any()
    .downcast_ref::<arrow_array::UInt64Array>()
    .unwrap()
    .values();

Member Author

It's cool... will rewrite

Comment thread rust/lance/src/index/scalar.rs Outdated
);
assert_eq!(
after_ids,
vec![0, 2, 4, 6, 8],
Contributor

Suggested change
- vec![0, 2, 4, 6, 8],
+ &[0, 2, 4, 6, 8],

Member Author

updated

  fragment_id: u64,
- // zone_start is the start row of the zone in the fragment, also known
- // as local row offset
+ // zone_start is the actual first row address (local offset within fragment)
Contributor

Local offset and row address are not the same thing.

Suggested change
- // zone_start is the actual first row address (local offset within fragment)
+ // zone_start is the start row of the zone in the fragment, also known
+ // as local row offset. To get the first row address, you can do
+ // `(fragment_id << 32) + zone_start`.

Member Author

true, will fix the wrong wording
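
The distinction drawn in this thread can be sketched as follows. The sketch assumes the layout given in the suggestion above, with the fragment id in the upper 32 bits of a 64-bit row address and the local offset in the lower 32 bits. `row_address` is a hypothetical helper, not a Lance API. Note that in Rust `+` binds tighter than `<<`, so the shift needs explicit parentheses.

```rust
/// Hypothetical helper: combine a fragment id and a local row offset
/// into a 64-bit row address (fragment id in the upper 32 bits).
fn row_address(fragment_id: u32, local_offset: u32) -> u64 {
    // Parenthesize the shift: `a << 32 + b` would parse as `a << (32 + b)`.
    (u64::from(fragment_id) << 32) + u64::from(local_offset)
}
```

For fragment 0 the address equals the local offset, which is why the two are easy to confuse. For any other fragment they differ by a multiple of 2^32.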

Comment on lines +57 to +58
// zone_length is the address span: (last_row_addr - first_row_addr + 1)
// AKA offset in the fragment, which allows handling non-contiguous addresses after deletions
Contributor

I don't understand this comment. What does it mean? How are deletions handled with respect to this?

Member Author

Rewrote the whole part and added two examples

Comment thread rust/lance-index/src/scalar/zonemap.rs Outdated
  fragment_id: u64,
- // zone_start is the start row of the zone in the fragment, also known
- // as local row offset
+ // zone_start is the actual first row address (local offset within fragment)
Contributor

Same comment applies here

Member Author

updated

Comment thread rust/lance-index/src/scalar/zonemap.rs Outdated
self.update_stats(&data_array.slice(array_offset, remaining))?;

// Track first and last row addresses (local offsets within fragment)
let first_addr = row_addrs_array.value(array_offset) & 0xFFFFFFFF;
Contributor

What does this do?

For clarity, could you use the RowAddress struct to manipulate row ids?

https://github.com/lancedb/lance/blob/5ba47fbd30c03f7787c4a43a3a7cad6012dcfad8/rust/lance-core/src/utils/address.rs#L7

Member Author

updated
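
For context, the masking expression in the snippet keeps only the lower 32 bits of the row address, i.e. the local offset within the fragment. Below is a minimal sketch of the typed alternative the reviewer asks for. `Addr` is a stand-in defined here for illustration; its method names are assumptions and may not match the actual `RowAddress` API in `lance-core`.

```rust
/// Illustrative stand-in for a row-address wrapper (not the lance-core type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Addr(u64);

impl Addr {
    /// Upper 32 bits: the fragment the row lives in.
    fn fragment_id(self) -> u32 {
        (self.0 >> 32) as u32
    }

    /// Lower 32 bits: the row's offset within that fragment.
    /// Equivalent to `self.0 & 0xFFFF_FFFF`.
    fn row_offset(self) -> u32 {
        self.0 as u32
    }
}
```

Named accessors make the intent explicit at the call site, where a bare `& 0xFFFFFFFF` leaves the reader to work out which half of the address is being kept.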

Comment thread rust/lance/src/index/scalar.rs Outdated
let mut ds = lance_datagen::gen_batch()
    .col("id", array::step::<UInt64Type>())
    .col("value", array::cycle_bool(vec![true, false]))
    .into_ram_dataset(FragmentCount::from(1), FragmentRowCount::from(10))
Contributor

Whenever you are testing something involving row addresses, it's really important you test with multiple fragments. You'll miss a lot of bugs if you don't.

Suggested change
- .into_ram_dataset(FragmentCount::from(1), FragmentRowCount::from(10))
+ .into_ram_dataset(FragmentCount::from(2), FragmentRowCount::from(10))
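
One way to see why a single-fragment test is not enough: for fragment 0 the row address and the local offset are numerically equal, so code that confuses the two still passes. The sketch below uses a deliberately buggy hypothetical function, `buggy_local_offset`, and assumes the fragment id sits in the upper 32 bits of the address.

```rust
/// Deliberately buggy: treats the whole row address as the local offset
/// instead of dropping the fragment id stored in the upper 32 bits.
fn buggy_local_offset(row_addr: u64) -> u64 {
    row_addr
}

/// Correct version: keep only the lower 32 bits.
fn local_offset(row_addr: u64) -> u64 {
    row_addr & 0xFFFF_FFFF
}
```

On an address in fragment 0 both functions agree. On an address in fragment 1 only `local_offset` returns the in-fragment position, so the bug shows up only once a second fragment exists.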

@wjones127 wjones127 self-assigned this Nov 10, 2025
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch from 28389c2 to aacad24 Compare November 11, 2025 21:56
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch 3 times, most recently from a0883fc to 2e9d62b Compare November 12, 2025 01:50
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch 2 times, most recently from 3021837 to 532c932 Compare November 12, 2025 16:36
@HaochengLIU HaochengLIU force-pushed the 4758-fix-zonemap-deletion-handle branch from 532c932 to 88ecb64 Compare November 12, 2025 16:41
Member

@westonpace westonpace left a comment

This seems correct but there is a lot of similarity between zone map and bloom filter still. Do you want to create an issue to reduce the code by creating some common abstractions? That way we can remember to tackle this at some point.

Comment thread rust/debug.output Outdated
@@ -0,0 +1,43 @@
diff --git a/Cargo.lock b/Cargo.lock
Member

Drop this file?

Comment on lines +53 to +65
//
// Example: Suppose we have two fragments, each with 4 rows.
// Fragment 0: zone_start = 0, zone_length = 4 // covers rows 0, 1, 2, 3 in fragment 0
// The row addresses for fragment 0 are: 0, 1, 2, 3
// Fragment 1: zone_start = 0, zone_length = 4 // covers rows 0, 1, 2, 3 in fragment 1
// The row addresses for fragment 1 are: (1 << 32), (1 << 32) + 1, (1 << 32) + 2, (1 << 32) + 3
//
// Deletion is 0 index based. We delete the 0th and 1st row in fragment 0,
// and the 1st and 2nd row in fragment 1.
// Fragment 0: zone_start = 2, zone_length = 2 // covers rows 2, 3 in fragment 0
// The row addresses for fragment 0 are: 2, 3
// Fragment 1: zone_start = 0, zone_length = 4 // covers rows 0, 3 in fragment 1
// The row addresses for fragment 1 are: (1 << 32), (1 << 32) + 3
Member

Is this example describing BloomFilterStatistics?
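
The numbers in the quoted example can be checked with a small sketch. It assumes, as the earlier code comment in this review states, that `zone_length` is the address span `last - first + 1` over the surviving rows. `zone_span` is a hypothetical helper that works on local offsets only.

```rust
/// Hypothetical helper: given the sorted local offsets of the rows that
/// survive in a zone, return `(zone_start, zone_length)`, where
/// `zone_length` is the span `last - first + 1`, not the surviving row count.
fn zone_span(surviving_offsets: &[u32]) -> Option<(u32, u32)> {
    let first = *surviving_offsets.first()?;
    let last = *surviving_offsets.last()?;
    Some((first, last - first + 1))
}
```

With rows 1 and 2 deleted from a 4-row fragment, the surviving offsets are 0 and 3, so the span is still 4 even though only two rows remain. With rows 0 and 1 deleted, the zone starts at 2 and spans 2.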

@HaochengLIU
Member Author

HaochengLIU commented Nov 12, 2025

> This seems correct but there is a lot of similarity between zone map and bloom filter still. Do you want to create an issue to reduce the code by creating some common abstractions? That way we can remember to tackle this at some point.

#5230

@HaochengLIU
Member Author

@wjones127 gentle ping: I will be out starting next Tuesday for a trip. If possible, I would like to merge the fix before I'm gone.

Contributor

@wjones127 wjones127 left a comment

I've added another test. I think this is good to go.

@wjones127
Contributor

Test failures unrelated.

@wjones127 wjones127 merged commit f0f3cb6 into lance-format:main Nov 19, 2025
22 of 25 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
lance-format#5140)

The old zonemap and bloom filter indexes define zones in terms of logical
rows. This breaks when the rows are not contiguous, e.g. after deletions.
This PR takes the fragment offset into account, so that zones are tracked
by "physical row address" and deletions are handled correctly.

Both Rust and Python tests are added.

close lance-format#4758

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Labels

bug Something isn't working python

Development

Successfully merging this pull request may close these issues.

Zone map and bloom filter don't seem to handle deletions correctly

4 participants