fix: handle logical rows deletion properly for zonemap and bloomfilter #5140
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

@@ Coverage Diff @@
## main #5140 +/- ##
==========================================
+ Coverage 81.77% 82.06% +0.28%
==========================================
Files 340 342 +2
Lines 140102 141835 +1733
Branches 140102 141835 +1733
==========================================
+ Hits 114568 116396 +1828
+ Misses 21729 21594 -135
- Partials 3805 3845 +40
- // Public test utilities module - only available during testing
- #[cfg(test)]
- pub(crate) mod test;
+ pub mod test;
so it can be used by other modules
What modules are using this?
dead code. Will remove.
pub dataset: Dataset,
}

+ impl Default for NoContextTestFixture {
To suppress clippy warnings
let after_index: Vec<arrow_array::RecordBatch> = ds
    .scan()
    .filter("value")
    .unwrap()
    .try_into_stream()
    .await
    .unwrap()
    .try_collect()
    .await
    .unwrap();

let after_ids: Vec<u64> = after_index[0]
    .column_by_name("id")
    .unwrap()
    .as_any()
    .downcast_ref::<arrow_array::UInt64Array>()
    .unwrap()
    .values()
    .iter()
    .copied()
    .collect();
You can simplify this a lot:
- Use `try_into_batch()` to just get one `RecordBatch` from the scan
- Since we don't mind panicking in tests, you can directly index into the column with `after_index["id"]`
- Instead of collecting the values to a vec, just grab the `ScalarBuffer`. This implements `AsRef<[T]>`, so you can compare it to a slice in the `assert_eq!()` call.
Suggested change:
- let after_index: Vec<arrow_array::RecordBatch> = ds
-     .scan()
-     .filter("value")
-     .unwrap()
-     .try_into_stream()
-     .await
-     .unwrap()
-     .try_collect()
-     .await
-     .unwrap();
- let after_ids: Vec<u64> = after_index[0]
-     .column_by_name("id")
-     .unwrap()
-     .as_any()
-     .downcast_ref::<arrow_array::UInt64Array>()
-     .unwrap()
-     .values()
-     .iter()
-     .copied()
-     .collect();
+ let after_index = ds
+     .scan()
+     .filter("value")
+     .unwrap()
+     .try_into_batch()
+     .await
+     .unwrap();
+ let after_ids = after_index["id"]
+     .as_any()
+     .downcast_ref::<arrow_array::UInt64Array>()
+     .unwrap()
+     .values();
It's cool... will rewrite
);
assert_eq!(
    after_ids,
    vec![0, 2, 4, 6, 8],
Suggested change:
- vec![0, 2, 4, 6, 8],
+ &[0, 2, 4, 6, 8],
fragment_id: u64,
- // zone_start is the start row of the zone in the fragment, also known
- // as local row offset
+ // zone_start is the actual first row address (local offset within fragment)
Local offset and row address are not the same thing.
Suggested change:
- // zone_start is the actual first row address (local offset within fragment)
+ // zone_start is the start row of the zone in the fragment, also known
+ // as local row offset. To get the first row address, you can do
+ // `(fragment_id << 32) + zone_start`.
true, will fix the wrong wording
// zone_length is the address span: (last_row_addr - first_row_addr + 1)
// AKA offset in the fragment, which allows handling non-contiguous addresses after deletions
I don't understand this comment. What does it mean? How are deletions handled with respect to this?
Rewrote the whole part and added two examples
fragment_id: u64,
- // zone_start is the start row of the zone in the fragment, also known
- // as local row offset
+ // zone_start is the actual first row address (local offset within fragment)
Same comment applies here
self.update_stats(&data_array.slice(array_offset, remaining))?;

// Track first and last row addresses (local offsets within fragment)
let first_addr = row_addrs_array.value(array_offset) & 0xFFFFFFFF;
What does this do?
For clarity, could you use the RowAddress struct to manipulate row ids?
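The masking in the hunk above extracts the low 32 bits of a packed row address. A minimal sketch of the idea behind the reviewer's suggestion, assuming the 32/32-bit address split described elsewhere in this thread (the struct and method names here are hypothetical, not the actual Lance `RowAddress` API):

```rust
// Hypothetical wrapper (loosely modeled on Lance's RowAddress) so call
// sites don't hand-roll `& 0xFFFFFFFF`. A row address packs the fragment
// id in the upper 32 bits and the local row offset in the lower 32 bits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowAddr(u64);

impl RowAddr {
    pub fn new(fragment_id: u32, row_offset: u32) -> Self {
        RowAddr(((fragment_id as u64) << 32) | row_offset as u64)
    }

    /// Upper 32 bits: which fragment the row lives in.
    pub fn fragment_id(self) -> u32 {
        (self.0 >> 32) as u32
    }

    /// Lower 32 bits: local offset within the fragment
    /// (equivalent to masking the raw address with 0xFFFFFFFF).
    pub fn row_offset(self) -> u32 {
        self.0 as u32
    }
}

fn main() {
    let addr = RowAddr::new(1, 3);
    assert_eq!(addr.fragment_id(), 1);
    assert_eq!(addr.row_offset(), 3);
}
```

With a wrapper like this, the intent of `value & 0xFFFFFFFF` at each call site becomes self-documenting.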
let mut ds = lance_datagen::gen_batch()
    .col("id", array::step::<UInt64Type>())
    .col("value", array::cycle_bool(vec![true, false]))
    .into_ram_dataset(FragmentCount::from(1), FragmentRowCount::from(10))
Whenever you are testing something involving row addresses, it's really important you test with multiple fragments. You'll miss a lot of bugs if you don't.
Suggested change:
- .into_ram_dataset(FragmentCount::from(1), FragmentRowCount::from(10))
+ .into_ram_dataset(FragmentCount::from(2), FragmentRowCount::from(10))
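The advice about using multiple fragments can be made concrete with a tiny sketch (the helper function is hypothetical, assuming the `fragment_id << 32 | offset` layout discussed in this thread): with a single fragment, `fragment_id` is 0, so a packed row address is numerically identical to the local offset, and code that confuses the two still passes.

```rust
// Hypothetical helper, assuming the 32/32-bit row address layout:
// upper 32 bits = fragment id, lower 32 bits = local row offset.
fn row_address(fragment_id: u64, offset: u64) -> u64 {
    (fragment_id << 32) | offset
}

fn main() {
    // Fragment 0: the packed address collapses to the plain offset, so a
    // test with one fragment cannot tell row addresses and offsets apart.
    assert_eq!(row_address(0, 5), 5);
    // Fragment 1: the two diverge, exposing code that mixed them up.
    assert_ne!(row_address(1, 5), 5);
    assert_eq!(row_address(1, 5), (1u64 << 32) + 5);
}
```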
westonpace
left a comment
This seems correct but there is a lot of similarity between zone map and bloom filter still. Do you want to create an issue to reduce the code by creating some common abstractions? That way we can remember to tackle this at some point.
//
// Example: Suppose we have two fragments, each with 4 rows.
// Fragment 0: zone_start = 0, zone_length = 4 // covers rows 0, 1, 2, 3 in fragment 0
// The row addresses for fragment 0 are: 0, 1, 2, 3
// Fragment 1: zone_start = 0, zone_length = 4 // covers rows 0, 1, 2, 3 in fragment 1
// The row addresses for fragment 1 are: 1 << 32, (1 << 32) + 1, (1 << 32) + 2, (1 << 32) + 3
//
// Deletion is 0-index based. We delete the 0th and 1st rows in fragment 0,
// and the 1st and 2nd rows in fragment 1:
// Fragment 0: zone_start = 2, zone_length = 2 // covers rows 2, 3 in fragment 0
// The row addresses for fragment 0 are: 2, 3
// Fragment 1: zone_start = 0, zone_length = 4 // covers rows 0, 3 in fragment 1
// The row addresses for fragment 1 are: 1 << 32, (1 << 32) + 3
Is this example describing BloomFilterStatistics?
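Whichever statistics struct the example describes, its arithmetic can be sketched directly (a hedged illustration; the helper name is not from the codebase): `zone_length` is an address span over possibly non-contiguous local offsets, so after deletions it can exceed the number of surviving rows.

```rust
// Sketch: zone_start/zone_length as an address span. After deletions the
// span (last_offset - first_offset + 1) can be larger than the count of
// surviving rows in the zone.
fn zone_bounds(local_offsets: &[u64]) -> (u64, u64) {
    let first = *local_offsets.first().expect("zone must be non-empty");
    let last = *local_offsets.last().expect("zone must be non-empty");
    (first, last - first + 1) // (zone_start, zone_length)
}

fn main() {
    // Fragment 1 from the example: rows 1 and 2 deleted, rows 0 and 3 remain.
    // Only 2 rows survive, but the zone spans 4 addresses.
    assert_eq!(zone_bounds(&[0, 3]), (0, 4));
    // Fragment 0 from the example: rows 0 and 1 deleted, rows 2 and 3 remain.
    assert_eq!(zone_bounds(&[2, 3]), (2, 2));
}
```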
@wjones127 gentle ping. I will be out starting next Tuesday for a trip; if possible, I would like to merge the fix before I'm gone.
wjones127
left a comment
I've added another test. I think this is good to go.
Test failures unrelated.
fix: handle logical rows deletion properly for zonemap and bloomfilter (lance-format#5140)

The old zonemap and bloomfilter used the logical row concept to define zones. This breaks when rows are not contiguous, e.g. after a deletion. This PR ensures that the fragment offset is taken into consideration so that the new "physical row address" handles the deletion scenario.

Both Rust and Python tests are added.

Closes lance-format#4758

Co-authored-by: Will Jones <willjones127@gmail.com>