Skip to content

fix: handle list-level NULLs in NOT filters#6044

Merged
wjones127 merged 12 commits intolance-format:mainfrom
fenfeng9:fix/label-list-null-not
Mar 24, 2026
Merged

fix: handle list-level NULLs in NOT filters#6044
wjones127 merged 12 commits intolance-format:mainfrom
fenfeng9:fix/label-list-null-not

Conversation

@fenfeng9
Copy link
Copy Markdown
Contributor

@fenfeng9 fenfeng9 commented Feb 27, 2026

closes #5904

LabelListIndex drops list-level NULL rows during unnest, so NOT filters can return NULL lists incorrectly.

Changed:

  • persists list-level NULL rows in the LabelList index and merges them back into query null sets
  • keeps older bitmap-only LabelList index layouts readable by treating missing null-set metadata as empty
  • warns when opening an older LabelList index on a nullable column, since NOT-filter semantics may still be incomplete until the index is rebuilt

@github-actions github-actions Bot added bug Something isn't working python labels Feb 27, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 27, 2026

@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Mar 2, 2026

@wjones127 @westonpace Could you please review this when you have a moment? Thanks!

Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. I think the one thing I would change is put the nulls in a global buffer. Here's an example of how we do something similar for IVF indexes:

// Add IVF protobuf as a global buffer and reference via IVF_METADATA_KEY.
let pos = v2_writer.add_global_buffer(ivf_bytes).await?;
v2_writer.add_schema_metadata(IVF_METADATA_KEY, pos.to_string());

I was going to say we should also add more compat tests, but I think we are already well covered there. You can see the warnings are working in our label list compat test:

https://github.com/lance-format/lance/actions/runs/22492208650/job/65160507370?pr=6044#step:6:117

That is defined here:

@compat_test(min_version="0.22.0")
class BitmapLabelListIndex(UpgradeDowngradeTest):
"""Test BITMAP and LABEL_LIST scalar index compatibility (introduced in 0.20.0).
Started fully working in 0.22.0 with fixes to LABEL_LIST index.
"""
def __init__(self, path: Path):
self.path = path
def create(self):
"""Create dataset with BITMAP and LABEL_LIST indices."""
shutil.rmtree(self.path, ignore_errors=True)
data = pa.table(
{
"idx": pa.array(range(1000)),
"bitmap": pa.array(range(1000)),
"label_list": pa.array([[f"label{i}"] for i in range(1000)]),
}
)
dataset = lance.write_dataset(data, self.path, max_rows_per_file=100)
dataset.create_scalar_index("bitmap", "BITMAP")
dataset.create_scalar_index("label_list", "LABEL_LIST")
def check_read(self):
"""Verify BITMAP and LABEL_LIST indices can be queried."""
ds = lance.dataset(self.path)
# Test BITMAP index
table = ds.to_table(filter="bitmap == 7")
assert table.num_rows == 1
assert table.column("idx").to_pylist() == [7]
# Test LABEL_LIST index
table = ds.to_table(filter="array_has_any(label_list, ['label7'])")
assert table.num_rows == 1
assert table.column("idx").to_pylist() == [7]
def check_write(self):
"""Verify can insert data and optimize indices."""
ds = lance.dataset(self.path)
data = pa.table(
{
"idx": pa.array([1000]),
"bitmap": pa.array([1000]),
"label_list": pa.array([["label1000"]]),
}
)
ds.insert(data)
ds.optimize.optimize_indices()
ds.optimize.compact_files()


pub const BITMAP_LOOKUP_NAME: &str = "bitmap_page_lookup.lance";
const LABEL_LIST_INDEX_VERSION: u32 = 0;
pub const LABEL_LIST_NULLS_NAME: &str = "label_list_nulls.lance";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are just writing a single buffer, why don't we instead use a global buffer in one of the existing index files? That will cut down number of files needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the suggestion! I’ll look into the global buffer idea and get back to you.

Copy link
Copy Markdown
Contributor Author

@fenfeng9 fenfeng9 Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at what it would take to do this on the scalar side.

The main difference from the IVF example is that the scalar index path currently goes through the IndexStore / IndexReader / IndexWriter abstraction:

pub trait IndexWriter: Send {
/// Writes a record batch into the file, returning the 0-based index of the batch in the file
///
/// E.g. if this is the third time this is called this method will return 2
async fn write_record_batch(&mut self, batch: RecordBatch) -> Result<u64>;
/// Finishes writing the file and closes the file
async fn finish(&mut self) -> Result<()>;
/// Finishes writing the file and closes the file with additional metadata
async fn finish_with_metadata(&mut self, metadata: HashMap<String, String>) -> Result<()>;
}

and that abstraction is centered around record-batch files. It does not currently expose global buffer read/write directly.

There is also a second constraint here: LabelList currently reuses the bitmap index train/update/remap paths, so moving the null map into the same index file would likely require some refactoring in that bitmap-backed flow as well.

let data = unnest_chunks(data)?;
let bitmap_plugin = BitmapIndexPlugin;
bitmap_plugin
.train_index(data, index_store, request, fragment_ids, progress)
.await?;
Ok(CreatedIndex {
index_details: prost_types::Any::from_msg(&pbold::LabelListIndexDetails::default())
.unwrap(),
index_version: LABEL_LIST_INDEX_VERSION,
})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move this to a global buffer, but on the scalar side it's a bit more than just moving the null map. We'd probably need to add global-buffer read/write support to the current scalar reader/writer path, and also adjust the bitmap-backed LabelList train/update/remap flow so the null map can be written into the same index file.
So before I go further with that, I wanted to check with you whether you'd like me to also convert LabelList to a global-buffer-based layout in this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would think adding something like:

pub trait IndexWriter: Send { 
     async fn add_global_buffer(&mut self, data: Bytes) -> Result<()>;
}

Would make sense. @westonpace do you have any objection?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would think adding something like:

pub trait IndexWriter: Send { 
     async fn add_global_buffer(&mut self, data: Bytes) -> Result<()>;
}

Thanks for the suggestion! I'll implement this and update the PR soon.

@wjones127 wjones127 self-assigned this Mar 2, 2026
@fenfeng9 fenfeng9 requested a review from wjones127 March 13, 2026 13:24
@fenfeng9 fenfeng9 force-pushed the fix/label-list-null-not branch from 77dc90c to 8ebd55d Compare March 16, 2026 08:33
@fenfeng9
Copy link
Copy Markdown
Contributor Author

I updated the PR in that direction.

I added global buffer support to the scalar reader/writer path, and LabelList now stores list-level nulls in the bitmap file instead of a separate file. The buffer index is recorded in metadata and read back from there.

Comment thread rust/lance-index/src/scalar/label_list.rs Outdated
@wjones127 wjones127 merged commit 8b4a926 into lance-format:main Mar 24, 2026
30 checks passed
westonpace pushed a commit that referenced this pull request Mar 24, 2026
closes #5904

`LabelListIndex` drops list-level NULL rows during unnest, so `NOT`
filters can return NULL lists incorrectly.

Changed:
- persists list-level NULL rows in the LabelList index and merges them
back into query null sets
- keeps older bitmap-only LabelList index layouts readable by treating
missing null-set metadata as empty
- warns when opening an older LabelList index on a nullable column,
since NOT-filter semantics may still be incomplete until the index is
rebuilt

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 29, 2026
closes lance-format#5904

`LabelListIndex` drops list-level NULL rows during unnest, so `NOT`
filters can return NULL lists incorrectly.

Changed:
- persists list-level NULL rows in the LabelList index and merges them
back into query null sets
- keeps older bitmap-only LabelList index layouts readable by treating
missing null-set metadata as empty
- warns when opening an older LabelList index on a nullable column,
since NOT-filter semantics may still be incomplete until the index is
rebuilt

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LabelListIndex: NOT filters mis-handle NULL lists (list-level NULLs)

2 participants