In the btree index's remap function we remove deleted rows from the page data file, but we do not update the page lookup. The assumption was that if we are only changing row ids, the lookup is still valid. However, the lookup records the min and max value for each page. If we remove deleted rows, we change how rows are distributed across pages, and the mins and maxes need to be updated as well.
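To make the failure mode concrete, here is a minimal sketch using a hypothetical, simplified page lookup (pages plus a `(min, max)` table); Lance's real data structures differ, so treat the names and layout here as assumptions for illustration only:

```python
# Hypothetical simplified model of a btree page lookup (not Lance's
# actual code): each page holds sorted values, and the lookup records
# the (min, max) of each page so queries can pick a page by range.
pages = [
    list(range(0, 5000)),
    list(range(5000, 10000)),
    list(range(10000, 15000)),
]
lookup = [(p[0], p[-1]) for p in pages]  # (min, max) per page

def find_page(lookup, value):
    """Return the index of the page whose [min, max] range covers value."""
    for i, (lo, hi) in enumerate(lookup):
        if lo <= value <= hi:
            return i
    return None

# Simulate the buggy remap: rows 1001..9999 are dropped and the pages
# are rewritten, but the lookup is left untouched.
remaining = [v for p in pages for v in p if not (1000 < v < 10000)]
pages = [remaining[:5000], remaining[5000:]]  # only two pages remain

# The stale lookup still routes value 10000 to page index 2, which no
# longer exists.
stale_idx = find_page(lookup, 10000)
print(stale_idx, len(pages))  # prints "2 2": stale page index vs. actual page count
```

The stale index is out of bounds for the rewritten page list, which mirrors the "lookup says 3rd page but there are only 2 pages" failure in the test below.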
Example test case that fails:
```python
from pathlib import Path

import lance
import pyarrow as pa


def test_btree_remap_big_deletions(tmp_path: Path):
    # Write 15K rows in 3 fragments
    ds = lance.write_dataset(pa.table({"a": range(5000)}), tmp_path)
    ds = lance.write_dataset(
        pa.table({"a": range(5000, 10000)}), tmp_path, mode="append"
    )
    ds = lance.write_dataset(
        pa.table({"a": range(10000, 15000)}), tmp_path, mode="append"
    )
    # Create index (will have 4 pages)
    ds.create_scalar_index("a", index_type="BTREE")
    # Delete a lot of data (now there will only be two pages worth)
    ds.delete("a > 1000 AND a < 10000")
    # Run compaction (deletions will be materialized)
    ds.optimize.compact_files()
    # Reload dataset and ensure index still works
    ds = lance.dataset(tmp_path)
    # This fails today when it hits 10000. The lookup says we would find
    # this value in the 3rd page but there are only 2 pages.
    # (Note: the dataset holds range(10000, 15000), so 14999 is the
    # largest present value; 15000 would never match.)
    for idx in [0, 500, 1000, 10000, 13000, 14000, 14999]:
        assert ds.to_table(filter=f"a = {idx}").num_rows == 1
    for idx in [1001, 5000, 8000, 9999]:
        assert ds.to_table(filter=f"a = {idx}").num_rows == 0
```
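One direction a fix could take, sketched here as a hypothetical helper (the function name and page representation are assumptions, not Lance's API): after remap rewrites the pages, recompute each page's min and max from the rewritten data instead of reusing the old lookup.

```python
# Hypothetical sketch: rebuild the page lookup from the rewritten pages
# so each (min, max) entry matches the page it describes after remap.
def rebuild_lookup(pages):
    # Pages are assumed to hold their values in sorted order, so the
    # first and last elements are the page's min and max.
    return [(page[0], page[-1]) for page in pages if page]

# Pages as they might look after the deletions are materialized.
pages = [[0, 1, 2, 1000], [10000, 14999]]
lookup = rebuild_lookup(pages)
print(lookup)  # prints "[(0, 1000), (10000, 14999)]"
```

With the lookup rebuilt alongside the page data, the number of lookup entries always matches the number of pages, so a query can never be routed to a page that no longer exists.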