In the btree index's remap function we remove deleted rows from the page data file, but we do not update the page lookup. The assumption was that if we are only changing row ids, the lookup is still valid. However, the lookup records the min and max value for each page. If we remove deleted rows, we change how rows are distributed across pages, and the mins and maxes need to be updated as well.
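To make the failure mode concrete, here is a minimal sketch using a hypothetical, simplified page lookup (pages plus a `(min, max)` table); Lance's real data structures differ, so treat the names and layout here as assumptions for illustration only:

```python
# Hypothetical simplified model of a btree page lookup (not Lance's
# actual code): each page holds sorted values, and the lookup records
# the (min, max) of each page so queries can pick a page by range.
pages = [
    list(range(0, 5000)),
    list(range(5000, 10000)),
    list(range(10000, 15000)),
]
lookup = [(p[0], p[-1]) for p in pages]  # (min, max) per page

def find_page(lookup, value):
    """Return the index of the page whose [min, max] range covers value."""
    for i, (lo, hi) in enumerate(lookup):
        if lo <= value <= hi:
            return i
    return None

# Simulate the buggy remap: rows 1001..9999 are dropped and the pages
# are rewritten, but the lookup is left untouched.
remaining = [v for p in pages for v in p if not (1000 < v < 10000)]
pages = [remaining[:5000], remaining[5000:]]  # only two pages remain

# The stale lookup still routes value 10000 to page index 2, which no
# longer exists.
stale_idx = find_page(lookup, 10000)
print(stale_idx, len(pages))  # prints "2 2": stale page index vs. actual page count
```

The stale index is out of bounds for the rewritten page list, which mirrors the "lookup says 3rd page but there are only 2 pages" failure in the test below.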
Example test case that fails:
```python
from pathlib import Path

import lance
import pyarrow as pa


def test_btree_remap_big_deletions(tmp_path: Path):
    # Write 15K rows in 3 fragments
    ds = lance.write_dataset(pa.table({"a": range(5000)}), tmp_path)
    ds = lance.write_dataset(
        pa.table({"a": range(5000, 10000)}), tmp_path, mode="append"
    )
    ds = lance.write_dataset(
        pa.table({"a": range(10000, 15000)}), tmp_path, mode="append"
    )
    # Create index (will have 4 pages)
    ds.create_scalar_index("a", index_type="BTREE")
    # Delete a lot of data (now there will only be two pages worth)
    ds.delete("a > 1000 AND a < 10000")
    # Run compaction (deletions will be materialized)
    ds.optimize.compact_files()
    # Reload dataset and ensure index still works
    ds = lance.dataset(tmp_path)
    # This fails today when it hits 10000. The lookup says we would find
    # this value in the 3rd page but there are only 2 pages.
    # (Note: the dataset holds range(10000, 15000), so 14999 is the
    # largest present value; 15000 would never match.)
    for idx in [0, 500, 1000, 10000, 13000, 14000, 14999]:
        assert ds.to_table(filter=f"a = {idx}").num_rows == 1
    for idx in [1001, 5000, 8000, 9999]:
        assert ds.to_table(filter=f"a = {idx}").num_rows == 0
```
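One direction a fix could take, sketched here as a hypothetical helper (the function name and page representation are assumptions, not Lance's API): after remap rewrites the pages, recompute each page's min and max from the rewritten data instead of reusing the old lookup.

```python
# Hypothetical sketch: rebuild the page lookup from the rewritten pages
# so each (min, max) entry matches the page it describes after remap.
def rebuild_lookup(pages):
    # Pages are assumed to hold their values in sorted order, so the
    # first and last elements are the page's min and max.
    return [(page[0], page[-1]) for page in pages if page]

# Pages as they might look after the deletions are materialized.
pages = [[0, 1, 2, 1000], [10000, 14999]]
lookup = rebuild_lookup(pages)
print(lookup)  # prints "[(0, 1000), (10000, 14999)]"
```

With the lookup rebuilt alongside the page data, the number of lookup entries always matches the number of pages, so a query can never be routed to a page that no longer exists.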