to_batches with non-null filter fails if vector index is present #2987

@jacketsj

Description

I get the error:

Io error: Execution error: Query Execution error: Execution error: Query Execution error: Execution error: Not found: path/to/temp_ivfpq_to_batches.lance/_indices/97975599-23c9-4c5b-8eb4-730a4de34cb0/page_lookup.lance, /home/runner/work/lance/lance/rust/lance-io/src/local.rs:100:31, /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/task/poll.rs:288:44, /home/runner/work/lance/lance/rust/lance/src/io/exec/take.rs:65:42

(looks like lots of error catching and re-throwing, oh boy)

Simple repro:

import lance
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

ds_uri = "temp_ivfpq_to_batches.lance"

# Generate data
dims = 4
nrows = 500

def next_batch(batch_size, offset):
    values = pc.random(dims * batch_size).cast('float32')
    return pa.table({
        'id': pa.array([offset + j for j in range(batch_size)]),
        'vector': pa.FixedSizeListArray.from_arrays(values, dims),
    }).to_batches()[0]

def batch_iter(num_rows):
    i = 0
    while i < num_rows:
        batch_size = min(10_000, num_rows - i)
        yield next_batch(batch_size, i)
        i += batch_size

schema = next_batch(1, 0).schema

ds = lance.write_dataset(batch_iter(nrows), ds_uri, schema=schema, mode="overwrite")

# No vector index yet, so this will not crash:
next(iter(ds.to_batches(filter="vector is not null")))

# Create index
metric = "L2"
index_type = "IVF_PQ"
num_partitions = 256
num_sub_vectors = 2
column = "vector"

ds.create_index(
    column=[column],
    metric=metric,
    index_type=index_type,
    num_partitions=num_partitions,
    num_sub_vectors=num_sub_vectors,
)

# Crash occurs now that vector index is present
next(iter(ds.to_batches(filter="vector is not null")))

This reproduces on last week's release, which is up to date with all merged PRs at the time of writing.

This also causes an error when building the index a second time with an accelerator enabled, since that path applies the same filter. I originally ran into it while doing so over S3, but S3 turns out to be unrelated.

Labels

bug (Something isn't working)