to_batches with non-null filter fails if vector index is present #2987

@jacketsj

Description

I get the error:

Io error: Execution error: Query Execution error: Execution error: Query Execution error: Execution error: Not found: path/to/temp_ivfpq_to_batches.lance/_indices/97975599-23c9-4c5b-8eb4-730a4de34cb0/page_lookup.lance, /home/runner/work/lance/lance/rust/lance-io/src/local.rs:100:31, /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/task/poll.rs:288:44, /home/runner/work/lance/lance/rust/lance/src/io/exec/take.rs:65:42

(looks like lots of error catching and re-throwing, oh boy)

Simple repro:

import lance
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

ds_uri = "temp_ivfpq_to_batches.lance"

# Generate data
dims = 4
nrows = 500

def next_batch(batch_size, offset):
    values = pc.random(dims * batch_size).cast('float32')
    return pa.table({
        'id': pa.array([offset + j for j in range(batch_size)]),
        'vector': pa.FixedSizeListArray.from_arrays(values, dims),
    }).to_batches()[0]

def batch_iter(num_rows):
    i = 0
    while i < num_rows:
        batch_size = min(10_000, num_rows - i)
        yield next_batch(batch_size, i)
        i += batch_size

schema = next_batch(1, 0).schema

ds = lance.write_dataset(batch_iter(nrows), ds_uri, schema=schema, mode="overwrite")

# No vector index yet, so this will not crash:
next(iter(ds.to_batches(filter="vector is not null")))

# Create index
metric = "L2"
index_type = "IVF_PQ"
num_partitions = 256
num_sub_vectors = 2
column = "vector"

ds.create_index(
    column=[column],
    metric=metric,
    index_type=index_type,
    num_partitions=num_partitions,
    num_sub_vectors=num_sub_vectors,
)

# Crash occurs now that vector index is present
next(iter(ds.to_batches(filter="vector is not null")))

This reproduces on last week's release, which is up to date with all merged PRs at the time of writing.

This also causes an error when building the index a second time with an accelerator enabled, since that path applies the same filter. I originally ran into it while doing so over S3, but S3 turns out to be unrelated.

Labels

bug (Something isn't working)