Re-parameterize IVF index based on target partition size

We'd like users to be able to set their desired index configuration before adding any data, and to have that configuration remain sensible as they add or delete rows. Currently, users set a fixed `num_partitions` value, which is only appropriate within a certain range of row counts.

Instead, we should have them set a `target_partition_size` parameter, which scales appropriately as the dataset size changes. `optimize_indices` should automatically retrain the IVF index when `num_partitions` has drifted far enough from the ideal / requested value.
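As a rough illustration of the drift check, here is a minimal sketch. The function name `needs_retrain` and the `drift_tolerance` parameter (with its default of `2.0`) are hypothetical and not part of the existing API; the actual threshold for "far enough" is still to be decided.

```python
def needs_retrain(
    current_partitions: int,
    ideal_partitions: int,
    drift_tolerance: float = 2.0,  # hypothetical default, not a decided value
) -> bool:
    """Return True when the trained partition count is off from the ideal
    count by more than a factor of `drift_tolerance` in either direction."""
    if current_partitions <= 0 or ideal_partitions <= 0:
        raise ValueError("partition counts must be positive")
    if drift_tolerance < 1.0:
        raise ValueError("drift_tolerance must be >= 1.0")

    ratio = current_partitions / ideal_partitions
    # Too many partitions (ratio high) or too few (ratio low) both count as drift.
    return ratio > drift_tolerance or ratio < 1.0 / drift_tolerance


# An index trained with 244 partitions whose ideal count is now 1000
# has drifted by roughly 4x, so it would be retrained under this sketch.
print(needs_retrain(244, 1000))
print(needs_retrain(244, 300))
```

A ratio-based check treats growth and shrinkage symmetrically, which seems to match the "add or delete rows" requirement, but an absolute-difference threshold would also be a reasonable choice.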

![Image](https://github.com/user-attachments/assets/f04e7026-fb14-40a9-a22a-39ed336e7202)

The default for `target_partition_size` should be `4096`. That works well for datasets with fewer than 10 million rows. Beyond that, `num_rows / target_partition_size` alone starts to produce too many partitions, so the count also needs to be capped:

```
num_partitions = min( num_rows / target_partition_size, sqrt(num_rows), max_partitions )
```
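To make the formula concrete, here is a minimal Python sketch. The function name `ideal_num_partitions` is hypothetical, and the rounding behavior (floor division, integer square root, and a lower bound of one partition) is an assumption, since the formula above does not specify it. `max_partitions` is left as a required argument because no default has been proposed.

```python
import math


def ideal_num_partitions(
    num_rows: int,
    max_partitions: int,
    target_partition_size: int = 4096,
) -> int:
    """Compute min(num_rows / target_partition_size, sqrt(num_rows), max_partitions).

    Assumptions (not specified in the issue): results are floored, and at
    least one partition is always returned.
    """
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    if target_partition_size <= 0 or max_partitions <= 0:
        raise ValueError("target_partition_size and max_partitions must be positive")

    by_target_size = num_rows // target_partition_size
    by_sqrt = math.isqrt(num_rows)
    return max(1, min(by_target_size, by_sqrt, max_partitions))


# 1M rows: the target-size term wins (1_000_000 // 4096 = 244).
print(ideal_num_partitions(1_000_000, max_partitions=50_000))
# 100M rows: the sqrt term wins (isqrt(100_000_000) = 10_000).
print(ideal_num_partitions(100_000_000, max_partitions=50_000))
```

With the default of `4096`, the two terms cross at `num_rows = 4096 ** 2` (about 16.8 million rows), which is where the `sqrt` cap starts to limit the partition count.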

## Related issues

* [ ] https://github.com/lancedb/lance/issues/4034
* [ ] https://github.com/lancedb/lance/issues/3674
* [x] https://github.com/lancedb/lance/issues/3940

Re-parameterize IVF index based on target partition size #4164

Related issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Re-parameterize IVF index based on target partition size #4164

Description

Related issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions