Re-parameterize IVF index based on target partition size #4164

@wjones127

We'd like to support the experience where users can set their desired index configuration before adding data, and have that configuration remain sensible even as they add or delete rows. Currently, users set a fixed num_partitions value, which is only helpful within a certain range of row counts.

Instead, we should have them set a target_partition_size parameter, which can scale appropriately as their dataset size changes. optimize_indices should automatically handle retraining IVF when num_partitions has drifted far enough from the ideal / requested value.
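
To make the drift condition concrete, here is a minimal sketch of one way it could be checked. The function name `needs_retrain`, the `max_drift_ratio` parameter, and its default of 2.0 are all hypothetical; this issue does not specify how far is "far enough".

```python
def needs_retrain(
    current_partitions: int,
    ideal_partitions: int,
    max_drift_ratio: float = 2.0,  # hypothetical threshold, not from the issue
) -> bool:
    """Return True when the current partition count is off from the ideal
    by at least max_drift_ratio in either direction (too few or too many)."""
    if current_partitions <= 0 or ideal_partitions <= 0:
        raise ValueError("partition counts must be positive")
    larger = max(current_partitions, ideal_partitions)
    smaller = min(current_partitions, ideal_partitions)
    return larger / smaller >= max_drift_ratio
```

A symmetric ratio is used so that both growth (too few partitions) and large deletions (too many partitions) can trigger a retrain; an actual implementation might prefer different thresholds for each direction.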

The default for target_partition_size should be 4096. That works well for datasets with fewer than 10 million rows. Beyond that, dividing the row count by a fixed partition size starts to produce too many partitions, so the calculation needs to be:

num_partitions = min( num_rows / target_partition_size, sqrt(num_rows), max_partitions )
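
As a sketch, the calculation could look like the following in Python. The function name, the rounding down to an integer, the floor of one partition, and the `max_partitions` default are assumptions for illustration; only the `min(...)` formula and the 4096 default come from this issue.

```python
import math


def ideal_num_partitions(
    num_rows: int,
    target_partition_size: int = 4096,
    max_partitions: int = 65536,  # placeholder cap; the issue does not give a value
) -> int:
    """Compute min(num_rows / target_partition_size, sqrt(num_rows), max_partitions)."""
    if num_rows <= 0:
        return 1  # assumption: always keep at least one partition
    by_size = num_rows / target_partition_size
    by_sqrt = math.sqrt(num_rows)
    # Round down and clamp to at least 1 (both are assumptions, not in the issue).
    return max(1, int(min(by_size, by_sqrt, max_partitions)))


for rows in (100, 1_000_000, 100_000_000):
    print(rows, ideal_num_partitions(rows))
```

With the default target size, the `sqrt(num_rows)` term becomes the smaller of the two once `num_rows` exceeds 4096² (about 16.8 million rows), which is where the cap on partition growth takes effect.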

Labels: enhancement (New feature or request)