Top-level indices part 2: empty vector indices #4034

@wjones127

Follow up to #3940

We want to make it possible to keep an index even when there isn't enough data to train it. An earlier PR adds a train: bool parameter that lets the user defer training. With train=False, we create an empty index that reports num_indexed_rows: 0 in stats. The user can later call optimize_indices() to do the actual training.

Vector indices are more complex because of IVF partitions: we need a minimum amount of data to create them.

Here is the behavior I think we want:

  1. You can pass train=False to create an empty index. No IVF or PQ is trained, and the metadata is saved as an empty array.
  2. If you call create_index(..., train=True) or optimize_indices():
    • number of non-null vectors < 256 -> same as train=False
    • 256 <= number of non-null vectors < num_partitions * 256 -> train on a smaller number of partitions
    • number of non-null vectors >= num_partitions * 256 -> train the full index
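
To make the thresholds concrete, here is a rough sketch of the decision in Python. It is only an illustration, not Lance code: TrainingPlan and plan_training are hypothetical names, and the issue does not say how many partitions the "smaller number" should be, so the sketch assumes one partition per 256 non-null vectors, rounded down.

```python
from dataclasses import dataclass

# Threshold taken from the issue text: 256 non-null vectors per partition.
MIN_VECTORS_PER_PARTITION = 256


@dataclass(frozen=True)
class TrainingPlan:
    """Hypothetical result type for illustration; not part of the Lance API."""
    mode: str            # "empty", "reduced", or "full"
    num_partitions: int  # partitions to train (0 for an empty index)


def plan_training(num_non_null: int, num_partitions: int, train: bool = True) -> TrainingPlan:
    """Pick a training mode from the non-null vector count."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    # train=False, or too little data for even one partition: empty index.
    if not train or num_non_null < MIN_VECTORS_PER_PARTITION:
        return TrainingPlan("empty", 0)

    # Enough data for some partitions, but not all that were requested.
    if num_non_null < num_partitions * MIN_VECTORS_PER_PARTITION:
        # ASSUMPTION: one partition per 256 vectors, rounded down.
        # The issue only says "smaller number of partitions".
        return TrainingPlan("reduced", num_non_null // MIN_VECTORS_PER_PARTITION)

    # Enough data to train every requested partition.
    return TrainingPlan("full", num_partitions)
```

With num_partitions=16, the full index needs at least 4096 non-null vectors. 1000 vectors would fall into the reduced case, and under the rounding assumption above that means 3 partitions.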
