Inverted indexes are currently built around the concept of partitions. Training may create several partitions. When new data is added, we build a partition from it and add that partition to the index. At search time we search the partitions in parallel, normalize the scores, and then take the top-k results.
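The search flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the partition layout (a dict of row id to text), the term-count scoring, and the divide-by-max normalization are all assumptions made for the example, and the names `search_partition`, `normalize`, and `search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, query):
    """Score one partition. Hypothetical stand-in: count query-term occurrences."""
    terms = query.split()
    hits = {}
    for row_id, text in partition.items():
        tokens = text.split()
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            hits[row_id] = float(score)
    return hits


def normalize(hits):
    """Assumed normalization: divide by the best score in the partition."""
    if not hits:
        return {}
    best = max(hits.values())
    return {row_id: score / best for row_id, score in hits.items()}


def search(partitions, query, k):
    """Search every partition in parallel, normalize, then merge into a top-k."""
    with ThreadPoolExecutor() as pool:
        per_partition = list(
            pool.map(lambda p: normalize(search_partition(p, query)), partitions)
        )
    merged = [
        (score, row_id)
        for hits in per_partition
        for row_id, score in hits.items()
    ]
    # nlargest keeps only the k best (score, row_id) pairs across all partitions.
    return [(row_id, score) for score, row_id in heapq.nlargest(k, merged)]


partitions = [
    {1: "a b a", 2: "a b"},  # partition created during training
    {3: "a a a", 4: "b"},    # partition created for newly added data
]
results = search(partitions, "a", k=2)
```

Nothing in this flow depends on the partitions living in a single index file, which is why the same search path should carry over when each partition becomes a segment.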
However, because partitions are not segments, they cannot stand on their own: we cannot create a new partition and commit it by itself. Instead, we have to copy all the existing partitions and then add the new one, which causes a lot of write amplification during training. In addition, we have had to add special handling such as deleted_fragments, because partitions do not have their own fragment bitmap, do not get remapped, and so on. Finally, it would be difficult to build a distributed search architecture that works similarly for vector search and full text search.
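To make the write amplification concrete, here is a simplified counting model. It assumes every commit rewrites each existing partition exactly once and that all partitions are the same size; both function names are hypothetical and the numbers are illustrative, not measurements.

```python
def partitions_written_with_copy(num_commits):
    """Each commit copies all existing partitions, then writes one new one."""
    total_written = 0
    existing = 0
    for _ in range(num_commits):
        total_written += existing + 1  # copy the old partitions, add the new one
        existing += 1
    return total_written


def partitions_written_standalone(num_commits):
    """Each commit writes only the new, independently committable unit."""
    return num_commits


commits = 10
copied = partitions_written_with_copy(commits)       # grows quadratically
standalone = partitions_written_standalone(commits)  # grows linearly
amplification = copied / standalone
```

Under these assumptions the copy-based scheme writes n(n+1)/2 partitions over n commits, while independently committable units write n.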
As a result, we should migrate from the "partitions" concept to the already established "segments" concept that exists in the table format today. Each partition can become a segment. This keeps all the benefits of partitions without the downsides.
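A rough sketch of what "each partition is a segment" could look like is shown below. The `Segment` and `IndexManifest` types and their fields are hypothetical and do not reflect the table format's actual structures; the only point is that a segment carries its own fragment coverage and can be committed without touching existing segments.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Hypothetical index segment: one former partition plus its own coverage."""
    name: str
    fragment_ids: frozenset  # fragments whose rows this segment indexes


@dataclass
class IndexManifest:
    """Hypothetical list of committed segments for one index."""
    segments: list = field(default_factory=list)

    def commit_segment(self, segment):
        # Only the new segment is written; existing segments are left untouched.
        self.segments.append(segment)

    def covered_fragments(self):
        covered = set()
        for segment in self.segments:
            covered |= segment.fragment_ids
        return covered

    def unindexed_fragments(self, table_fragments):
        # Fragments present in the table but not covered by any segment.
        return set(table_fragments) - self.covered_fragments()


manifest = IndexManifest()
manifest.commit_segment(Segment("trained", frozenset({0, 1})))
manifest.commit_segment(Segment("appended", frozenset({2})))
```

Because coverage is tracked per segment in this sketch, a question like "which fragments still need indexing" can be answered without index-wide special cases.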