fix: inconsistent transposed pq code and metadata when build ivf_pq index distributedly #5834
Code Review Summary

Overview

This PR addresses low recall performance for IVF_PQ in distributed mode by ensuring consistent row ordering between single-machine and distributed builds. The core fix enforces stable sorting of batches by ROW_ID and adjusts the transpose/storage semantics for PQ codes during the distributed merge phase.

P0/P1 Issues

1. Potential correctness issue with transpose logic (P1)

In This appears to be intentional double-processing to ensure consistent format, but:
2. Skip logic in write_pq_partitions may silently drop data (P1)

In

```rust
if num_vectors == 0 || pq_code.len() == 0 {
    continue;
}
if pq_code.len() % num_vectors != 0 {
    continue;
}
```

The second

Suggestions (Non-blocking)
Testing

The new test adequately covers the fix by comparing single-build vs distributed-build results. The test verifies both structural equality (IVF layout, partition sizes) and functional equality (Top-K query results).

Overall the approach is sound. Please address the P1 silent-skip issue and verify the transpose semantics are intentional.
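One way to address the silent-skip concern is to keep skipping genuinely empty partitions but report a length mismatch as an error instead of `continue`. The following is only a minimal sketch: `codes_per_vector` is a hypothetical helper, not part of Lance, and the `String` error type stands in for whatever error type the crate actually uses.

```rust
/// Hypothetical helper (not Lance's API): decide what to do with one partition.
///
/// Returns `Ok(None)` when the partition is empty and can be skipped safely,
/// `Ok(Some(n))` with the number of code bytes per vector when the buffer is
/// well-formed, and `Err(..)` when the buffer length is not a multiple of the
/// vector count, so the caller cannot drop data without noticing.
fn codes_per_vector(num_vectors: usize, pq_code_len: usize) -> Result<Option<usize>, String> {
    // Empty partitions carry no data, so skipping them loses nothing.
    if num_vectors == 0 || pq_code_len == 0 {
        return Ok(None);
    }
    // A non-divisible length means the buffer is malformed; surface it.
    if pq_code_len % num_vectors != 0 {
        return Err(format!(
            "PQ code length {} is not a multiple of num_vectors {}",
            pq_code_len, num_vectors
        ));
    }
    Ok(Some(pq_code_len / num_vectors))
}
```

The caller would then `continue` only on `Ok(None)` and propagate the `Err` case, which turns a silent recall regression into a visible build failure.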
I found that the recall is lower than expected. Will fix it.
After the refactor, I tested again and confirmed that the recall has recovered.
…ndex distributedly (lance-format#5834)

There is a bug when building the ivf_pq index in distributed mode. When building the partial ivf_pq indices, the PQ codes are transposed for some fragments. But in the final merge step, not all of the PQ codes are transposed, yet the metadata is marked as `transposed`. As a result, the vector search logic reads inconsistent information.

The solution is:

* add a configuration option that controls transposing;
* for the distributed vector build, the partial indices are not transposed;
* in the merge step, we do the transposing once at the end, that is, we transform all the PQ codes from row-based to column-based.
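The merge-time transposition described above can be illustrated with a small sketch. It assumes 8-bit PQ codes stored one byte per sub-vector; `PqPartition`, `transpose_pq_codes`, and `merge_partials` are hypothetical names chosen for illustration and are not Lance's actual types or functions.

```rust
/// Hypothetical container for PQ codes plus the layout flag kept in metadata.
#[derive(Debug, PartialEq)]
struct PqPartition {
    codes: Vec<u8>,
    num_vectors: usize,
    transposed: bool,
}

/// Convert row-based codes (all bytes of vector 0, then vector 1, ...) into
/// column-based codes (sub-vector 0 of every vector, then sub-vector 1, ...).
/// Returns `None` if the buffer length does not match the stated shape.
fn transpose_pq_codes(codes: &[u8], num_vectors: usize, num_sub_vectors: usize) -> Option<Vec<u8>> {
    if codes.len() != num_vectors.checked_mul(num_sub_vectors)? {
        return None;
    }
    let mut out = vec![0u8; codes.len()];
    for row in 0..num_vectors {
        for col in 0..num_sub_vectors {
            out[col * num_vectors + row] = codes[row * num_sub_vectors + col];
        }
    }
    Some(out)
}

/// Merge partial results that were all left row-based, then transpose once,
/// so the codes and the `transposed` flag always agree.
fn merge_partials(partials: &[PqPartition], num_sub_vectors: usize) -> Option<PqPartition> {
    let mut rows = Vec::new();
    let mut num_vectors = 0usize;
    for p in partials {
        // A partial that is already transposed would mix the two layouts.
        if p.transposed || p.codes.len() != p.num_vectors.checked_mul(num_sub_vectors)? {
            return None;
        }
        rows.extend_from_slice(&p.codes);
        num_vectors += p.num_vectors;
    }
    let codes = transpose_pq_codes(&rows, num_vectors, num_sub_vectors)?;
    Some(PqPartition { codes, num_vectors, transposed: true })
}
```

Rejecting an already-transposed partial in the merge step is a design choice of this sketch: it makes the mixed-layout situation from the bug report fail loudly instead of producing codes that disagree with the metadata.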