Skip to content

fix: inconsistent transposed pq code and metadata when build ivf_pq index distributedly#5834

Merged
yanghua merged 10 commits intolance-format:mainfrom
yanghua:fix-low-recall-ivfpq-new
Jan 31, 2026
Merged

fix: inconsistent transposed pq code and metadata when build ivf_pq index distributedly#5834
yanghua merged 10 commits intolance-format:mainfrom
yanghua:fix-low-recall-ivfpq-new

Conversation

@yanghua
Copy link
Copy Markdown
Collaborator

@yanghua yanghua commented Jan 28, 2026

There is a bug that happens when building the ivf_pq index distributedly.

That is when building partial ivf_pq, it transposes the pq code for some fragments. But in the final merge step, all the pq codes are not transposed, and the metadata is marked as transposed. So the logic of vectors search reads the inconsistent information.

Here the solution is:

  • add a configuration about transposing;
  • for the distributed vector build, the partial indices are not transposed;
  • in the merge step, we finally do the trapsposing that means we transform all the PQ codes from row-based to column-based;

@github-actions github-actions Bot added the bug Something isn't working label Jan 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Code Review Summary

Overview

This PR addresses low recall performance for IVF_PQ in distributed mode by ensuring consistent row ordering between single-machine and distributed builds. The core fix enforces stable sorting of batches by ROW_ID and adjusts the transpose/storage semantics for PQ codes during the distributed merge phase.

P0/P1 Issues

1. Potential correctness issue with transpose logic (P1)

In builder.rs, around the new merge logic (lines 1056-1082 in the diff), the PQ codes are transposed back to per-row layout, then sorted, but the metadata flag is set to transposed: !is_pq (i.e., false for PQ). This means the reader will transpose them again on load (see ProductQuantizationStorage::new at lines 220-237).

This appears to be intentional double-processing to ensure consistent format, but:

  • The code comments should explicitly document this contract: "PQ codes are stored in per-row layout, reader will transpose to columnar format"
  • Please verify this matches the existing single-machine build path behavior

2. Skip logic in write_pq_partitions may silently drop data (P1)

In ivf/io.rs:

if num_vectors == 0 || pq_code.len() == 0 {
    continue;
}
if pq_code.len() % num_vectors != 0 {
    continue;
}

The second continue silently skips partitions where pq_code.len() % num_vectors != 0. This could indicate corrupted data or a bug elsewhere. Consider logging a warning or returning an error rather than silently skipping.

Suggestions (Non-blocking)

  1. The new stable_sort_batch_by_row_id function uses Vec::sort_by which is O(n log n). For the distributed merge case, if batches are already partially sorted (e.g., from different fragments), consider documenting whether a merge-sort approach would be more efficient for future optimization.

  2. The test test_ivfpq_two_fragments_single_vs_split_layout_and_rowids is comprehensive and well-structured. Consider adding a brief doc comment explaining what scenario it validates.

Testing

The new test adequately covers the fix by comparing single-build vs distributed-build results. The test verifies both structural equality (IVF layout, partition sizes) and functional equality (Top-K query results).


Overall the approach is sound. Please address the P1 silent-skip issue and verify the transpose semantics are intentional.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 28, 2026

@yanghua yanghua marked this pull request as ready for review January 28, 2026 13:00
Comment thread python/python/tests/test_vector_index.py Outdated
@yanghua
Copy link
Copy Markdown
Collaborator Author

yanghua commented Jan 30, 2026

I found that the recall performance is not expected. Will fix it.

@yanghua
Copy link
Copy Markdown
Collaborator Author

yanghua commented Jan 31, 2026

After the refactor, I tested that the recall has been recovered.

@yanghua yanghua changed the title fix: low recall performance for ivf_pq in distributed mode fix: inconsistent transposed pq code and metadata when build ivf_pq index distributedly Jan 31, 2026
@yanghua yanghua merged commit 8fdbd5a into lance-format:main Jan 31, 2026
32 checks passed
vivek-bharathan pushed a commit to vivek-bharathan/lance that referenced this pull request Feb 2, 2026
…ndex distributedly (lance-format#5834)

There is a bug that happens when building the ivf_pq index
distributedly.

That is when building partial ivf_pq, it transposes the pq code for some
fragments. But in the final merge step, all the pq codes are not
transposed, and the metadata is marked as `transposed`. So the logic of
vectors search reads the inconsistent information.

Here the solution is:
* add a configuration about transposing;
* for the distributed vector build, the partial indices are not
transposed;
* in the merge step, we finally do the trapsposing that means we
transform all the PQ codes from row-based to column-based;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants