refactor: distributed vector segment build#6220
Conversation
PR Review: refactor: internalize vector staging merge flowOverall: Well-structured refactor that cleanly separates planning from execution for distributed vector index finalize. Good test coverage across IVF subtypes and error cases. A few items to consider: P1:
|
5c4279d to
44062e8
Compare
|
CI is not running because our base is not |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
westonpace
left a comment
There was a problem hiding this comment.
Still trying to get up to speed I think. What's the difference between merging staging files into a segment and combining multiple smaller segments into a larger segment?
3c6d5a7 to
4938e8a
Compare
|
Thank you @westonpace for the review! I have revisited all the concepts, aligned their naming and comments, and compiled them into a document at https://github.com/lance-format/lance/blob/xuanwo/vector-staging-merge-internal/docs/src/guide/distributed_indexing.md. I hope this will be helpful for the review. |
|
This guide is great, and super helpful. I have a few more questions but we can potentially address these in a follow-up. How does a worker decide how many shards to create? Are those shards complete indexes? Why not just make them segments? Then there is no need for a staging dir. The |
0950fc8 to
1ea42de
Compare
1ea42de to
0a3f230
Compare
d43183d to
b86f91c
Compare
westonpace
left a comment
There was a problem hiding this comment.
Can you clearly document what the breaking changes are in the PR description?
Overall this seems to be a good extension to the previous single segment mechanism. It is flexible and the API makes sense.
It might be nice to see a full end-to-end example (perhaps in distributed_indexing.md) at some point.
| *, | ||
| target_partition_size: Optional[int] = None, | ||
| skip_transpose: bool = False, | ||
| require_commit: bool = True, |
| ) | ||
| return index | ||
|
|
||
| def create_index( |
There was a problem hiding this comment.
It's hard to tell how this changed from the existing API. Were there any breaking changes?
I think we should mark #6209 as a breaking change. Therefore, this PR does not necessarily need to be a breaking change now. I will update.
Yes! I'm thinking about this too, will work on a follow-up. |
## Summary This fixes a flaky regression test added in #6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 #6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on #6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.
This refactors distributed vector indexing into a staged segment-build pipeline and exposes the public APIs needed to integrate an external distributed build workflow with Lance. It defines the storage-level model for partial shards, staged planning, built segments, and segment commit, and documents the current distributed indexing flow. This PR builds on the segment commit API from #6209. The main changes are organized into five commits: - [test: cover distributed vector segment build](a86274da2) - [refactor: internalize distributed vector segment build](1e5f0e15b) - [feat: add public vector segment builder API](691cecb9a) - [feat: add Python vector segment builder API](a07ef6144) - [docs: document distributed vector segment build](0a3f230d7) Follow-up fixes: - [fix: expose python create_index_uncommitted](c1d3b1666) - [fix: format python bindings](b86f91c4d) - [fix: format python segment builder bindings](bfc9e63a0) Please review accordingly. --- Guide: https://github.com/lance-format/lance/blob/xuanwo/vector-staging-merge-internal/docs/src/guide/distributed_indexing.md
## Summary This fixes a flaky regression test added in #6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 #6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on #6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.
This tightens the new multi-segment vector index path added in #6220. It enforces disjoint fragment coverage when committing a segment set, adds regression coverage that grouped segment coverage matches the union of its source shard coverage, and verifies that remap only touches segments covering affected fragments. It also adds cleanup coverage for both replaced committed segments and stale uncommitted `_indices/<uuid>` artifacts, and documents these contracts in the distributed indexing guide.
This refactors distributed vector indexing into a staged segment-build pipeline and exposes the public APIs needed to integrate an external distributed build workflow with Lance. It defines the storage-level model for partial shards, staged planning, built segments, and segment commit, and documents the current distributed indexing flow. This PR builds on the segment commit API from lance-format#6209. The main changes are organized into five commits: - [test: cover distributed vector segment build](lance-format@a86274da2) - [refactor: internalize distributed vector segment build](lance-format@1e5f0e15b) - [feat: add public vector segment builder API](lance-format@691cecb9a) - [feat: add Python vector segment builder API](lance-format@a07ef6144) - [docs: document distributed vector segment build](lance-format@0a3f230d7) Follow-up fixes: - [fix: expose python create_index_uncommitted](lance-format@c1d3b1666) - [fix: format python bindings](lance-format@b86f91c4d) - [fix: format python segment builder bindings](lance-format@bfc9e63a0) Please review accordingly. --- Guide: https://github.com/lance-format/lance/blob/xuanwo/vector-staging-merge-internal/docs/src/guide/distributed_indexing.md
) ## Summary This fixes a flaky regression test added in lance-format#6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 lance-format#6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on lance-format#6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.
This refactors distributed vector indexing into a staged segment-build pipeline and exposes the public APIs needed to integrate an external distributed build workflow with Lance. It defines the storage-level model for partial shards, staged planning, built segments, and segment commit, and documents the current distributed indexing flow.
This PR builds on the segment commit API from #6209. The main changes are organized into five commits:
Follow-up fixes:
Please review accordingly.
Guide: https://github.com/lance-format/lance/blob/xuanwo/vector-staging-merge-internal/docs/src/guide/distributed_indexing.md