feat: support lance-ray distributed fts indexing building #45
Merged
chenghao-guo merged 3 commits into lance-format:main on Sep 16, 2025
BubbleCal pushed a commit to lance-format/lance that referenced this pull request on Sep 8, 2025
Close #4514

Related with lance-format/lance-ray#12

This PR introduces distributed FTS index capabilities, enabling parallel index creation across multiple fragments.

### New added methods

- `execute_uncommitted()`: Creates index metadata without committing it to the dataset, returning `IndexMetadata` for distributed coordination.
- `fragments()`: Chainable method specifying target fragment IDs for selective indexing, as suggested by Will Jones in discussion #4514 (comment)

```
let partial_index = CreateIndexBuilder::new(&mut dataset, &["text"], IndexType::Inverted, &params)
    .name("distributed_index".to_string())
    .fragments(vec![fragment_id])
    .fragment_uuid(shared_uuid.clone())
    .execute_uncommitted()
    .await?;
```

- `merge_index_metadata()`: Merges distributed index metadata from multiple workers into consolidated final metadata.

### Distributed Workflow

1. **Split and Parallel Phase**: The Ray head node distributes the fragments to different workers. (Fragments may be distributed evenly across workers based on fragment statistics; this will be implemented in the lance-ray connector.) Ray workers call `execute_uncommitted()` on specific fragments using the `fragments()` method. A shared UUID passed via `fragment_uuid()` ensures a consistent index identity.
2. **Merge Phase**: `merge_index_metadata()` consolidates the partition metadata files (`part_*_metadata.lance`).
3. **Commit Phase**: The final index is committed with the unified metadata.

**Example**

The workflow example can be found in `test_distribute_fts_index_build` in [test_scalar_index.py](https://github.com/lancedb/lance/pull/4578/files#diff-a95edaddaa3a260e498c04e10f073261bdc529cd4f47b928ad80274754af0548R1964-R2021).

**Following work after this PR:**

The distributed index building workflow PR will be proposed to the lance-ray connector. lance-ray draft PR: lance-format/lance-ray#45

**Other implementation details on fragment_mask**

Optional mask with the fragment_id in the high 32 bits. When provided, only partitions whose partition id matches this fragment will be included. The fragment mask is constructed as `(fragment_id as u64) << 32`.
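The fragment mask described above can be illustrated with a small Python sketch. The shift by 32 bits comes from the commit message; the helper names and the matching rule (comparing only the high 32 bits of a partition id) are assumptions made for illustration and are not taken from the lance source.

```python
FRAGMENT_SHIFT = 32  # fragment_id occupies the high 32 bits of the mask


def make_fragment_mask(fragment_id: int) -> int:
    """Python equivalent of the Rust expression `(fragment_id as u64) << 32`."""
    if not 0 <= fragment_id < 2**32:
        raise ValueError("fragment_id must fit in 32 bits")
    return fragment_id << FRAGMENT_SHIFT


def partition_matches(partition_id: int, fragment_mask: int) -> bool:
    """Assumed rule: a partition matches when its high 32 bits equal the mask."""
    high_bits = 0xFFFF_FFFF << FRAGMENT_SHIFT
    return (partition_id & high_bits) == fragment_mask


# Hypothetical partition ids: fragment id in the high bits, a local id in the low bits.
mask = make_fragment_mask(3)
partition_ids = [(3 << 32) | 0, (3 << 32) | 7, (4 << 32) | 1]
selected = [p for p in partition_ids if partition_matches(p, mask)]
```

Under this assumed rule, only the two partition ids that carry fragment 3 in their high bits end up in `selected`.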
Collaborator
@BubbleCal index expert take a look
jiaoew1991 reviewed on Sep 9, 2025
Collaborator
cc @jackye1995 PTAL, thanks!
BubbleCal reviewed on Sep 15, 2025
- Change index_type to Union[Literal[...], IndexConfig] for better type safety
- Add standard LanceDataset parameters: replace, train, fragment_ids, fragment_uuid
- Make num_workers, storage_options, ray_remote_args keyword-only arguments
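The parameter changes listed in this commit can be pictured with a hypothetical signature. This is a sketch only: the function name `create_scalar_index_stub`, the stand-in `IndexConfig` dataclass, the parameter order, and every default value are assumptions, not the actual lance-ray API. The parts taken from the commit message are the `Union[Literal[...], IndexConfig]` type, the four added parameters, and the three keyword-only arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional, Union


@dataclass
class IndexConfig:
    """Hypothetical stand-in for illustration, not the real lance class."""
    index_type: str
    parameters: Dict[str, Any] = field(default_factory=dict)


def create_scalar_index_stub(
    dataset_uri: str,
    column: str,
    index_type: Union[Literal["INVERTED"], IndexConfig],
    replace: bool = True,                      # assumed default
    train: bool = True,                        # assumed default
    fragment_ids: Optional[List[int]] = None,
    fragment_uuid: Optional[str] = None,
    *,                                         # everything below is keyword-only
    num_workers: int = 4,                      # assumed default
    storage_options: Optional[Dict[str, str]] = None,
    ray_remote_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Echo the resolved arguments; this stub performs no indexing."""
    kind = index_type.index_type if isinstance(index_type, IndexConfig) else index_type
    return {
        "dataset_uri": dataset_uri,
        "column": column,
        "index_type": kind,
        "replace": replace,
        "train": train,
        "fragment_ids": fragment_ids,
        "fragment_uuid": fragment_uuid,
        "num_workers": num_workers,
        "storage_options": storage_options or {},
        "ray_remote_args": ray_remote_args or {},
    }


# Keyword-only arguments must be named at the call site.
resolved = create_scalar_index_stub("path/to/dataset", "text", "INVERTED", num_workers=8)
```

Making the Ray-specific options keyword-only means a caller cannot accidentally pass `num_workers` in a positional slot meant for one of the dataset parameters; doing so raises a `TypeError`.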
Collaborator, Author
All issues have been addressed and the unit tests are passing. Thanks again for your review. Please let me know if you have any further suggestions. If there are no additional comments within the next day, I’ll go ahead and merge it.
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request on Jan 21, 2026
close #12

Key Improvements

New Distributed APIs in lance-ray:
- create_scalar_index(): Creates an index in a distributed fashion using Ray, with a single method call. Currently only the INVERTED type (for FTS indexes) is supported. Other types, such as BTree, are expected to be supported soon by other contributors.

Three-Phase Workflow:
- merge_index_metadata, provided by "feat: support build FTS index distributedly" (lance#4578)
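The three-phase workflow (split and parallel build, merge, commit) can be sketched in plain Python. This is a simulation under stated assumptions, not the lance-ray implementation: a thread pool stands in for Ray workers, `build_partial` stands in for a worker calling `execute_uncommitted()`, `merge_partials` stands in for `merge_index_metadata()`, and the greedy balancing by row count is just one possible reading of "distribute the fragments evenly by fragment statistics". All function names and the dictionary layout are hypothetical.

```python
import heapq
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def assign_fragments(row_counts: Dict[int, int], num_workers: int) -> List[List[int]]:
    """Greedy balancing: give the largest remaining fragment to the least-loaded worker."""
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    groups: List[List[int]] = [[] for _ in range(num_workers)]
    for frag_id, rows in sorted(row_counts.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        groups[worker].append(frag_id)
        heapq.heappush(heap, (load + rows, worker))
    return groups


def build_partial(fragment_ids: List[int], index_uuid: str) -> dict:
    """Stand-in for a worker building an uncommitted index over its fragments."""
    return {"uuid": index_uuid, "fragment_ids": sorted(fragment_ids)}


def merge_partials(partials: List[dict]) -> dict:
    """Stand-in for the merge phase: all parts must share one index UUID."""
    uuids = {p["uuid"] for p in partials}
    if len(uuids) != 1:
        raise ValueError("partial indices were built with different UUIDs")
    covered = sorted(f for p in partials for f in p["fragment_ids"])
    return {"uuid": uuids.pop(), "fragment_ids": covered}


# Phase 1: split fragments across workers and build partial indices in parallel.
row_counts = {0: 1000, 1: 200, 2: 800, 3: 400}  # hypothetical fragment statistics
shared_uuid = str(uuid.uuid4())
groups = assign_fragments(row_counts, num_workers=2)
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(lambda g: build_partial(g, shared_uuid), groups))

# Phase 2: merge the partial metadata. Phase 3 would commit `merged` to the dataset.
merged = merge_partials(partials)
```

The shared UUID is generated once by the coordinator and handed to every worker, which is what lets the merge step treat the partial results as pieces of a single index.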