Skip to content

feat: support lance-ray distributed fts indexing building#45

Merged
chenghao-guo merged 3 commits intolance-format:mainfrom
chenghao-guo:distributed-text-indexing
Sep 16, 2025
Merged

feat: support lance-ray distributed fts indexing building#45
chenghao-guo merged 3 commits intolance-format:mainfrom
chenghao-guo:distributed-text-indexing

Conversation

@chenghao-guo
Copy link
Copy Markdown
Collaborator

@chenghao-guo chenghao-guo commented Sep 4, 2025

close #12

Key Improvements

  1. New Distributed APIs in lance-ray:

    • create_scalar_index() - Distributedly create index index using ray with a single method. Currently only support INVERTED type for FTS index. Other types like Btree shall be supported soon by other contributors.
  2. Three-Phase Workflow:

    • Split and Parallel Phase: Distribute dataset fragments to different Ray worker nodes, try to balance the fragments based on the number of rows. Each worker node builds indices for its assigned fragments as a partition.
    • Merge Phase: Collect and merge partition index metadata from all worker nodes by calling the new method merge_index_metadata by this feat: support build FTS index distributedly lance#4578
    • Commit Phase: Atomically commit index information to the dataset

@github-actions github-actions Bot added the enhancement New feature or request label Sep 4, 2025
BubbleCal pushed a commit to lance-format/lance that referenced this pull request Sep 8, 2025
Close #4514
Related with lance-format/lance-ray#12

This PR introduces distributed fts index capabilities enabling parallel
index creation across multiple fragments.

### New added methods
- `execute_uncommitted()`: Creates index metadata without dataset
commitment, returning `IndexMetadata` for distributed coordination.
Chainable method specifying target fragment IDs for selective indexing
as suggested by Will Jones in discussion
#4514 (comment)
```
              let partial_index=CreateIndexBuilder::new(&mut dataset, &["text"], IndexType::Inverted, &params)
                    .name("distributed_index".to_string())
                    .fragments(vec![fragment_id])
                    .fragment_uuid(shared_uuid.clone())
                    .execute_uncommitted()
                    .await?;
```

- `merge_index_metadata()`: Merges distributed index metadata from
multiple workers into consolidated final metadata

### Distributed Workflow
1. **Split and Parallel Phase**: Ray header distributes the fragments to
different workers. (We may distribute the fragments evenly to different
workers by fragment statistics, which will be implemented in lance-ray
connector)
Ray Workers call `execute_uncommitted()` on specific fragments using
`fragments()` method. Shared UUID via `fragment_uuid()` ensures
consistent index identity.
2. **Merge Phase**: `merge_index_metadata()` consolidates partition
metadata files (`part_*_metadata.lance`)
3. **Commit Phase**: Final index commitment with unified metadata

**Example**
The workflow example can be found in `test_distribute_fts_index_build`
in
[test_scalar_index.py](https://github.com/lancedb/lance/pull/4578/files#diff-a95edaddaa3a260e498c04e10f073261bdc529cd4f47b928ad80274754af0548R1964-R2021).

**Following work after this PR:**
The distributed index building workflow PR will be proposed to lance-ray
connector.
lance-ray draft PR lance-format/lance-ray#45

**Other implementation details on fragment_mask**
 Optional mask with fragment_id in high 32 bits. When provided,
only partitions whose partition id matches this fragment will be
included.
The fragment mask is constructed as `(fragment_id as u64) << 32`.
@chenghao-guo chenghao-guo force-pushed the distributed-text-indexing branch 3 times, most recently from 2b8978f to bbfe251 Compare September 9, 2025 09:09
@chenghao-guo chenghao-guo marked this pull request as ready for review September 9, 2025 09:23
@chenghao-guo chenghao-guo force-pushed the distributed-text-indexing branch from bbfe251 to 9c1c37d Compare September 9, 2025 13:00
@chenghao-guo chenghao-guo force-pushed the distributed-text-indexing branch from 9c1c37d to 9eb38d4 Compare September 9, 2025 13:42
@jiaoew1991
Copy link
Copy Markdown
Collaborator

@BubbleCal index expert take a look

@jiaoew1991 jiaoew1991 requested a review from BubbleCal September 9, 2025 15:45
Comment thread lance_ray/index.py Outdated
@chenghao-guo chenghao-guo force-pushed the distributed-text-indexing branch 3 times, most recently from a19fd1c to 06a17a8 Compare September 10, 2025 07:44
@chenghao-guo chenghao-guo force-pushed the distributed-text-indexing branch from 06a17a8 to a52ffd8 Compare September 15, 2025 02:50
@yanghua
Copy link
Copy Markdown
Collaborator

yanghua commented Sep 15, 2025

cc @jackye1995 PTAL, thanks!

Copy link
Copy Markdown

@BubbleCal BubbleCal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just some questions

Comment thread lance_ray/index.py
Comment thread lance_ray/index.py Outdated
- Change index_type to Union[Literal[...], IndexConfig] for better type safety
- Add standard LanceDataset parameters: replace, train, fragment_ids, fragment_uuid
- Make num_workers, storage_options, ray_remote_args keyword-only arguments
@chenghao-guo chenghao-guo force-pushed the distributed-text-indexing branch from 2253727 to b8ce229 Compare September 15, 2025 12:31
@chenghao-guo
Copy link
Copy Markdown
Collaborator Author

LGTM overall, just some questions

All issues have been addressed and the unit tests are passing. Thanks again for your review. Please let me know if any further suggestions. If there are no additional comments within the next day, I’ll go ahead and merge it.

Copy link
Copy Markdown

@BubbleCal BubbleCal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the awesome work!

@chenghao-guo chenghao-guo merged commit ead2283 into lance-format:main Sep 16, 2025
4 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Close lance-format#4514
Related with lance-format/lance-ray#12

This PR introduces distributed fts index capabilities enabling parallel
index creation across multiple fragments.

### New added methods
- `execute_uncommitted()`: Creates index metadata without dataset
commitment, returning `IndexMetadata` for distributed coordination.
Chainable method specifying target fragment IDs for selective indexing
as suggested by Will Jones in discussion
lance-format#4514 (comment)
```
              let partial_index=CreateIndexBuilder::new(&mut dataset, &["text"], IndexType::Inverted, &params)
                    .name("distributed_index".to_string())
                    .fragments(vec![fragment_id])
                    .fragment_uuid(shared_uuid.clone())
                    .execute_uncommitted()
                    .await?;
```

- `merge_index_metadata()`: Merges distributed index metadata from
multiple workers into consolidated final metadata

### Distributed Workflow
1. **Split and Parallel Phase**: Ray header distributes the fragments to
different workers. (We may distribute the fragments evenly to different
workers by fragment statistics, which will be implemented in lance-ray
connector)
Ray Workers call `execute_uncommitted()` on specific fragments using
`fragments()` method. Shared UUID via `fragment_uuid()` ensures
consistent index identity.
2. **Merge Phase**: `merge_index_metadata()` consolidates partition
metadata files (`part_*_metadata.lance`)
3. **Commit Phase**: Final index commitment with unified metadata

**Example**
The workflow example can be found in `test_distribute_fts_index_build`
in
[test_scalar_index.py](https://github.com/lancedb/lance/pull/4578/files#diff-a95edaddaa3a260e498c04e10f073261bdc529cd4f47b928ad80274754af0548R1964-R2021).

**Following work after this PR:**
The distributed index building workflow PR will be proposed to lance-ray
connector.
lance-ray draft PR lance-format/lance-ray#45

**Other implementation details on fragment_mask**
 Optional mask with fragment_id in high 32 bits. When provided,
only partitions whose partition id matches this fragment will be
included.
The fragment mask is constructed as `(fragment_id as u64) << 32`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support create fts index distributively on ray

4 participants