Skip to content

feat: support build FTS index distributedly#4578

Merged
BubbleCal merged 1 commit intolance-format:mainfrom
chenghao-guo:fts-builder-opensource
Sep 8, 2025
Merged

feat: support build FTS index distributedly#4578
BubbleCal merged 1 commit intolance-format:mainfrom
chenghao-guo:fts-builder-opensource

Conversation

@chenghao-guo
Copy link
Copy Markdown
Contributor

@chenghao-guo chenghao-guo commented Aug 27, 2025

Close #4514
Related with lance-format/lance-ray#12

This PR introduces distributed fts index capabilities enabling parallel index creation across multiple fragments.

New added methods

              let partial_index=CreateIndexBuilder::new(&mut dataset, &["text"], IndexType::Inverted, &params)
                    .name("distributed_index".to_string())
                    .fragments(vec![fragment_id])
                    .fragment_uuid(shared_uuid.clone())
                    .execute_uncommitted()
                    .await?;
  • merge_index_metadata(): Merges distributed index metadata from multiple workers into consolidated final metadata

Distributed Workflow

  1. Split and Parallel Phase: Ray header distributes the fragments to different workers. (We may distribute the fragments evenly to different workers by fragment statistics, which will be implemented in lance-ray connector)
    Ray Workers call execute_uncommitted() on specific fragments using fragments() method. Shared UUID via fragment_uuid() ensures consistent index identity.
  2. Merge Phase: merge_index_metadata() consolidates partition metadata files (part_*_metadata.lance)
  3. Commit Phase: Final index commitment with unified metadata

Example
The workflow example can be found in test_distribute_fts_index_build in test_scalar_index.py.

Following work after this PR:
The distributed index building workflow PR will be proposed to lance-ray connector.
lance-ray draft PR lance-format/lance-ray#45

Other implementation details on fragment_mask
Optional mask with fragment_id in high 32 bits. When provided,
only partitions whose partition id matches this fragment will be included.
The fragment mask is constructed as (fragment_id as u64) << 32.

@github-actions github-actions Bot added enhancement New feature or request python labels Aug 27, 2025
@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch from d26ed0f to a58678a Compare August 27, 2025 09:59
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 61.73184% with 137 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.59%. Comparing base (103006f) to head (56cbff1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/builder.rs 20.76% 95 Missing and 8 partials ⚠️
rust/lance/src/index/scalar.rs 78.78% 9 Missing and 5 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs 57.69% 11 Missing ⚠️
rust/lance/src/index/create.rs 95.86% 5 Missing ⚠️
rust/lance-index/src/scalar/inverted.rs 77.77% 2 Missing ⚠️
rust/lance/src/index.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4578      +/-   ##
==========================================
- Coverage   80.62%   80.59%   -0.03%     
==========================================
  Files         317      317              
  Lines      118785   119097     +312     
  Branches   118785   119097     +312     
==========================================
+ Hits        95769    95985     +216     
- Misses      19568    19660      +92     
- Partials     3448     3452       +4     
Flag Coverage Δ
unittests 80.59% <61.73%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch 2 times, most recently from 311e247 to 18fc5a8 Compare August 27, 2025 11:31
@chenghao-guo chenghao-guo changed the title feat: support build inverted index distributedly feat: support build fts index distributedly Aug 27, 2025
@chenghao-guo chenghao-guo changed the title feat: support build fts index distributedly feat: support build FTS index distributedly Aug 27, 2025
@chenghao-guo chenghao-guo marked this pull request as ready for review August 28, 2025 02:26
@chenghao-guo chenghao-guo marked this pull request as draft August 28, 2025 07:17
@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch from a079fb8 to d5b3790 Compare August 28, 2025 12:46
@chenghao-guo chenghao-guo marked this pull request as ready for review August 28, 2025 13:46
@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch 2 times, most recently from 80d542b to b07d1a9 Compare August 29, 2025 04:52
Copy link
Copy Markdown
Contributor

@BubbleCal BubbleCal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great work!
Just some questions

location: location!(),
})?;

all_partitions.extend(partition_ids);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think each indexing worker/fragment would have partition IDs starting from 0, we need to map each new partition id to all_partitions.len() + part_id?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion — that’s convincing. I’ll adjust the partition ID assignment logic and make the corresponding changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think each indexing worker/fragment would have partition IDs starting from 0, we need to map each new partition id to all_partitions.len() + part_id?

Hi @BubbleCal , thanks for taking a look. I want to confirm that I’m interpreting the implementation correctly.

In the previous design, each partition ID was assigned using the mask fragment_id << 32 to ensure isolation across fragments and avoid conflicts. As a result, partition IDs are not sequential (i.e., not 0, 1, 2, 3, …).

Based on the suggestions, in the final stage, I reorder and rename all fragment partitions and then reassign partition IDs starting from 0. While this entails some rename operations, I’m not sure this is the best approach. I can’t assign sequential partition IDs 0,1,2... at fragment-index creation time because that work happens on different machines (e.g., Ray workers). Therefore, the only practical option I see is to sort the IDs and remap them in the last stage in a coordinator (e.g. Ray header)

I’ve made some changes to the function accordingly—please see the diff here:

I’d appreciate any feedback or alternative suggestions if you think there’s a better way to handle this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, TBH I didn't realize that the partition IDs were encoded by frag_id << 32 + i, but I think the current reorder and re-assign design is still better, just in case we will probably build single index over multiple fragments (e.g. there are many small fragments?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BubbleCal Yes, I completely agree with your idea. Using “reorder” and “reassign” is definitely clearer, since the generated FTS index is almost the same as the single-node index.

My only concern is that the renaming and overall code structure may not be very elegant, which could make the code harder for others to understand. To mitigate this, I’ve added inline comments for clarity. I’ll likely refactor the code structure in the near future.

BTW, kindly let me know if any other suggestions, or if I could get your approval to merge the PR.

return ds


def test_distribute_fts_index_build(tmp_path):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice if we can add a helper function that builds the FTS index in distributed way, then we can add it into all existing FTS tests, to make sure the 2 ways will always get the same result

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thank you for your great suggestions. I will add them later.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added helper function assert_distributed_fts_consistency and applied it to all existing FTS tests. Validated that distributed execution produces results consistent with single-node FTS index searches.

@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch 2 times, most recently from 2d1d80f to 9bb5b42 Compare September 4, 2025 09:22
@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch 2 times, most recently from 6c0c885 to 67e2424 Compare September 5, 2025 08:23
@chenghao-guo
Copy link
Copy Markdown
Contributor Author

Rebased to resolve conflict.

@chenghao-guo chenghao-guo force-pushed the fts-builder-opensource branch from 67e2424 to 56cbff1 Compare September 8, 2025 06:04
Copy link
Copy Markdown
Contributor

@BubbleCal BubbleCal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This work is amazing, thanks!

"text",
"lance search text",
tmp_path,
index_params={"with_position": False},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT, we've changed the default value to False for with_position, so no need to specify it here. we can remove this in the future PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it , thanks !

@BubbleCal BubbleCal merged commit 16bfffc into lance-format:main Sep 8, 2025
26 checks passed
jackye1995 pushed a commit that referenced this pull request Nov 5, 2025
…dule (#4961)

This PR is about to bring the distributed index creation functionality
(see #4667,
#4578) to java module, which is
aligned with the python implementation.

---------

Co-authored-by: 喆宇 <wxy407679@antgroup.com>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Close lance-format#4514
Related with lance-format/lance-ray#12

This PR introduces distributed fts index capabilities enabling parallel
index creation across multiple fragments.

### New added methods
- `execute_uncommitted()`: Creates index metadata without dataset
commitment, returning `IndexMetadata` for distributed coordination.
Chainable method specifying target fragment IDs for selective indexing
as suggested by Will Jones in discussion
lance-format#4514 (comment)
```
              let partial_index=CreateIndexBuilder::new(&mut dataset, &["text"], IndexType::Inverted, &params)
                    .name("distributed_index".to_string())
                    .fragments(vec![fragment_id])
                    .fragment_uuid(shared_uuid.clone())
                    .execute_uncommitted()
                    .await?;
```

- `merge_index_metadata()`: Merges distributed index metadata from
multiple workers into consolidated final metadata

### Distributed Workflow
1. **Split and Parallel Phase**: Ray header distributes the fragments to
different workers. (We may distribute the fragments evenly to different
workers by fragment statistics, which will be implemented in lance-ray
connector)
Ray Workers call `execute_uncommitted()` on specific fragments using
`fragments()` method. Shared UUID via `fragment_uuid()` ensures
consistent index identity.
2. **Merge Phase**: `merge_index_metadata()` consolidates partition
metadata files (`part_*_metadata.lance`)
3. **Commit Phase**: Final index commitment with unified metadata

**Example**
The workflow example can be found in `test_distribute_fts_index_build`
in
[test_scalar_index.py](https://github.com/lancedb/lance/pull/4578/files#diff-a95edaddaa3a260e498c04e10f073261bdc529cd4f47b928ad80274754af0548R1964-R2021).

**Following work after this PR:**
The distributed index building workflow PR will be proposed to lance-ray
connector.
lance-ray draft PR lance-format/lance-ray#45

**Other implementation details on fragment_mask**
 Optional mask with fragment_id in high 32 bits. When provided,
only partitions whose partition id matches this fragment will be
included.
The fragment mask is constructed as `(fragment_id as u64) << 32`.
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…dule (lance-format#4961)

This PR is about to bring the distributed index creation functionality
(see lance-format#4667,
lance-format#4578) to java module, which is
aligned with the python implementation.

---------

Co-authored-by: 喆宇 <wxy407679@antgroup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add create_fragment_inverted_index for distributed inverted index building

3 participants