
feat: support create btree index distributedly #4667

Merged
westonpace merged 20 commits into lance-format:main from xloya:feat-create-btree-distributely
Sep 16, 2025

Conversation

@xloya
Contributor

@xloya xloya commented Sep 8, 2025

Closed #4665.

Overall Steps:

  1. Create ordered BTree sub-page files and sub-lookup files at the fragment level, using Ray or Daft to distribute the work.
  2. Sort and merge the sub-page files with a k-way merge, with support for prefetching sub-page file data.
  3. Write the final lookup file.
  4. Commit the final index to the dataset.
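The steps above can be sketched in plain Python. This is a minimal illustration, not the Lance implementation: the `(value, row_id)` row layout, the function names, and the `(min, max, page_index)` lookup format are all assumptions made for the example, and the commit step is omitted.

```python
import heapq
from typing import Iterable, Iterator

Row = tuple[str, int]  # (indexed value, row id); illustrative layout only


def build_sub_index(fragment_rows: Iterable[Row]) -> list[Row]:
    """Step 1: each worker sorts the rows of its own fragment(s)."""
    return sorted(fragment_rows)


def merge_sub_indexes(sub_indexes: list[list[Row]]) -> Iterator[Row]:
    """Step 2: k-way merge of runs that are already sorted."""
    return heapq.merge(*sub_indexes)


def write_pages(rows: Iterable[Row], page_size: int):
    """Steps 2-3: cut the merged stream into pages and record each page's range."""
    pages, lookup, page = [], [], []
    for row in rows:
        page.append(row)
        if len(page) == page_size:
            lookup.append((page[0][0], page[-1][0], len(pages)))
            pages.append(page)
            page = []
    if page:  # final, possibly short, page
        lookup.append((page[0][0], page[-1][0], len(pages)))
        pages.append(page)
    return pages, lookup


# Two unsorted "fragments", as two workers might see them.
fragments = [
    [("id-50", 0), ("id-10", 1)],
    [("id-80", 2), ("id-30", 3)],
]
sub_indexes = [build_sub_index(f) for f in fragments]
pages, lookup = write_pages(merge_sub_indexes(sub_indexes), page_size=3)
```

Only the per-fragment sort runs on the workers in this sketch; the merge consumes the sorted runs as streams, which is why it does not need to hold every row in memory at once.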

Production Test Results:
In a production scenario, using Ray and 50 workers on a string ID field in a dataset of 700 million records, we achieved the following:

  1. Btree index build time was reduced from 190 minutes to 19 minutes, a 10x increase in build speed.
  2. Peak memory usage on the Ray head node when creating the index was reduced from 90+ GB to 4+ GB, a 95%+ reduction.
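As a quick check of the quoted ratios, the figures can be plugged into a few lines of Python. The "90+" and "4+" values are treated here as exactly 90 and 4, which is an assumption; the true numbers are only known to be at least that large.

```python
baseline_minutes, distributed_minutes = 190, 19
baseline_gb, distributed_gb = 90, 4  # assumed exact; the PR reports "90+" and "4+"

speedup = baseline_minutes / distributed_minutes
memory_reduction = 1 - distributed_gb / baseline_gb

print(f"speedup: {speedup:.1f}x")
print(f"memory reduction: {memory_reduction:.1%}")
```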

@github-actions github-actions Bot added enhancement New feature or request python java labels Sep 8, 2025
@xloya xloya changed the title feat: support btree distributely feat: support create btree distributely Sep 8, 2025
@xloya xloya changed the title feat: support create btree distributely feat: support create btree index distributely Sep 8, 2025
@xloya xloya force-pushed the feat-create-btree-distributely branch 2 times, most recently from 2cdc2b1 to b634aa8 Compare September 8, 2025 12:50
@codecov-commenter

codecov-commenter commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 81.82927% with 149 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.73%. Comparing base (76a710e) to head (65ad622).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/btree.rs | 85.13% | 89 Missing and 28 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/builder.rs | 0.00% | 32 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #4667    +/-   ##
========================================
  Coverage   80.72%   80.73%            
========================================
  Files         321      321            
  Lines      124043   124847   +804     
  Branches   124043   124847   +804     
========================================
+ Hits       100131   100792   +661     
- Misses      20340    20457   +117     
- Partials     3572     3598    +26     
Flag Coverage Δ
unittests 80.73% <81.82%> (+<0.01%) ⬆️


@xloya
Contributor Author

xloya commented Sep 9, 2025

@jackye1995 @westonpace @BubbleCal @chenghao-guo Please take a look when you have time, thx!

@xloya xloya changed the title feat: support create btree index distributely feat: support create btree index distributedly Sep 9, 2025
@jackye1995 jackye1995 self-requested a review September 9, 2025 03:27
],
prefetch_batch: Optional[int] = None,
):
"""
Contributor

Thank you for the refactor. My previous design wasn't thoroughly considered; yours is a significant improvement.

@xloya xloya force-pushed the feat-create-btree-distributely branch from f09c92b to d39785d Compare September 9, 2025 05:15
Contributor

@jackye1995 jackye1995 left a comment

This is an amazing feature! I added some initial comments and will look more into the details tomorrow.

Comment thread python/python/lance/dataset.py Outdated
Comment thread python/python/lance/dataset.py Outdated
Comment thread python/python/tests/test_scalar_index.py Outdated
Comment thread python/python/tests/test_scalar_index.py Outdated
Comment thread python/python/lance/dataset.py
Comment thread python/src/dataset.rs Outdated
Comment thread python/src/dataset.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs
@xloya xloya force-pushed the feat-create-btree-distributely branch from d39785d to 550e37a Compare September 9, 2025 06:39
@xloya
Contributor Author

xloya commented Sep 9, 2025

@jackye1995 Thanks for your review. I've addressed all comments.

@xloya xloya force-pushed the feat-create-btree-distributely branch from 42022fb to 966bceb Compare September 9, 2025 09:34
@xloya xloya force-pushed the feat-create-btree-distributely branch 2 times, most recently from 5c3c211 to 00b5f59 Compare September 9, 2025 11:13
@xloya xloya force-pushed the feat-create-btree-distributely branch 2 times, most recently from d3c3e6d to c7353ce Compare September 10, 2025 01:49
@xloya xloya force-pushed the feat-create-btree-distributely branch from c7353ce to eaaac6b Compare September 10, 2025 01:51
Member

@westonpace westonpace left a comment

This is great. I love the overall approach of adding a train method that takes in some fragments and creates a partial index and then a merge method that will complete all the fragments.

I think we can simplify the prefetch and just use the file reader's prefetch (I have some comments here).

Also, does this merge the batches themselves? For example, if I sort fragment 1 and get a batch with values 10-50, and then sort fragment 2 and get a batch with values 30-80, is there some code that merges these two batches (so that the output is, say, one batch with 10-50 and one with 50-80)? I might be missing it, but maybe it is just emitting 10-50 and then 30-80 in sequence?

Also, if we want to simplify some of the merge code, DataFusion has a SortPreservingMerge operator that can do this for us too.
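The concern in the batch example can be phrased as a check on consecutive page ranges. The sketch below is illustrative only: `pages_in_order` is a hypothetical helper, not part of Lance, and it assumes each page is summarized as a `(min, max)` pair.

```python
def pages_in_order(page_ranges: list[tuple[int, int]]) -> bool:
    """True if every page starts at or after the point where the previous one ended."""
    return all(
        prev_max <= next_min
        for (_, prev_max), (next_min, _) in zip(page_ranges, page_ranges[1:])
    )


# Batches emitted one after another without merging: the ranges overlap.
emitted_in_sequence = pages_in_order([(10, 50), (30, 80)])

# Batches produced by a real merge: the ranges may touch but never overlap.
properly_merged = pages_in_order([(10, 50), (50, 80)])
```

The comparison uses `<=` rather than `<` so that a duplicate value straddling a page boundary (50 in the second case) still counts as ordered.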

Comment thread python/python/lance/dataset.py
Comment thread python/python/tests/test_scalar_index.py Outdated
Comment thread python/python/tests/test_scalar_index.py
Comment thread python/python/tests/test_scalar_index.py Outdated
Comment thread python/src/dataset.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
@xloya
Contributor Author

xloya commented Sep 11, 2025

> Also, does this merge the batches themselves? For example, if I sort fragment 1 and I get a batch with values 10-50 and then I sort fragment 2 and I get a batch with values 30-80 is there some code that is merging these two batches (maybe the output is a batch with 10-50 and 50-80)? I might be missing it but maybe it is just emitting 10-50 and then 30-80 in sequence?

I might not fully understand what's going on here. Let me explain the logic using these two fragments:

  1. Use train_btree_index to construct sub-indexes for fragment_1 and fragment_2. If they are grouped together, I believe this code will ensure strict ordering: https://github.com/lancedb/lance/blob/main/rust/lance/src/index/scalar.rs#L131-L135, which generates the sequence [10, 80]. If they are not grouped together, the sequences [10, 50] and [30, 80] are generated, respectively.
  2. Use the k-way merge algorithm to merge them, ensuring the final order.

@westonpace
Member

> I might not fully understand what's going on here. Let me explain the logic using these two fragments:
>
>   1. Use train_btree_index to construct sub-indexes for fragment_1 and fragment_2. If they are grouped together, I believe this code will ensure strict ordering: https://github.com/lancedb/lance/blob/main/rust/lance/src/index/scalar.rs#L131-L135, which generates the sequence [10, 80]. If they are not grouped together, the sequences [10, 50] and [30, 80] are generated, respectively.
>   2. Use the k-way merge algorithm to merge them, ensuring the final order.

Ah, I see where I was confused. I thought the partition iterator was yielding batches. Instead it is yielding rows. So the heap is built one row at a time and batches are merged that way. You can ignore that particular comment.
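A row-at-a-time heap merge of the kind described here might look like the following. It is a sketch under stated assumptions, not the code from this PR: `merge_rows` and `rebatch` are hypothetical names, rows are plain integers, and `None` is used as an end-of-partition sentinel, so the values themselves must never be `None`.

```python
import heapq
from typing import Iterator


def merge_rows(partitions: list[list[int]]) -> Iterator[int]:
    """Yield rows in global order, one row at a time, from sorted partitions."""
    iterators = [iter(p) for p in partitions]
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)
    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        # Refill the heap from the partition that just gave up a row.
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))


def rebatch(rows: Iterator[int], batch_size: int) -> list[list[int]]:
    """Group the merged row stream back into fixed-size batches."""
    batches, current = [], []
    for row in rows:
        current.append(row)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


fragment_1 = [10, 20, 40, 50]  # sorted sub-index covering 10-50
fragment_2 = [30, 35, 60, 80]  # sorted sub-index covering 30-80
batches = rebatch(merge_rows([fragment_1, fragment_2]), batch_size=4)
```

Because the heap holds one row per partition, the overlapping input ranges 10-50 and 30-80 come out as batches whose ranges no longer overlap.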

@xloya
Contributor Author

xloya commented Sep 12, 2025

@jackye1995 @westonpace @BubbleCal Hi, I've refactored the code based on your feedback. Please review it again when you have time, thanks! I noticed that switching to DataFusion's SortPreservingMerge made index creation slower than my previous version (tuning the prefetch count and the number of sub-indexes can recover some of the speed), but it is still significantly faster than the current single-node build. The code is also much simpler now. I think we can optimize performance further in future PRs.
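To illustrate what a prefetch setting controls, here is a minimal sketch of a bounded prefetching iterator feeding a merge. It is not how Lance or DataFusion implement prefetch: `prefetch` and `depth` are hypothetical names, and the sketch neither propagates reader errors nor stops the background thread if the consumer abandons the stream early.

```python
import heapq
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of a source


def prefetch(source: Iterable[T], depth: int) -> Iterator[T]:
    """Read `source` on a background thread, buffering up to `depth` items.

    The thread starts when iteration begins, not when this function is called.
    """
    buffer: queue.Queue = queue.Queue(maxsize=depth)

    def produce() -> None:
        for item in source:
            buffer.put(item)  # blocks once `depth` items are waiting
        buffer.put(_DONE)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# Three sorted runs standing in for sub-index files.
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
merged = list(heapq.merge(*(prefetch(run, depth=2) for run in runs)))
```

A larger `depth` lets each reader run further ahead of the merge at the cost of more buffered data, which is the trade-off being tuned when the prefetch count or the number of sub-indexes changes.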

@xloya
Contributor Author

xloya commented Sep 16, 2025

@jackye1995 @westonpace @BubbleCal Gentle ping on this.

Member

@westonpace westonpace left a comment

Thanks for working through the reviews, this looks good to me now.

@westonpace westonpace merged commit 8476edb into lance-format:main Sep 16, 2025
26 checks passed
yanghua pushed a commit to lance-format/lance-ray that referenced this pull request Sep 19, 2025
Add support for distributed BTREE index building in ray connector based
on @xloya's great work in lance-format/lance#4667
jackye1995 pushed a commit that referenced this pull request Nov 5, 2025
…dule (#4961)

This PR brings the distributed index creation functionality
(see #4667,
#4578) to the Java module,
in line with the Python implementation.

---------

Co-authored-by: 喆宇 <wxy407679@antgroup.com>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

Labels

enhancement New feature or request java python


Development

Successfully merging this pull request may close these issues.

feat: Support create btree index distributedly

5 participants