Introduction
The BTree index is a crucial scalar index in Lance. Many thanks to @westonpace for the implementation! Its main structure is shown below:
In summary, all <value, rowAddress> pairs are stored in a single page_data.lance in object storage. A B-tree structure from page_lookup.lance is kept in memory to enable fast access to a relatively small data block, and the final results are obtained through sequential scanning within that block.
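The lookup path described above can be sketched in a few lines. This is a minimal illustration, not the actual Lance implementation: the `page_lookup` and `pages` structures and the `lookup` function are hypothetical stand-ins for the in-memory B-tree over `page_lookup.lance` and the page blocks in `page_data.lance`.

```python
# Minimal sketch (NOT the real Lance code) of a page-lookup point query:
# binary-search the in-memory lookup to find the one candidate page,
# then sequentially scan that page's (value, rowAddress) pairs.
import bisect

# Hypothetical in-memory lookup: (max_value_in_page, page_index), sorted.
page_lookup = [(9, 0), (19, 1), (29, 2)]

# Hypothetical page contents: page_index -> sorted (value, row_address) pairs.
pages = {
    0: [(1, 100), (5, 101), (9, 102)],
    1: [(12, 200), (19, 201)],
    2: [(21, 300), (25, 301), (29, 302)],
}

def lookup(value):
    """Return all row addresses whose indexed value equals `value`."""
    # First page whose max value is >= the query value.
    i = bisect.bisect_left([m for m, _ in page_lookup], value)
    if i == len(page_lookup):
        return []
    page = pages[page_lookup[i][1]]
    # Sequential scan within the single candidate page.
    return [addr for v, addr in page if v == value]
```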
To accelerate index building, @xloya contributed the distributed version (as illustrated below). The key points are:
- Fragments are horizontally partitioned. Each subtask builds a BTree index for only a subset of the full fragment list.
- A dedicated merger performs a k-way merge of all page files into one final page file, and also produces the lookup file. Since all page files are already sorted, the merge is relatively fast.
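The merger's k-way merge can be sketched with the standard library. This is an illustrative sketch with made-up data, not the merger's actual code: because each subtask's page file is already sorted, `heapq.merge` streams them into one globally sorted sequence in a single pass.

```python
# Sketch of the merger's k-way merge: each input is already sorted,
# so heapq.merge produces the globally sorted output in one streaming pass.
import heapq

# Hypothetical per-subtask page files: sorted (value, row_address) pairs.
page_file_1 = [(1, 100), (7, 101)]
page_file_2 = [(3, 200), (9, 201)]
page_file_3 = [(2, 300), (8, 301)]

merged = list(heapq.merge(page_file_1, page_file_2, page_file_3))
# merged: [(1, 100), (2, 300), (3, 200), (7, 101), (8, 301), (9, 201)]
```

This works lazily over iterators, so the real merger can stream page files from object storage without materializing them; the single-consumer bottleneck the next paragraph describes comes from all data flowing through this one pass, not from the merge algorithm itself.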

However, inevitably, a single point (the Merger) has to process all the data, which becomes unacceptable at very large scales (tens of billions of records or more).
Resolution
Existing big data computing engines (such as Spark and Flink) provide mature range-shuffle capabilities, enabling efficient global sorting of datasets at the scale of hundreds of billions of records. Leveraging this, we have implemented a range-based BTree index using Lance-Spark and Lance, with the following key features:
- Minimal changes to the existing Lance BTree index — only a few hundred lines of core code were added.
- Full backward compatibility — completely compatible with the current implementation.
- Dramatic performance improvement — for datasets with hundreds of millions of records, the merge phase is reduced from tens of minutes to under a few seconds, and end-to-end index construction completes in just a few minutes.
The range-based BTree index has a structure nearly identical to the current implementation, with the only addition being multiple distinct page files—each covering a non-overlapping data range:
Workflow

The overall workflow is as follows:
- Spark/Flink reads the column to be indexed and its corresponding RowAddresses from Lance, performs a global sort, and applies range-based shuffling.
- For each data range, it invokes Lance `dataset.create_index` with the corresponding rangeId passed in `IndexParams`. This `create_index` call is identical to the standard BTree index construction process.
- The Merger combines all lookup files. Since this step only requires sequentially processing and merging the lookup files, the merge phase is extremely fast.
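The range-shuffle step can be sketched in plain Python standing in for Spark/Flink. The records, boundaries, and range assignment below are hypothetical; in Spark the boundaries would come from sampling (RangePartitioner-style), and each resulting range would then be fed to `dataset.create_index` with its rangeId.

```python
# Plain-Python stand-in for the Spark/Flink range shuffle:
# global sort, then split into non-overlapping ranges by boundary values.
import bisect

# Hypothetical (value, row_address) records.
records = [(v, 1000 + v) for v in [42, 7, 19, 88, 3, 55, 61, 24]]

# Step 1: global sort by the indexed value.
records.sort()

# Step 2: range shuffle. Boundaries are made up here; Spark derives
# them by sampling so ranges are roughly balanced.
boundaries = [20, 60]  # range 0: <20, range 1: [20, 60), range 2: >=60
ranges = {0: [], 1: [], 2: []}
for value, row_addr in records:
    range_id = bisect.bisect_right(boundaries, value)
    ranges[range_id].append((value, row_addr))

# Step 3: each range would then be handed to Lance as preprocessed data,
# with its rangeId carried in IndexParams (shown as a comment only):
#   dataset.create_index(..., index_params={"range_id": range_id, ...})
```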
Partial Index Creation
This part of the implementation remains fully consistent with the current approach, with only two minor modifications:
- A new field `preprocessed_data` is added to `IndexCreateBuilder`. When this data is present, the `load_train_data` step is skipped.
- The generated page file's name now includes the rangeId. The rangeIds are already ordered by range, which is guaranteed by Spark/Flink.
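One practical detail worth illustrating: for file order to match range order, the rangeId embedded in the filename must sort lexicographically the same way it sorts numerically. The filenames below are hypothetical, not Lance's actual naming scheme; zero-padding is one common way to guarantee this.

```python
# Hypothetical page-file names: zero-padding the rangeId keeps
# lexicographic file order identical to numeric range order.
names = [f"page_data_{range_id:04d}.lance" for range_id in [0, 2, 10, 1]]
assert sorted(names) == [
    "page_data_0000.lance",
    "page_data_0001.lance",
    "page_data_0002.lance",
    "page_data_0010.lance",
]
```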
Lookup File Merging
During the merge process, only the lookup files need to be merged. The purpose of this step is to:
- Reassign a global PageIndex number for each page across all ranges.
- Record the offset corresponding to each PageIndex in the consolidated pages file.
For example, consider these two lookup files:

In the merged lookup file, the page indices from lookup file 2 are reassigned to 3 and 4, which correctly reflect their global page indices. Additionally, the number of pages in each original lookup file is recorded, enabling the recovery of per-page-file internal page indices during read operations, as illustrated below:
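The index reassignment and its inverse can be sketched directly from the example above. This is a minimal model, not the Lance code: assume lookup file 1 has 3 pages and lookup file 2 has 2, so file 2's local pages 0 and 1 become global pages 3 and 4; recording each file's page count is exactly what lets a reader map a global page index back to (file, local page).

```python
# Sketch of global/local page-index mapping across merged lookup files.
import bisect
import itertools

pages_per_file = [3, 2]  # pages in lookup file 1 and lookup file 2

# Global index of each file's first page: [0, 3].
starts = [0, *itertools.accumulate(pages_per_file)][:-1]

def to_global(file_id, local_page):
    """Reassign a per-file page index to its global page index."""
    return starts[file_id] + local_page

def to_local(global_page):
    """Recover (file_id, local_page) from a global page index on read."""
    file_id = bisect.bisect_right(starts, global_page) - 1
    return file_id, global_page - starts[file_id]
```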
Update
During updates, the distributed processing pipeline cannot be effectively leveraged for acceleration. Therefore, both update and remapping operations fall back to the standard BTree index behavior: all PageFiles are merged into a single file, resulting in only one PageFile being generated.
We will introduce a distributed update mechanism in future issues.
Tests
We tested two tables, with 130 million and 10 billion rows respectively, and indexed an integer-type column. The results are as follows:
| num of rows | num of ranges | execution time | merge time |
| --- | --- | --- | --- |
| 130 million | 3 | 23 min | 1 s |
| 130 million | 50 | 3 min | 3 s |
| 10 billion | 1000 | 15 min | 46 s |
Even on the 10-billion-row dataset, the merge completed in just 46 seconds, and end-to-end latency was reduced to about 15 minutes.