Supports range-based distributed BTree index #5164

@steFaiz

Description

Introduction

The BTree index is a crucial scalar index in Lance. Many thanks to @westonpace for the implementation! Its main structure is shown below:

Image

In summary, all <value, rowAddress> pairs are stored, sorted by value, in a single page_data.lance file in object storage. A B-tree structure loaded from page_lookup.lance is kept in memory to enable fast access to a relatively small data block, and the final results are obtained by sequentially scanning within that block.
To accelerate index building, @xloya contributed the distributed version (as illustrated below). The key points are:
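The read path described above can be modeled in a few lines. This is an illustrative sketch with made-up data and names, not Lance's actual layout or code:

```python
import bisect

# Hypothetical in-memory page lookup: for each page in page_data.lance,
# record the maximum value it contains (pages are sorted by value).
page_max_values = [10, 20, 30]           # page i holds values <= page_max_values[i]
pages = [                                # (value, row_address) pairs, sorted by value
    [(1, 100), (5, 101), (10, 102)],
    [(12, 103), (20, 104)],
    [(25, 105), (30, 106)],
]

def lookup(value):
    # Binary search the in-memory lookup structure for the candidate page...
    i = bisect.bisect_left(page_max_values, value)
    if i == len(pages):
        return []
    # ...then sequentially scan that small page for matching row addresses.
    return [addr for v, addr in pages[i] if v == value]

print(lookup(20))  # [104]
```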

  1. Fragments are horizontally divided: each subtask builds a BTree index for only a subset of the full list of fragments.
  2. A dedicated merger performs a k-way merge of all page files into one final page file and also produces the lookup file. Since all page files are already sorted, the merge is relatively fast.

Image

However, a single point (the Merger) inevitably has to process all the data, which becomes unacceptable at very large scales (tens of billions of records or more).
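The merger's k-way merge can be sketched with Python's `heapq.merge`. The data below is illustrative; the real merger operates on sorted .lance page files rather than in-memory lists:

```python
import heapq

# Each subtask produces a page file already sorted by value; modeled here
# as sorted lists of (value, row_address) pairs.
page_file_1 = [(1, 100), (4, 103), (9, 108)]
page_file_2 = [(2, 101), (3, 102), (7, 106)]
page_file_3 = [(5, 104), (6, 105), (8, 107)]

# Because every input is sorted, a heap over the k head elements yields the
# globally sorted stream in O(n log k) -- faster than a full re-sort, but the
# single merger still has to stream all n records through itself.
merged = list(heapq.merge(page_file_1, page_file_2, page_file_3))
assert merged == sorted(page_file_1 + page_file_2 + page_file_3)
```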

Resolution

Existing big data computing engines (such as Spark and Flink) provide mature range-shuffle capabilities, enabling efficient global sorting of datasets at the scale of hundreds of billions of records. Leveraging this, we have implemented a range-based BTree index using Lance-Spark and Lance, with the following key features:

  1. Minimal changes to the existing Lance BTree index — only a few hundred lines of core code were added.
  2. Full backward compatibility — completely compatible with the current implementation.
  3. Dramatic performance improvement — for datasets with hundreds of millions of records, the merge phase is reduced from tens of minutes to under a few seconds, and end-to-end index construction completes in just a few minutes.

The range-based BTree index has a structure nearly identical to the current implementation, with the only addition being multiple distinct page files—each covering a non-overlapping data range:

Image

Workflow

Image

The overall workflow is as follows:
  1. Spark/Flink reads the column to be indexed and its corresponding RowAddresses from Lance, performs a global sort, and applies range-based shuffling.
  2. For each data range, it invokes Lance dataset.create_index with the corresponding rangeId passed in IndexParams. This create_index call is identical to the standard BTree index construction process.
  3. The Merger combines all lookup files. Since this step only requires sequentially processing and merging the lookup files, the merge phase is extremely fast.
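Step 1's range-based shuffle can be sketched as sampling split points and assigning each value a rangeId. This is a simplified model of what a Spark/Flink range partitioner does; the function names are hypothetical:

```python
import bisect
import random

def range_boundaries(values, num_ranges, sample_size=1000):
    # Sample the indexed column and pick evenly spaced split points,
    # roughly how a range shuffle derives its partitioner (simplified).
    sample = sorted(random.sample(values, min(sample_size, len(values))))
    step = len(sample) / num_ranges
    return [sample[int(i * step)] for i in range(1, num_ranges)]

def range_id(value, boundaries):
    # Ranges are non-overlapping and ordered, so each subtask's page file
    # ends up covering a distinct slice of the value domain.
    return bisect.bisect_right(boundaries, value)
```

Each range's rows then feed a standard create_index call, with the rangeId passed through IndexParams as described in step 2.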

Partial Index Creation

This part of the implementation remains fully consistent with the current approach, with only two minor modifications:

  1. A new field preprocessed_data is added to IndexCreateBuilder. When this data is present, the load_train_data step is skipped.
  2. The generated page_file name now includes the rangeId as part of its filename. The rangeIds are ordered by range, which is guaranteed by Spark/Flink.
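One way such filename ordering can work is to zero-pad the rangeId so that a plain lexicographic listing returns page files in range order. The naming scheme below is hypothetical, for illustration only:

```python
# Hypothetical naming scheme: embed a zero-padded rangeId in the page file
# name so lexicographic order matches numeric range order.
def page_file_name(range_id, uuid="abc123"):
    return f"page_data_{range_id:05d}_{uuid}.lance"

# Without padding, "10" would sort before "2"; with it, order is preserved.
names = [page_file_name(r) for r in (0, 2, 10, 1)]
assert sorted(names) == [page_file_name(r) for r in (0, 1, 2, 10)]
```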

Lookup File Merging

During the merge process, only the lookup files need to be merged. The purpose of this step is to:

  1. Reassign a global PageIndex number for each page across all ranges.
  2. Record the offset corresponding to each PageIndex in the consolidated pages file.

For example, consider these two lookup files:

Image

In the merged lookup file, the page indices from lookup file 2 are reassigned to 3 and 4, which correctly reflects their global page indices. Additionally, the number of pages in each original lookup file is recorded, enabling recovery of the per-page-file internal page indices during read operations, as illustrated below:

Image
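The reassignment and recovery described above reduce to a prefix sum over per-file page counts. A simplified model, not Lance's actual code:

```python
import bisect
import itertools

# Pages per original lookup file, in range order: e.g. lookup file 1 holds
# 3 pages and lookup file 2 holds 2 (numbers are illustrative).
pages_per_file = [3, 2]

# Prefix sums give each file's first global page index: with the counts
# above, file 2's local pages 0 and 1 become global pages 3 and 4.
offsets = [0] + list(itertools.accumulate(pages_per_file))[:-1]

def to_global(file_idx, local_page_idx):
    """Reassign a per-file page index to its global page index."""
    return offsets[file_idx] + local_page_idx

def to_local(global_page_idx):
    """Recover (file index, per-file page index) at read time."""
    f = bisect.bisect_right(offsets, global_page_idx) - 1
    return f, global_page_idx - offsets[f]
```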

Update

During updates, the distributed processing pipeline cannot be effectively leveraged for acceleration. Therefore, both update and remapping operations fall back to the standard BTree index behavior: all PageFiles are merged into a single file, so only one PageFile is generated.
We will introduce a distributed update mechanism in future issues.

Tests

We tested two tables with data volumes of 130 million and 10 billion rows respectively, and indexed an integer-type column. The results are as follows:

| num of rows | num of ranges | execution time | merge time |
| ----------- | ------------- | -------------- | ---------- |
| 130 million | 3             | 23 min         | 1 s        |
| 130 million | 50            | 3 min          | 3 s        |
| 10 billion  | 1000          | 15 min         | 46 s       |

Even on the 10-billion-row dataset, the merge completed in just 46 seconds, and end-to-end latency was reduced to about 15 minutes.
