Supports range-based distributed BTree index #5164

@steFaiz

Description

Introduction

The BTree index is a crucial scalar index in Lance. Many thanks to @westonpace for the implementation! Its main structure is shown below:

Image

In summary, all <value, rowAddress> pairs are stored, sorted by value, in a single page_data.lance file in object storage. A B-tree structure loaded from page_lookup.lance is kept in memory to enable fast access to a relatively small data block, and the final results are obtained by sequentially scanning within that block.
To accelerate index building, @xloya contributed the distributed version (as illustrated below). The key points are:
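The read path described above can be modeled in a few lines. This is an illustrative sketch with made-up data and names, not Lance's actual layout or code:

```python
import bisect

# Hypothetical in-memory page lookup: for each page in page_data.lance,
# record the maximum value it contains (pages are sorted by value).
page_max_values = [10, 20, 30]           # page i holds values <= page_max_values[i]
pages = [                                # (value, row_address) pairs, sorted by value
    [(1, 100), (5, 101), (10, 102)],
    [(12, 103), (20, 104)],
    [(25, 105), (30, 106)],
]

def lookup(value):
    # Binary search the in-memory lookup structure for the candidate page...
    i = bisect.bisect_left(page_max_values, value)
    if i == len(pages):
        return []
    # ...then sequentially scan that small page for matching row addresses.
    return [addr for v, addr in pages[i] if v == value]

print(lookup(20))  # [104]
```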

  1. Fragments are horizontally divided: each subtask builds a BTree index for only a subset of the full list of fragments.
  2. A dedicated merger performs a k-way merge of all page files into one final page file and also produces the lookup file. Since all page files are already sorted, the merge is relatively fast.

Image

However, a single point (the Merger) inevitably has to process all the data, which becomes unacceptable at very large scales (tens of billions of records or more).
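The merger's k-way merge can be sketched with Python's `heapq.merge`. The data below is illustrative; the real merger operates on sorted .lance page files rather than in-memory lists:

```python
import heapq

# Each subtask produces a page file already sorted by value; modeled here
# as sorted lists of (value, row_address) pairs.
page_file_1 = [(1, 100), (4, 103), (9, 108)]
page_file_2 = [(2, 101), (3, 102), (7, 106)]
page_file_3 = [(5, 104), (6, 105), (8, 107)]

# Because every input is sorted, a heap over the k head elements yields the
# globally sorted stream in O(n log k) -- faster than a full re-sort, but the
# single merger still has to stream all n records through itself.
merged = list(heapq.merge(page_file_1, page_file_2, page_file_3))
assert merged == sorted(page_file_1 + page_file_2 + page_file_3)
```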

Resolution

Existing big data computing engines (such as Spark and Flink) provide mature range-shuffle capabilities, enabling efficient global sorting of datasets at the scale of hundreds of billions of records. Leveraging this, we have implemented a range-based BTree index using Lance-Spark and Lance, with the following key features:

  1. Minimal changes to the existing Lance BTree index — only a few hundred lines of core code were added.
  2. Full backward compatibility — completely compatible with the current implementation.
  3. Dramatic performance improvement — for datasets with hundreds of millions of records, the merge phase is reduced from tens of minutes to under a few seconds, and end-to-end index construction completes in just a few minutes.

The range-based BTree index has a structure nearly identical to the current implementation, with the only addition being multiple distinct page files—each covering a non-overlapping data range:

Image

Workflow

Image

The overall workflow is as follows:
  1. Spark/Flink reads the column to be indexed and its corresponding RowAddresses from Lance, performs a global sort, and applies range-based shuffling.
  2. For each data range, it invokes Lance dataset.create_index with the corresponding rangeId passed in IndexParams. This create_index call is identical to the standard BTree index construction process.
  3. The Merger combines all lookup files. Since this step only requires sequentially processing and merging the lookup files, the merge phase is extremely fast.
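Step 1's range-based shuffle can be sketched as sampling split points and assigning each value a rangeId. This is a simplified model of what a Spark/Flink range partitioner does; the function names are hypothetical:

```python
import bisect
import random

def range_boundaries(values, num_ranges, sample_size=1000):
    # Sample the indexed column and pick evenly spaced split points,
    # roughly how a range shuffle derives its partitioner (simplified).
    sample = sorted(random.sample(values, min(sample_size, len(values))))
    step = len(sample) / num_ranges
    return [sample[int(i * step)] for i in range(1, num_ranges)]

def range_id(value, boundaries):
    # Ranges are non-overlapping and ordered, so each subtask's page file
    # ends up covering a distinct slice of the value domain.
    return bisect.bisect_right(boundaries, value)
```

Each range's rows then feed a standard create_index call, with the rangeId passed through IndexParams as described in step 2.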

Partial Index Creation

This part of the implementation remains fully consistent with the current approach, with only two minor modifications:

  1. A new field preprocessed_data is added to IndexCreateBuilder. When this data is present, the load_train_data step is skipped.
  2. The generated page_file name now includes the rangeId as part of its filename. The rangeIds are ordered by range, which is guaranteed by Spark/Flink.
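One way such filename ordering can work is to zero-pad the rangeId so that a plain lexicographic listing returns page files in range order. The naming scheme below is hypothetical, for illustration only:

```python
# Hypothetical naming scheme: embed a zero-padded rangeId in the page file
# name so lexicographic order matches numeric range order.
def page_file_name(range_id, uuid="abc123"):
    return f"page_data_{range_id:05d}_{uuid}.lance"

# Without padding, "10" would sort before "2"; with it, order is preserved.
names = [page_file_name(r) for r in (0, 2, 10, 1)]
assert sorted(names) == [page_file_name(r) for r in (0, 1, 2, 10)]
```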

Lookup File Merging

During the merge process, only the lookup files need to be merged. The purpose of this step is to:

  1. Reassign a global PageIndex number for each page across all ranges.
  2. Record the offset corresponding to each PageIndex in the consolidated pages file.

For example, consider these two lookup files:

Image

In the merged lookup file, the page indices from lookup file 2 are reassigned to 3 and 4, which correctly reflects their global page indices. Additionally, the number of pages in each original lookup file is recorded, enabling recovery of the per-page-file internal page indices during read operations, as illustrated below:

Image
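The reassignment and recovery described above reduce to a prefix sum over per-file page counts. A simplified model, not Lance's actual code:

```python
import bisect
import itertools

# Pages per original lookup file, in range order: e.g. lookup file 1 holds
# 3 pages and lookup file 2 holds 2 (numbers are illustrative).
pages_per_file = [3, 2]

# Prefix sums give each file's first global page index: with the counts
# above, file 2's local pages 0 and 1 become global pages 3 and 4.
offsets = [0] + list(itertools.accumulate(pages_per_file))[:-1]

def to_global(file_idx, local_page_idx):
    """Reassign a per-file page index to its global page index."""
    return offsets[file_idx] + local_page_idx

def to_local(global_page_idx):
    """Recover (file index, per-file page index) at read time."""
    f = bisect.bisect_right(offsets, global_page_idx) - 1
    return f, global_page_idx - offsets[f]
```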

Update

During updates, the distributed processing pipeline cannot be effectively leveraged for acceleration. Therefore, both update and remapping operations fall back to the standard BTree index behavior: all PageFiles are merged into a single file, so only one PageFile is generated.
We will introduce a distributed update mechanism in future issues.

Tests

We tested two tables with data volumes of 130 million and 10 billion rows respectively, and indexed an integer-type column. The results are as follows:

| num of rows | num of ranges | execution time | merge time |
| ----------- | ------------- | -------------- | ---------- |
| 130 million | 3             | 23 min         | 1 s        |
| 130 million | 50            | 3 min          | 3 s        |
| 10 billion  | 1000          | 15 min         | 46 s       |

Even on the 10-billion-row dataset, the merge completed in just 46 seconds, and end-to-end latency was reduced to about 15 minutes.
