Skip to content

Commit 0a3f230

Browse files
committed
docs: document distributed vector segment build
1 parent a07ef61 commit 0a3f230

File tree

3 files changed

+277
-1
lines changed

3 files changed

+277
-1
lines changed

docs/src/guide/.pages

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,8 @@ nav:
77
- Tags and Branches: tags_and_branches.md
88
- Object Store Configuration: object_store.md
99
- Distributed Write: distributed_write.md
10+
- Distributed Indexing: distributed_indexing.md
1011
- Migration Guide: migration.md
1112
- Performance Guide: performance.md
1213
- Tokenizer: tokenizer.md
13-
- Extension Arrays: arrays.md
14+
- Extension Arrays: arrays.md
Lines changed: 154 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,154 @@
1+
# Distributed Indexing
2+
3+
!!! warning
4+
Lance exposes public APIs that can be integrated into an external
5+
distributed index build workflow, but Lance itself does not provide a full
6+
distributed scheduler or end-to-end orchestration layer.
7+
8+
This page describes the current model, terminology, and execution flow so
9+
that callers can integrate these APIs correctly.
10+
11+
## Overview
12+
13+
Distributed index build in Lance follows the same high-level pattern as distributed
14+
write:
15+
16+
1. multiple workers build index data in parallel
17+
2. the caller invokes Lance segment build APIs for one distributed build
18+
3. Lance discovers the relevant worker outputs, then plans and builds index artifacts
19+
4. the built artifacts are committed into the dataset manifest
20+
21+
For vector indices, the worker outputs are temporary shard directories under a
22+
shared UUID. Internally, Lance can turn these shard outputs into one or more
23+
built physical segments.
24+
25+
![Distributed Vector Segment Build](../images/distributed_vector_segment_build.svg)
26+
27+
## Terminology
28+
29+
This guide uses the following terms consistently:
30+
31+
- **Staging root**: the shared UUID directory used during distributed index build
32+
- **Partial shard**: one worker output written under the staging root as
33+
`partial_<uuid>/`
34+
- **Built segment**: one physical index segment produced during segment build and
35+
ready to be committed into the manifest
36+
- **Logical index**: the user-visible index identified by name; a logical index
37+
may contain one or more built segments
38+
39+
For example, a distributed vector build may create a layout like:
40+
41+
```text
42+
indices/<staging_uuid>/
43+
├── partial_<shard_0>/
44+
│ ├── index.idx
45+
│ └── auxiliary.idx
46+
├── partial_<shard_1>/
47+
│ ├── index.idx
48+
│ └── auxiliary.idx
49+
└── partial_<shard_2>/
50+
├── index.idx
51+
└── auxiliary.idx
52+
```
53+
54+
After segment build, Lance produces one or more segment directories:
55+
56+
```text
57+
indices/<segment_uuid_0>/
58+
├── index.idx
59+
└── auxiliary.idx
60+
61+
indices/<segment_uuid_1>/
62+
├── index.idx
63+
└── auxiliary.idx
64+
```
65+
66+
These physical segments are then committed together as one logical index.
67+
68+
## Roles
69+
70+
There are two parties involved in distributed indexing:
71+
72+
- **Workers** build partial shards
73+
- **The caller** launches workers, chooses when a distributed build should be
74+
turned into built segments, provides any additional inputs requested by the
75+
segment build APIs, and
76+
commits the final result
77+
78+
Lance does not provide a distributed scheduler. The caller is responsible for
79+
launching workers and driving the overall workflow.
80+
81+
## Current Model
82+
83+
The current model for distributed vector indexing has two layers of parallelism.
84+
85+
### Shard Build
86+
87+
First, multiple workers build partial shards in parallel:
88+
89+
1. on each worker, call
90+
`create_index_builder(...).fragments(...).index_uuid(staging_index_uuid).execute_uncommitted()`
91+
2. each worker writes one `partial_<uuid>/` under the shared staging root
92+
93+
### Segment Build
94+
95+
Then the caller turns that staging root into one or more built segments:
96+
97+
1. open the staging root with `create_index_segment_builder(staging_index_uuid)`
98+
2. provide partial index metadata with `with_partial_indices(...)`
99+
3. optionally choose a grouping policy with `with_target_segment_bytes(...)`
100+
4. call `plan()` to get `Vec<IndexSegmentPlan>`
101+
102+
At that point the caller has two execution choices:
103+
104+
- call `build(plan)` for each plan and run those builds in parallel
105+
- call `build_all()` to let Lance build every planned segment on the current node
106+
107+
After the segments are built, publish them with
108+
`commit_existing_index_segments(...)`.
109+
110+
## Internal Segmented Finalize Model
111+
112+
Internally, Lance models distributed vector segment build as:
113+
114+
1. **plan** which partial shards should become each built segment
115+
2. **build** each segment from its selected partial shards
116+
3. **commit** the resulting physical segments as one logical index
117+
118+
The plan step is driven by the staging root and any additional shard metadata
119+
required by the segment build APIs.
120+
121+
This is intentionally a storage-level model:
122+
123+
- partial shards are temporary worker outputs
124+
- built segments are durable physical artifacts
125+
- the logical index identity is attached only at commit time
126+
127+
## Segment Grouping
128+
129+
When Lance builds segments from a staging root, it may either:
130+
131+
- keep shard boundaries, so each partial shard becomes one built segment
132+
- group multiple partial shards into a larger built segment
133+
134+
The grouping decision is separate from shard build. Workers only build partial
135+
shards; Lance applies the segment build policy when it plans built segments.
136+
137+
## Responsibility Boundaries
138+
139+
The caller is expected to know:
140+
141+
- which distributed build is ready for segment build
142+
- any additional shard metadata requested by the segment build APIs
143+
- how the resulting built segments should be published
144+
145+
Lance is responsible for:
146+
147+
- writing partial shard artifacts
148+
- discovering partial shards under the staging root
149+
- planning built segments from the discovered shard set
150+
- merging shard storage into built segment artifacts
151+
- committing built segments into the manifest
152+
153+
This split keeps distributed scheduling outside the storage engine while still
154+
letting Lance own the on-disk index format.

0 commit comments

Comments
 (0)