docs: add distributed indexing workflow example by Xuanwo · Pull Request #6250 · lance-format/lance

Xuanwo · 2026-03-21T22:49:59Z

This updates the distributed indexing guide with a full end-to-end Python example based on the staged segment workflow added in #6220. The example now follows the actual distributed execution model: shared training artifacts are written to object storage, shard build runs on distributed workers, segment build is also distributed, and the coordinator only plans and commits the final segments.

This should make the guide much easier to follow for users integrating Lance with their own scheduler on S3 or other object stores.
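As a rough illustration of the execution model described above, the sketch below wires the stages together with stand-in functions. Nothing here is Lance API: `train_shared_artifacts`, `build_shard`, `plan_segments`, `build_segment`, `commit`, and `run_workflow` are hypothetical names, the grouping logic is a placeholder, and a thread pool stands in for whatever scheduler actually runs the workers.

```python
from concurrent.futures import ThreadPoolExecutor


def train_shared_artifacts(artifact_dir):
    # Stand-in: train once and write shared artifacts to object storage.
    return {"artifact_dir": artifact_dir}


def build_shard(shard_id, artifacts):
    # Stand-in: a worker builds a partial index for one shard.
    return f"partial-{shard_id}"


def plan_segments(partial_indices, per_segment):
    # Stand-in: the coordinator groups partial indices into segment plans.
    return [
        partial_indices[i:i + per_segment]
        for i in range(0, len(partial_indices), per_segment)
    ]


def build_segment(group):
    # Stand-in: a worker merges one planned group into a segment.
    return "segment(" + ",".join(group) + ")"


def commit(segments):
    # Stand-in: the coordinator commits the final segments.
    return {"committed": list(segments)}


def run_workflow(shard_ids, artifact_dir, per_segment=2, max_workers=4):
    artifacts = train_shared_artifacts(artifact_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Shard build is distributed across workers.
        partials = list(pool.map(lambda s: build_shard(s, artifacts), shard_ids))
        # Planning happens on the coordinator only.
        plan = plan_segments(partials, per_segment)
        # Segment build is distributed as well.
        segments = list(pool.map(build_segment, plan))
    return commit(segments)
```

The point of the shape is that the coordinator touches only the plan and the commit, while both build stages fan out to workers.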

github-actions · 2026-03-21T22:52:28Z

PR Review

Clean, well-structured docs addition. All API calls verified against the current codebase — signatures are correct.

One issue to flag

IvfModel.save/load and PqModel.save/load don't accept storage_options. The example uses S3 URIs for model persistence (s3://my-bucket/example/index-training/ivf.lance), but save()/load() pass the URI directly to LanceFileReader/LanceFileWriter without forwarding storage_options. This works only if credentials are available through environment variables or instance profiles, not through explicit storage_options. It is worth adding a note to the example so users don't get stuck, e.g.:

Model save/load rely on environment-based credentials (e.g., AWS_* env vars); explicit storage_options are not yet supported for model files.
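One way a worker could guard against this is a small pre-flight check that fails early when static credentials are absent from the environment. This is only a sketch: `missing_credential_vars` and `require_env_credentials` are hypothetical helpers, not part of Lance, and the check assumes credentials come from the two standard AWS variables. Hosts that rely on instance profiles would skip it.

```python
import os

# Assumption: static credentials are supplied through these standard AWS variables.
CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(environ=None):
    """Return the names in CREDENTIAL_VARS that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in CREDENTIAL_VARS if not env.get(name)]


def require_env_credentials(environ=None):
    """Raise before a model save/load call rather than failing inside it."""
    missing = missing_credential_vars(environ)
    if missing:
        raise RuntimeError(
            "missing environment credentials: " + ", ".join(missing)
        )
```

Calling `require_env_credentials()` immediately before the model save or load step would surface a clear error on the worker instead of an opaque object-store failure.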

Minor nit

In step 2, train=True is the default and is redundant when pre-trained ivf_centroids/pq_codebook are already provided. Removing it would reduce potential confusion about whether training re-runs on workers.

Otherwise LGTM — the five-step flow is clear and matches the staged segment API well.

westonpace

This example is great. I think my main concern would just be how we keep this example from getting out of date / stale.

One minor nit about the API...

# Here we create the segment_builder that we will use to create the plan
segment_builder = (
    ds.create_index_segment_builder(staging_index_uuid)
    .with_partial_indices(partial_indices)
    .with_target_segment_bytes(2 * 1024 * 1024 * 1024)  # optional
)
...
# Here we create the segment_builder that we will use to execute the plan
segment_builder = ds.create_index_segment_builder(staging_index_uuid)

It feels a little strange that we create a "segment builder" using the same method both for planning and for executing plans. In particular, it is a little odd that I have to care about with_partial_indices and with_target_segment_bytes in the former but not in the latter. I wonder if partial_indices and target_segment_bytes should be "plan options" instead of "segment builder options"?
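To make the suggestion concrete, here is a toy sketch of what "plan options" separate from the builder could look like. It is not the Lance API and not a proposal taken from the PR: `PlanOptions`, `SegmentBuilder`, `plan`, and `build` are hypothetical, and the planning and build logic are placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlanOptions:
    """Hypothetical container for inputs that only planning needs."""

    partial_indices: Sequence[str]
    target_segment_bytes: Optional[int] = None  # optional, as in the example


class SegmentBuilder:
    """Toy stand-in, not the Lance class: holds only the staging index UUID."""

    def __init__(self, staging_index_uuid: str):
        self.staging_index_uuid = staging_index_uuid

    def plan(self, options: PlanOptions) -> list:
        # Placeholder planning: one plan entry per partial index.
        return [
            {"staging_index_uuid": self.staging_index_uuid, "partial_index": p}
            for p in options.partial_indices
        ]

    def build(self, plan_entry: dict) -> str:
        # Placeholder execution: needs the plan entry, not the planning options.
        return f"{plan_entry['staging_index_uuid']}/{plan_entry['partial_index']}"
```

With this split, the coordinator and the workers construct the builder the same way, and only the coordinator supplies `PlanOptions`.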

docs: add distributed indexing workflow example

a85ea93

github-actions bot added the "documentation" label (Improvements or additions to documentation) on Mar 21, 2026

westonpace approved these changes Mar 23, 2026

Xuanwo closed this Mar 23, 2026

Xuanwo deleted the docs/distributed-indexing-example branch March 27, 2026 09:06

Xuanwo wants to merge 1 commit into main from docs/distributed-indexing-example
