docs: add distributed indexing workflow example #6250

Closed
Xuanwo wants to merge 1 commit into main from docs/distributed-indexing-example

@Xuanwo (Collaborator) commented Mar 21, 2026

This updates the distributed indexing guide with a full end-to-end Python example based on the staged segment workflow added in #6220. The example now follows the actual distributed execution model: shared training artifacts are written to object storage, shard build runs on distributed workers, segment build is also distributed, and the coordinator only plans and commits the final segments.

This should make the guide much easier to follow for users integrating Lance with their own scheduler on S3 or other object stores.

github-actions bot added the "documentation" label (Improvements or additions to documentation) Mar 21, 2026

@github-actions (Contributor)

PR Review

Clean, well-structured docs addition. All API calls verified against the current codebase — signatures are correct.

One issue to flag

IvfModel.save/load and PqModel.save/load don't accept storage_options — the example uses S3 URIs for model persistence (s3://my-bucket/example/index-training/ivf.lance), but save()/load() pass the URI directly to LanceFileReader/LanceFileWriter without forwarding storage_options. This works only if credentials are available through environment variables or instance profiles, not through explicit storage_options. Worth a note in the example so users don't get stuck, e.g.:

Model save/load rely on environment-based credentials (e.g., AWS_* env vars); explicit storage_options are not yet supported for model files.
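Until explicit storage_options are supported for model files, one possible workaround is to export the credentials into the process environment just for the duration of the save/load call. The following is a minimal sketch, not part of the PR: `credentials_in_env` is a hypothetical helper, and it assumes each storage_options key corresponds to an environment variable with the same name in upper case (e.g. `aws_access_key_id` to `AWS_ACCESS_KEY_ID`), which should be checked against the object store backend in use.

```python
import os
from contextlib import contextmanager


@contextmanager
def credentials_in_env(storage_options):
    """Temporarily export storage_options-style keys as environment variables.

    Hypothetical helper. Assumption: upper-casing a key yields the matching
    environment variable name (aws_region -> AWS_REGION).
    """
    env = {key.upper(): str(value) for key, value in storage_options.items()}
    # Remember what was set before so the environment can be restored.
    previous = {name: os.environ.get(name) for name in env}
    os.environ.update(env)
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


# Intended usage (ivf_model stands for an already trained model object):
#
# with credentials_in_env({"aws_region": "us-east-1"}):
#     ivf_model.save("s3://my-bucket/example/index-training/ivf.lance")
```

The context manager restores the previous environment on exit, so credentials exported for the model files do not leak into later calls that do pass storage_options explicitly.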

Minor nit

In step 2, train=True is the default and is redundant when pre-trained ivf_centroids/pq_codebook are already provided. Removing it would reduce potential confusion about whether training re-runs on workers.

Otherwise LGTM — the five-step flow is clear and matches the staged segment API well.

@westonpace (Member) left a comment

This example is great. I think my main concern would just be how we keep this example from getting out of date / stale.

One minor nit about the API...

# Here we create the segment_builder that we will use to create the plan
segment_builder = (
    ds.create_index_segment_builder(staging_index_uuid)
    .with_partial_indices(partial_indices)
    .with_target_segment_bytes(2 * 1024 * 1024 * 1024)  # optional
)
...
# Here we create the segment_builder that we will use to execute the plan
segment_builder = ds.create_index_segment_builder(staging_index_uuid)

It feels a little strange that we create a "segment builder" using the same method both for planning and for executing plans. In particular, it is a little odd that I have to care about with_partial_indices and with_target_segment_bytes in the former but not in the latter. I wonder if partial_indices and target_segment_bytes should be "plan options" instead of "segment builder options"?
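To make the "plan options" idea concrete, here is a toy sketch of what that shape could look like. Everything in it is hypothetical and only illustrates the proposal: `PlanOptions` and `SegmentBuilderSketch` are made-up names, the method names `plan` and `build` are assumptions, the planner is a stub that ignores `target_segment_bytes`, and none of this is the Lance API.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class PlanOptions:
    """Hypothetical container for inputs that only matter at planning time."""
    partial_indices: Sequence[str]
    target_segment_bytes: Optional[int] = None


class SegmentBuilderSketch:
    """Hypothetical stand-in for a segment builder; not the Lance API."""

    def __init__(self, staging_index_uuid: str) -> None:
        self.staging_index_uuid = staging_index_uuid

    def plan(self, options: PlanOptions) -> List[dict]:
        # Toy planner: one segment plan per partial index.
        # (A real planner would also honor target_segment_bytes.)
        return [
            {"staging_index_uuid": self.staging_index_uuid, "partial_indices": [p]}
            for p in options.partial_indices
        ]

    def build(self, segment_plan: dict) -> dict:
        # Executing a plan needs only the plan itself, no planning options.
        return {"built_from": segment_plan["partial_indices"]}


# Coordinator: planning inputs go to plan(), not to the builder.
planner = SegmentBuilderSketch("staging-uuid")
plans = planner.plan(
    PlanOptions(
        partial_indices=["shard-0", "shard-1"],
        target_segment_bytes=2 * 1024 * 1024 * 1024,
    )
)

# Worker: same constructor call, with nothing extra to configure.
worker = SegmentBuilderSketch("staging-uuid")
segments = [worker.build(p) for p in plans]
```

With this split, the builder is constructed identically on the coordinator and on workers, and the planning-only inputs appear only at the call that uses them.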

@Xuanwo Xuanwo closed this Mar 23, 2026
@Xuanwo Xuanwo deleted the docs/distributed-indexing-example branch March 27, 2026 09:06