docs: add distributed indexing workflow example #6250
PR Review
Clean, well-structured docs addition. All API calls verified against the current codebase; signatures are correct. One issue to flag.
Minor nit
In step 2,
Otherwise LGTM: the five-step flow is clear and matches the staged segment API well.
westonpace
left a comment
This example is great. I think my main concern would just be how we keep this example from getting out of date / stale.
One minor nit about the API...
```python
# Here we create the segment_builder that we will use to create the plan
segment_builder = (
    ds.create_index_segment_builder(staging_index_uuid)
    .with_partial_indices(partial_indices)
    .with_target_segment_bytes(2 * 1024 * 1024 * 1024)  # optional
)
...
# Here we create the segment_builder that we will use to execute the plan
segment_builder = ds.create_index_segment_builder(staging_index_uuid)
```
It feels a little strange that we create a "segment builder" using the same method both for planning and for executing plans. In particular, it is a little odd that I have to care about with_partial_indices and with_target_segment_bytes in the former but not in the latter. I wonder if partial_indices and target_segment_bytes should be "plan options" instead of "segment builder options"?
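To make the suggestion concrete, here is a rough sketch of what "plan options" could look like. Everything in it is hypothetical: `PlanOptions`, the toy `SegmentBuilder` class, and its `plan` / `build` methods are stand-ins written for illustration and are not the Lance API. The toy planning rule (one segment per partial index) is also just an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlanOptions:
    """Hypothetical container for settings that only matter while planning."""
    partial_indices: List[str]
    target_segment_bytes: Optional[int] = None  # optional size hint


class SegmentBuilder:
    """Toy stand-in for the builder; NOT the real Lance class."""

    def __init__(self, staging_index_uuid: str):
        self.staging_index_uuid = staging_index_uuid

    def plan(self, options: PlanOptions) -> List[List[str]]:
        # Toy planning rule (assumption): one segment per partial index.
        return [[p] for p in options.partial_indices]

    def build(self, segment_plan: List[str]) -> str:
        # Executing a plan needs no planning options at all.
        return f"{self.staging_index_uuid}:{'+'.join(segment_plan)}"


# Planning side: options are passed to plan(), not to the builder.
builder = SegmentBuilder("staging-uuid")
plans = builder.plan(PlanOptions(["shard-0", "shard-1"], 2 * 1024**3))

# Execution side: a fresh builder only needs the staging UUID.
segments = [SegmentBuilder("staging-uuid").build(p) for p in plans]
```

With this shape, the builder is constructed the same way in both places, and only the planning call has to know about `partial_indices` and `target_segment_bytes`.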
This updates the distributed indexing guide with a full end-to-end Python example based on the staged segment workflow added in #6220. The example now follows the actual distributed execution model: shared training artifacts are written to object storage, shard build runs on distributed workers, segment build is also distributed, and the coordinator only plans and commits the final segments.
This should make the guide much easier to follow for users integrating Lance with their own scheduler on S3 or other object stores.
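For readers skimming this description, the division of labor can be sketched as a small simulation. This is not the example from the guide and does not call Lance: the dict `object_store` stands in for S3, `ThreadPoolExecutor` stands in for a real scheduler, and all function names and the toy "index" math are assumptions made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

object_store = {}  # stand-in for S3 or another object store


def train_shared_artifacts(sample):
    # Coordinator: write shared training output where workers can read it.
    object_store["artifacts/centroid"] = sum(sample) / len(sample)


def build_shard(shard_id, rows):
    # Worker: read the shared artifacts, write one partial index.
    centroid = object_store["artifacts/centroid"]
    key = f"partial/{shard_id}"
    object_store[key] = sorted(r - centroid for r in rows)
    return key


def build_segment(segment_id, partial_keys):
    # Worker: merge the partial indices assigned to this segment by the plan.
    merged = sorted(v for k in partial_keys for v in object_store[k])
    key = f"segment/{segment_id}"
    object_store[key] = merged
    return key


shards = {0: [1.0, 5.0], 1: [3.0, 7.0], 2: [2.0, 6.0]}
train_shared_artifacts([v for rows in shards.values() for v in rows])

with ThreadPoolExecutor() as pool:  # stand-in for a real scheduler
    partials = list(pool.map(build_shard, shards.keys(), shards.values()))
    plan = [partials[:2], partials[2:]]  # coordinator plans the segments
    segment_keys = list(pool.map(build_segment, range(len(plan)), plan))

committed = sorted(segment_keys)  # coordinator commits the final segments
```

The point of the sketch is only the shape: the coordinator touches training, planning, and commit, while both build stages run on workers that communicate through the object store.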