
feat: add index segment commit API #6209

Merged
Xuanwo merged 14 commits into main from xuanwo/index-segment-commit-api
Mar 18, 2026

Xuanwo (Collaborator) commented Mar 17, 2026

This adds a first-class IndexSegment type and a commit_existing_index_segments API so a logical index can be committed from multiple physical segments. It is intended as a building block for the multi-segment vector index work without changing the high-level create_index flow.

Part of #6180

github-actions bot added the `enhancement` (New feature or request) label Mar 17, 2026
@github-actions
Contributor

PR Review: feat: add index segment commit API

Overall this is a clean, well-tested PR. A few items worth discussing:

P1: Truncation casting u64 fragment id → u32

Fragment.id is u64, but iter_fragment_ids() casts via f.id as u32, which silently truncates if a fragment id ever exceeds u32::MAX. This pre-existing pattern is used elsewhere (scanner, get_fragments), but codifying it in a new public API method is a good time to add a safety check or at least document the invariant. Consider u32::try_from(f.id).expect("fragment id exceeds u32") or a comment explaining why truncation is safe.
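
A minimal sketch of the checked conversion suggested above, returning an error rather than panicking. The `Fragment` struct and the `checked_fragment_ids` helper are hypothetical stand-ins for illustration, not the actual Lance types or API:

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a fragment record; only `id` matters here.
struct Fragment {
    id: u64,
}

/// Convert fragment ids to `u32`, failing loudly instead of truncating.
/// (Assumed helper name; not part of Lance.)
fn checked_fragment_ids(fragments: &[Fragment]) -> Result<Vec<u32>, String> {
    fragments
        .iter()
        .map(|f| {
            // `try_from` returns Err when the value does not fit in u32,
            // whereas `f.id as u32` would keep only the low 32 bits.
            u32::try_from(f.id)
                .map_err(|_| format!("fragment id {} exceeds u32::MAX", f.id))
        })
        .collect()
}
```

With an `as` cast, an id of `u32::MAX + 1` would wrap to `0` and collide with a real fragment; the checked version surfaces that case as an error at the call site.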

P1: IndexSegment duplicates every field of IndexMetadata (minus name/fields/dataset_version)

IndexSegment has the exact same field types and names as the corresponding IndexMetadata fields. This creates a maintenance risk — any future field added to IndexMetadata (e.g., a new optional attribute) must be manually mirrored. Consider either:

  • Making IndexSegment contain an IndexMetadata internally and having the commit API fill in the shared fields, or
  • Adding a From<IndexSegment> / builder that takes the shared fields as parameters

This is a design suggestion, not a blocker — but worth thinking about before the type becomes part of the public API surface.
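
One way the second option could look, as a rough sketch. The structs below are simplified, hypothetical stand-ins: only field names that appear in this thread are used, the field types are assumptions, and `into_metadata` is an illustrative name rather than an existing Lance method:

```rust
/// Hypothetical, simplified per-segment data. The real `IndexSegment`
/// carries more fields; the types here are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
struct IndexSegment {
    index_version: i32,
    base_id: Option<u32>,
}

/// Hypothetical, simplified logical-index metadata.
#[derive(Debug, Clone, PartialEq)]
struct IndexMetadata {
    name: String,
    fields: Vec<i32>,
    dataset_version: u64,
    index_version: i32,
    base_id: Option<u32>,
}

impl IndexSegment {
    /// Combine per-segment data with the fields shared by the logical index.
    /// Destructuring without `..` means adding a field to `IndexSegment`
    /// becomes a compile error here until it is handled.
    fn into_metadata(self, name: &str, fields: &[i32], dataset_version: u64) -> IndexMetadata {
        let IndexSegment { index_version, base_id } = self;
        IndexMetadata {
            name: name.to_string(),
            fields: fields.to_vec(),
            dataset_version,
            index_version,
            base_id,
        }
    }
}
```

This keeps the two types separate but funnels the field mirroring through a single function, so a forgotten field fails at compile time instead of being silently dropped.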

Minor

  • The chrono dependency added to lance-index (Cargo.toml) appears unused — IndexSegment lives in lance-table, and the traits file doesn't use chrono directly. Verify this is needed.

Tests look thorough — good coverage of the happy path, duplicate-segment rejection, empty-segments rejection, and backward-compat wrapper.
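
For readers unfamiliar with the rejection cases mentioned above, here is a toy model of that kind of validation. It is not the `commit_existing_index_segments` implementation, whose signature is not shown in this thread; the `Segment` type, the `CommitError` enum, and the assumption that "duplicate" means a repeated segment id are all hypothetical:

```rust
use std::collections::HashSet;

/// Hypothetical, pared-down segment identified by a string id.
#[derive(Debug, Clone)]
struct Segment {
    id: String,
}

/// Assumed error cases, mirroring the two rejections named in the review.
#[derive(Debug, PartialEq)]
enum CommitError {
    EmptySegments,
    DuplicateSegment(String),
}

/// Check the segments that would back one logical index before committing.
fn validate_segments(segments: &[Segment]) -> Result<(), CommitError> {
    if segments.is_empty() {
        return Err(CommitError::EmptySegments);
    }
    let mut seen = HashSet::new();
    for segment in segments {
        // `insert` returns false when the id was already present.
        if !seen.insert(segment.id.as_str()) {
            return Err(CommitError::DuplicateSegment(segment.id.clone()));
        }
    }
    Ok(())
}
```

Validating the whole list up front means a bad input is rejected before any metadata is written, which is the behavior the tests described above appear to exercise.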

codecov Bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 94.64286% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/types.rs | 74.28% | 9 Missing ⚠️ |

Member

@westonpace westonpace left a comment

Some suggestions but looks good overall

Comment thread rust/lance-index/src/types.rs
Comment thread rust/lance/src/dataset/mem_wal/memtable/flush.rs Outdated
Comment thread rust/lance/src/index/vector/ivf/v2.rs
Comment thread rust/lance/src/dataset.rs Outdated
Comment thread rust/lance/src/index.rs Outdated
Comment thread rust/lance/src/index.rs Outdated
Comment thread rust/lance-index/src/types.rs Outdated
Comment thread rust/lance/src/index.rs
index_details,
index_version,
created_at: Some(chrono::Utc::now()),
base_id: None,
Member

Is there no way to specify a base id? Not that important for now but I think we'll eventually need it.

Collaborator Author

IndexSegment exposes this, but no upper-level API uses it yet; we need to refactor the upper-level APIs first. Yes, we can treat this as a follow-up.

@Xuanwo Xuanwo merged commit c9e5b1e into main Mar 18, 2026
28 checks passed
@Xuanwo Xuanwo deleted the xuanwo/index-segment-commit-api branch March 18, 2026 17:22
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 21, 2026
The method was renamed in lance-format#6209 but the test call site in v2.rs was not
updated during the merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Xuanwo added a commit that referenced this pull request Mar 21, 2026
This refactors distributed vector indexing into a staged segment-build
pipeline and exposes the public APIs needed to integrate an external
distributed build workflow with Lance. It defines the storage-level
model for partial shards, staged planning, built segments, and segment
commit, and documents the current distributed indexing flow.

This PR builds on the segment commit API from
#6209. The main changes are
organized into five commits:

- [test: cover distributed vector segment
build](a86274da2)
- [refactor: internalize distributed vector segment
build](1e5f0e15b)
- [feat: add public vector segment builder
API](691cecb9a)
- [feat: add Python vector segment builder
API](a07ef6144)
- [docs: document distributed vector segment
build](0a3f230d7)

Follow-up fixes:

- [fix: expose python
create_index_uncommitted](c1d3b1666)
- [fix: format python
bindings](b86f91c4d)
- [fix: format python segment builder
bindings](bfc9e63a0)

Please review accordingly.

---

Guide:
https://github.com/lance-format/lance/blob/xuanwo/vector-staging-merge-internal/docs/src/guide/distributed_indexing.md
westonpace pushed a commit that referenced this pull request Mar 24, 2026
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026
This adds a first-class `IndexSegment` type and a
`commit_existing_index_segments` API so a logical index can be committed
from multiple physical segments. It is intended as a building block for
the multi-segment vector index work without changing the high-level
`create_index` flow.

Part of lance-format#6180
wjones127 added a commit to wjones127/lance that referenced this pull request Mar 29, 2026
The method was renamed in lance-format#6209 but the test call site in v2.rs was not
updated during the merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026