Skip to content

feat: support non-shared centroid vector index builds#6296

Merged
Xuanwo merged 6 commits intomainfrom
xuanwo/non-shared-centroid-index-build
Mar 27, 2026
Merged

feat: support non-shared centroid vector index builds#6296
Xuanwo merged 6 commits intomainfrom
xuanwo/non-shared-centroid-index-build

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Mar 25, 2026

This PR builds on #6294 and exposes the remaining pieces needed to construct non-shared centroid vector index builds.

It adds fragment-scoped IVF/PQ training in Rust and exports the same training flow to Python, so users can train per-segment artifacts and feed them into the existing distributed build path.

@github-actions github-actions Bot added enhancement New feature or request python labels Mar 25, 2026
@Xuanwo Xuanwo marked this pull request as draft March 25, 2026 18:15
@github-actions
Copy link
Copy Markdown
Contributor

PR Review: feat: support non-shared centroid vector index builds

Clean PR overall — the plumbing of fragment_ids through IVF/PQ training and into Python is straightforward and well-tested.

Issues

P1 — Memory risk in fragment-scoped nullable sampling (rust/lance/src/index/vector/utils.rs)

When fragment_ids is Some and the column is nullable, sample_training_data falls back to scan_all_training_data (loads everything into memory) then samples with arrow_select::take. The non-fragment path uses streaming sample_nullable_fsl which is much more memory-efficient. For large fragments with nullable vectors this could OOM. At minimum, add a comment noting this limitation; ideally wire the streaming path through fragment-scoped scans too.

// current fragment_ids path for nullable:
let batch = scan_all_training_data(dataset, column, is_nullable, fragment_ids).await?;
// vs non-fragment path:
sample_nullable_fsl(column, sample_size_hint, byte_width, vector_field, scan).await

P1 — Dead guard condition (rust/lance/src/index/vector/utils.rs:~460)

The match arm DataType::FixedSizeList(_, _) if !is_nullable && fragment_ids.is_none() adds a fragment_ids.is_none() guard, but this code is unreachable when fragment_ids.is_some() because of the early return above. The added guard is dead code and misleading — remove it to keep the logic clear.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the java label Mar 25, 2026
@Xuanwo Xuanwo marked this pull request as ready for review March 25, 2026 18:30
@Xuanwo Xuanwo changed the title feat: support non-shared centroid vector index builds [WIP] feat: support non-shared centroid vector index builds Mar 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@Xuanwo Xuanwo changed the base branch from xuanwo/sample-fragment-ids to main March 26, 2026 10:15
@Xuanwo Xuanwo changed the title [WIP] feat: support non-shared centroid vector index builds feat: support non-shared centroid vector index builds Mar 26, 2026
…troid-index-build

# Conflicts:
#	python/src/indices.rs
#	rust/lance/src/index/vector/utils.rs
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 44.68085% with 52 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/index/vector/utils.rs 23.07% 43 Missing and 7 partials ⚠️
rust/lance/src/index/vector/ivf.rs 92.85% 0 Missing and 1 partial ⚠️
rust/lance/src/index/vector/pq.rs 92.30% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@Xuanwo Xuanwo merged commit 92c45f6 into main Mar 27, 2026
29 checks passed
@Xuanwo Xuanwo deleted the xuanwo/non-shared-centroid-index-build branch March 27, 2026 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants