Skip to content

feat!: support sampling selected fragments#6294

Merged
Xuanwo merged 5 commits intomainfrom
xuanwo/sample-fragment-ids
Mar 26, 2026
Merged

feat!: support sampling selected fragments#6294
Xuanwo merged 5 commits intomainfrom
xuanwo/sample-fragment-ids

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Mar 25, 2026

This extends Dataset::sample to accept optional fragment IDs so grouped training jobs can sample from selected fragments without reimplementing deletion-aware row offset mapping. It rejects empty fragment selections and unknown fragment IDs while preserving the existing behavior for whole-dataset sampling.

@github-actions github-actions Bot added the enhancement New feature or request label Mar 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review

Clean, well-structured change. The refactoring of row_offsets_to_row_addresses to accept &[FileFragment] instead of &Dataset is a nice generalization that enables the subset sampling without duplicating logic.

One minor observation (not blocking)

Double row counting for fragment-subset path: When fragment_ids is Some, count_rows is called for each selected fragment twice — once in the buffer_unordered(16) fold to compute num_rows, and again sequentially inside row_offsets_to_row_addresses. If count_rows isn't cached, this doubles the I/O for deletion vector reads. You could pass the precomputed total into row_offsets_to_row_addresses or pre-collect the per-fragment counts, but it's likely negligible for typical workloads.

Tests

Good coverage: subset sampling with deletions, empty fragment list rejection, unknown fragment ID rejection. The duplicate fragment ID in test_sample_with_fragment_ids ([0, 0, 2]) is a nice touch for exercising the dedup via BTreeSet.

LGTM.

@Xuanwo Xuanwo force-pushed the xuanwo/sample-fragment-ids branch from 16ffe50 to 06563d7 Compare March 25, 2026 17:26
@Xuanwo Xuanwo changed the title feat: support sampling selected fragments feat!: support sampling selected fragments Mar 25, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 87.93103% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset.rs 90.38% 0 Missing and 5 partials ⚠️
rust/lance/src/dataset/take.rs 66.66% 0 Missing and 1 partial ⚠️
rust/lance/src/index/vector/utils.rs 0.00% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Have a few minor suggestions on tests, but those are optional.

Comment thread rust/lance/src/dataset/tests/dataset_io.rs
Comment thread rust/lance/src/dataset/tests/dataset_io.rs
Comment thread rust/lance/src/dataset/tests/dataset_io.rs Outdated
@Xuanwo Xuanwo merged commit f0aa55f into main Mar 26, 2026
28 checks passed
@Xuanwo Xuanwo deleted the xuanwo/sample-fragment-ids branch March 26, 2026 20:33
Xuanwo added a commit that referenced this pull request Mar 27, 2026
This PR builds on #6294 and exposes the remaining pieces needed to
construct non-shared centroid vector index builds.

It adds fragment-scoped IVF/PQ training in Rust and exports the same
training flow to Python, so users can train per-segment artifacts and
feed them into the existing distributed build path.
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants