feat!: support sampling selected fragments #6294
PR Review
Clean, well-structured change. The refactoring of …
One minor observation (not blocking)
Double row counting for fragment-subset path: When …
Tests
Good coverage: subset sampling with deletions, empty fragment list rejection, unknown fragment ID rejection. The duplicate fragment ID in …
LGTM.
16ffe50 to 06563d7
c5aed6c to 06563d7
wjones127 left a comment
This looks good. Have a few minor suggestions on tests, but those are optional.
This PR builds on #6294 and exposes the remaining pieces needed to construct non-shared centroid vector index builds. It adds fragment-scoped IVF/PQ training in Rust and exports the same training flow to Python, so users can train per-segment artifacts and feed them into the existing distributed build path.
This extends `Dataset::sample` to accept optional fragment IDs so grouped training jobs can sample from selected fragments without reimplementing deletion-aware row offset mapping. It rejects empty fragment selections and unknown fragment IDs while preserving the existing behavior for whole-dataset sampling.
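The behavior described above can be sketched in a few lines of standalone Rust. This is only an illustration under stated assumptions, not the code from this PR: `Fragment`, `select_fragments`, `live_to_physical`, and `sample` are hypothetical names, the real `Dataset::sample` signature and error type may differ, duplicate IDs are simply collapsed here, and rows are picked at an even stride rather than randomly so the result is deterministic.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a dataset fragment (not the real Lance type).
struct Fragment {
    id: u32,
    physical_rows: u64,
    /// Local offsets of deleted rows within this fragment.
    deleted: BTreeSet<u64>,
}

impl Fragment {
    fn live_rows(&self) -> u64 {
        self.physical_rows - self.deleted.len() as u64
    }
}

/// Resolve an optional fragment selection. `None` keeps the whole-dataset
/// behavior; an empty list or an unknown ID is rejected.
fn select_fragments<'a>(
    fragments: &'a [Fragment],
    ids: Option<&[u32]>,
) -> Result<Vec<&'a Fragment>, String> {
    let ids = match ids {
        None => return Ok(fragments.iter().collect()),
        Some(ids) => ids,
    };
    if ids.is_empty() {
        return Err("fragment selection must not be empty".to_string());
    }
    // Assumption: duplicate IDs collapse into one entry.
    let wanted: BTreeSet<u32> = ids.iter().copied().collect();
    let known: BTreeSet<u32> = fragments.iter().map(|f| f.id).collect();
    if let Some(missing) = wanted.difference(&known).next() {
        return Err(format!("unknown fragment id {}", missing));
    }
    Ok(fragments.iter().filter(|f| wanted.contains(&f.id)).collect())
}

/// Map the k-th live (non-deleted) row of a fragment to its physical offset.
fn live_to_physical(frag: &Fragment, live_idx: u64) -> u64 {
    let mut physical = live_idx;
    // `deleted` iterates in ascending order; each deleted offset at or
    // before the current candidate shifts the candidate forward by one.
    for &d in &frag.deleted {
        if d <= physical {
            physical += 1;
        } else {
            break;
        }
    }
    physical
}

/// Pick up to `n` evenly spaced live rows as (fragment id, physical offset).
fn sample(
    fragments: &[Fragment],
    n: usize,
    ids: Option<&[u32]>,
) -> Result<Vec<(u32, u64)>, String> {
    let selected = select_fragments(fragments, ids)?;
    let total: u64 = selected.iter().map(|f| f.live_rows()).sum();
    let n = (n as u64).min(total);
    let mut out = Vec::with_capacity(n as usize);
    for i in 0..n {
        // Offset into the concatenated live rows of the selected fragments.
        let mut global = i * total / n;
        for frag in &selected {
            let live = frag.live_rows();
            if global < live {
                out.push((frag.id, live_to_physical(frag, global)));
                break;
            }
            global -= live;
        }
    }
    Ok(out)
}
```

The point of the sketch is that a caller only passes fragment IDs: translating a position among live rows into a physical offset that skips deleted rows happens in one place, which is the mapping the description says grouped training jobs should not have to reimplement.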