fix(index): preserve stable row-id entries during scalar index optimize #6117
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
PR Review

Summary: This PR fixes a real bug where …

Design: The approach of introducing … One concern: deletion vectors not accounted for. In … However, this matches the existing behavior of the fragment-based filter for address-style row IDs — …

Tests: Good coverage: both a unit test (…) …

Minor nit: In …

```rust
let mut result = RowAddrTreeMap::new();
for (_, seq) in row_id_sequences {
    let map = RowAddrTreeMap::from(seq.as_ref());
    result = RowAddrTreeMap::union(&result, &map);
}
```

Not blocking — the current approach is clear and correct, just slightly more allocations than necessary.

Overall this looks good. The bug analysis is solid, the fix is targeted, and the tests are appropriate.
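A dependency-free illustration of the reviewer's nit, with `std::collections::BTreeSet` standing in for `RowAddrTreeMap` (an assumption for the sketch): extending one accumulator in place avoids allocating a fresh union result on every iteration.

```rust
use std::collections::BTreeSet;

// Current shape: build a set per sequence, union into a new allocation each time.
fn union_per_sequence(sequences: &[Vec<u64>]) -> BTreeSet<u64> {
    let mut result = BTreeSet::new();
    for seq in sequences {
        let map: BTreeSet<u64> = seq.iter().copied().collect();
        result = &result | &map; // allocates a new set every iteration
    }
    result
}

// Nit's shape: one accumulator, extended in place.
fn extend_in_place(sequences: &[Vec<u64>]) -> BTreeSet<u64> {
    let mut result = BTreeSet::new();
    for seq in sequences {
        result.extend(seq.iter().copied());
    }
    result
}

fn main() {
    let seqs = vec![vec![1, 2], vec![2, 3], vec![10]];
    // Both shapes produce the same union; only the allocation pattern differs.
    assert_eq!(union_per_sequence(&seqs), extend_in_place(&seqs));
    println!("ok");
}
```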
@Xuanwo could you please help review this PR when you have time? This bug is currently blocking my normal production use of LanceDB. In my case it causes indexed equality lookups to fail after … I also wrote up the full analysis and fix here: … Thanks a lot.
wjones127
left a comment
This fix looks correct to me, but I'd like the implementation to be refactored a bit so that each index implementation isn't exposed to the details of old- vs new-style IDs.
```rust
// Two filtering strategies:
// - Fragments: fast path for address-style row IDs (fragment id is encoded in row_id)
// - RowIds: exact allow-list for stable row IDs (row_id bits are opaque)
let old_stream = match old_data_filter {
    Some(OldIndexDataFilter::Fragments(valid_frags)) => {
        filter_row_ids_by_fragments(old_stream, valid_frags)
    },
    Some(OldIndexDataFilter::RowIds(valid_row_ids)) => {
        filter_row_ids_by_exact_set(old_stream, valid_row_ids)
    },
```
It would be nice if OldIndexDataFilter encapsulated this distinction within. Maybe something like:

```rust
impl OldIndexDataFilter {
    pub fn filter_row_ids(&self, row_ids: &UInt64Array) -> BooleanArray {
        match self {
            Self::Fragments(frags) => ...,
            Self::RowIds(row_ids) => ...,
        }
    }
}
```
Thanks, this is a good suggestion
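A dependency-free sketch of how the suggested method could look once fleshed out. Everything here is illustrative: the enum payloads (`HashSet`s), the plain-slice signature (the real code takes an Arrow `UInt64Array` and returns a `BooleanArray`), and the 32/32 address split are assumptions, not the PR's actual code.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the PR's enum.
enum OldIndexDataFilter {
    Fragments(HashSet<u32>), // address-style: fragment id in the upper 32 bits
    RowIds(HashSet<u64>),    // stable ids: exact allow-list of opaque row ids
}

impl OldIndexDataFilter {
    // One mask entry per input row id: true = keep the old index entry.
    fn filter_row_ids(&self, row_ids: &[u64]) -> Vec<bool> {
        match self {
            Self::Fragments(frags) => row_ids
                .iter()
                .copied()
                .map(|id| frags.contains(&((id >> 32) as u32)))
                .collect(),
            Self::RowIds(valid) => row_ids.iter().map(|id| valid.contains(id)).collect(),
        }
    }
}

fn main() {
    // Address-style ids: rows in fragment 2 survive, fragment 7 does not.
    let frag_filter = OldIndexDataFilter::Fragments(HashSet::from([2u32]));
    let addrs = [(2u64 << 32) | 5, (7u64 << 32) | 5];
    assert_eq!(frag_filter.filter_row_ids(&addrs), vec![true, false]);

    // Stable ids: opaque values, matched exactly.
    let id_filter = OldIndexDataFilter::RowIds(HashSet::from([42u64]));
    assert_eq!(id_filter.filter_row_ids(&[42, 43]), vec![true, false]);
    println!("ok");
}
```

Either way, the index implementations only ever see a boolean mask and never branch on the row-id style themselves.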
Codecov Report: ❌ Patch coverage is …
KMeansParams::seed (Option<u64>) lets callers pin the RNG used for centroid initialization and empty-cluster splitting, replacing the `// TODO`-flagged unconditional `SmallRng::from_os_rng()`. When set, training over the same data is reproducible — required by integration tests of IVF-based vector indexes that assert on top-K recall (without a stable seed, those tests are inherently flaky because KMeans converges to different local minima per OS-entropy seed).

KMeans::to_kmeans gains a `&mut dyn rand::RngCore` parameter so the caller's seeded RNG threads through to split_clusters (the only internal randomness consumer beyond init). Default behavior (seed == None) is unchanged: every call reseeds from OS entropy as before.

IvfBuildParams gains a parallel `seed: Option<u64>` field that propagates into the KMeansParams constructed by do_train_ivf_model. derive_ivf_params (delta-index path) sets it to None — delta indexes inherit centroids and don't retrain.

Also fixes a pre-existing rebase miss on compound_btree.rs (4 test-binary call sites of `index.update` missing the third `old_data_filter` argument added by upstream PR lance-format#6117 after our compound-index work forked).

Tests:
- vector::kmeans::tests::test_seed_pins_kmeans_output (new) — two trainings with the same seed produce byte-identical centroids; different seeds diverge (a sanity check that the seed actually drives the RNG).
- All 8 existing vector::kmeans tests still pass.
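The seed-or-OS-entropy pattern described above can be sketched as follows. To keep the sketch dependency-free, a tiny splitmix64 generator stands in for `rand::rngs::SmallRng`, and a timestamp stands in for OS entropy — both are assumptions for illustration only.

```rust
// Stand-in RNG (splitmix64); the real code uses rand::rngs::SmallRng.
struct Rng(u64);

impl Rng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

// Mirror of the described API: Some(seed) pins the RNG for reproducible
// training; None keeps the old behavior of reseeding per call.
fn make_rng(seed: Option<u64>) -> Rng {
    match seed {
        Some(s) => Rng(s),
        None => Rng(std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_nanos() as u64), // stand-in for OS entropy
    }
}

fn main() {
    // Same seed -> identical stream (what the new test asserts for centroids).
    let (mut a, mut b) = (make_rng(Some(42)), make_rng(Some(42)));
    assert_eq!(a.next_u64(), b.next_u64());

    // Different seeds diverge.
    let (mut c, mut d) = (make_rng(Some(1)), make_rng(Some(2)));
    assert_ne!(c.next_u64(), d.next_u64());
    println!("ok");
}
```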
Preserve stable row-id entries during scalar index optimize
Fixes #6116
Summary
This PR fixes a bug where `optimize_indices()` could drop valid BTree index entries when the dataset used stable row IDs.

I hit this while building the music module of StaticFlow, my personal project built on top of Lance/LanceDB. The `songs` dataset uses:

- `enable_stable_row_ids = true`
- a BTree index on `id`

After running:

- `compact_files()`
- `optimize_indices()`

full scans still returned the expected rows, but indexed equality lookups such as `id = 'song-42'` returned no rows.

Root Cause
The old optimize path filtered old BTree rows with logic equivalent to: …

That is correct for address-style row IDs: …
But it is incorrect for stable row IDs, because stable row IDs are opaque logical IDs and do not encode fragment ownership in their upper bits.
As a result, valid old index rows could be removed during optimize even though the underlying rows were still present after compaction.
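The failure mode above can be made concrete with a small sketch. It assumes the conventional address layout (fragment id in the upper 32 bits, row offset in the lower 32); the takeaway is that applying that interpretation to an opaque stable id produces a meaningless fragment number.

```rust
// Address-style decoding (assumed 32/32 layout for illustration).
fn fragment_of(row_addr: u64) -> u32 {
    (row_addr >> 32) as u32
}

fn main() {
    // Address-style: row 7 of fragment 3 really does encode fragment 3.
    let addr = (3u64 << 32) | 7;
    assert_eq!(fragment_of(addr), 3);

    // Stable row id: a small opaque value. Interpreted as an address, it
    // claims to live in "fragment 0". A filter that keeps only entries whose
    // decoded fragment survived compaction will therefore discard valid
    // entries whose rows are still present.
    let stable_id = 12345u64;
    assert_eq!(fragment_of(stable_id), 0);
    println!("ok");
}
```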
What This PR Changes
Implementation Notes
The key change is to stop assuming that every `row_id` can be interpreted as a row address.

For stable-row-ID datasets, the optimize path now: …
This preserves the existing fast path for address-style row IDs and only uses exact row-ID filtering when the dataset actually uses stable row IDs.
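A minimal sketch of what exact row-ID filtering means here. All names and data shapes are hypothetical (the real code streams Arrow batches): collect the stable ids that survive compaction into an allow-list, then keep only old index entries whose `row_id` appears in it.

```rust
use std::collections::HashSet;

// Flatten the stable ids of all surviving fragments into one allow-list.
fn build_allow_list(surviving_sequences: &[Vec<u64>]) -> HashSet<u64> {
    surviving_sequences.iter().flatten().copied().collect()
}

// Keep an old index entry only if its row id is still live.
fn keep_entries(old_index_row_ids: &[u64], allow: &HashSet<u64>) -> Vec<u64> {
    old_index_row_ids
        .iter()
        .copied()
        .filter(|id| allow.contains(id))
        .collect()
}

fn main() {
    // Two fragments survived compaction with these stable ids.
    let allow = build_allow_list(&[vec![1, 2, 3], vec![40, 42]]);

    // The old index also references id 99, whose row was deleted.
    let kept = keep_entries(&[1, 42, 99], &allow);
    assert_eq!(kept, vec![1, 42]);
    println!("ok");
}
```

The trade-off versus the fragment fast path is a membership test per entry rather than a bit-shift, which is why the exact set is only built when the dataset actually uses stable row IDs.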
Additional Context
I also wrote a longer deep dive covering the bug, the stable-row-ID model, and the full repair process: …
Final Note
This PR description and parts of the implementation were prepared with assistance from Codex GPT-5.4. I have personally reviewed the final patch, and it has been running normally in my production StaticFlow environment for one week.