fix(index): preserve stable row-id entries during scalar index optimize #6117
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
PR Review

Summary: This PR fixes a real bug where …

Design: The approach of introducing … One concern: deletion vectors not accounted for. In … However, this matches the existing behavior of the fragment-based filter for address-style row IDs — …

Tests: Good coverage: both a unit test (…) …

Minor nit: In …

```rust
let mut result = RowAddrTreeMap::new();
for (_, seq) in row_id_sequences {
    let map = RowAddrTreeMap::from(seq.as_ref());
    result = RowAddrTreeMap::union(&result, &map);
}
```

Not blocking — the current approach is clear and correct, just slightly more allocations than necessary.

Overall this looks good. The bug analysis is solid, the fix is targeted, and the tests are appropriate.
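A dependency-free illustration of the reviewer's nit, with `std::collections::BTreeSet` standing in for `RowAddrTreeMap` (an assumption for the sketch): extending one accumulator in place avoids allocating a fresh union result on every iteration.

```rust
use std::collections::BTreeSet;

// Current shape: build a set per sequence, union into a new allocation each time.
fn union_per_sequence(sequences: &[Vec<u64>]) -> BTreeSet<u64> {
    let mut result = BTreeSet::new();
    for seq in sequences {
        let map: BTreeSet<u64> = seq.iter().copied().collect();
        result = &result | &map; // allocates a new set every iteration
    }
    result
}

// Nit's shape: one accumulator, extended in place.
fn extend_in_place(sequences: &[Vec<u64>]) -> BTreeSet<u64> {
    let mut result = BTreeSet::new();
    for seq in sequences {
        result.extend(seq.iter().copied());
    }
    result
}

fn main() {
    let seqs = vec![vec![1, 2], vec![2, 3], vec![10]];
    // Both shapes produce the same union; only the allocation pattern differs.
    assert_eq!(union_per_sequence(&seqs), extend_in_place(&seqs));
    println!("ok");
}
```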
@Xuanwo could you please help review this PR when you have time? This bug is currently blocking my normal production use of LanceDB. In my case it causes indexed equality lookups to fail after … I also wrote up the full analysis and fix here: … Thanks a lot.
wjones127
left a comment
This fix looks correct to me, but I'd like the implementation to be refactored a bit so that each index implementation isn't exposed to the details of old- vs new-style IDs.
```rust
// Two filtering strategies:
// - Fragments: fast path for address-style row IDs (fragment id is encoded in row_id)
// - RowIds: exact allow-list for stable row IDs (row_id bits are opaque)
let old_stream = match old_data_filter {
    Some(OldIndexDataFilter::Fragments(valid_frags)) => {
        filter_row_ids_by_fragments(old_stream, valid_frags)
    },
    Some(OldIndexDataFilter::RowIds(valid_row_ids)) => {
        filter_row_ids_by_exact_set(old_stream, valid_row_ids)
    },
```
It would be nice if OldIndexDataFilter encapsulated this distinction within. Maybe something like:

```rust
impl OldIndexDataFilter {
    pub fn filter_row_ids(&self, row_ids: &UInt64Array) -> BooleanArray {
        match self {
            Self::Fragments(frags) => ...,
            Self::RowIds(row_ids) => ...,
        }
    }
}
```
Thanks, this is a good suggestion
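A dependency-free sketch of how the suggested method could look once fleshed out. Everything here is illustrative: the enum payloads (`HashSet`s), the plain-slice signature (the real code takes an Arrow `UInt64Array` and returns a `BooleanArray`), and the 32/32 address split are assumptions, not the PR's actual code.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the PR's enum.
enum OldIndexDataFilter {
    Fragments(HashSet<u32>), // address-style: fragment id in the upper 32 bits
    RowIds(HashSet<u64>),    // stable ids: exact allow-list of opaque row ids
}

impl OldIndexDataFilter {
    // One mask entry per input row id: true = keep the old index entry.
    fn filter_row_ids(&self, row_ids: &[u64]) -> Vec<bool> {
        match self {
            Self::Fragments(frags) => row_ids
                .iter()
                .copied()
                .map(|id| frags.contains(&((id >> 32) as u32)))
                .collect(),
            Self::RowIds(valid) => row_ids.iter().map(|id| valid.contains(id)).collect(),
        }
    }
}

fn main() {
    // Address-style ids: rows in fragment 2 survive, fragment 7 does not.
    let frag_filter = OldIndexDataFilter::Fragments(HashSet::from([2u32]));
    let addrs = [(2u64 << 32) | 5, (7u64 << 32) | 5];
    assert_eq!(frag_filter.filter_row_ids(&addrs), vec![true, false]);

    // Stable ids: opaque values, matched exactly.
    let id_filter = OldIndexDataFilter::RowIds(HashSet::from([42u64]));
    assert_eq!(id_filter.filter_row_ids(&[42, 43]), vec![true, false]);
    println!("ok");
}
```

Either way, the index implementations only ever see a boolean mask and never branch on the row-id style themselves.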
Codecov Report: ❌ Patch coverage is …
KMeansParams::seed (Option<u64>) lets callers pin the RNG used for centroid initialization and empty-cluster splitting, replacing the `// TODO`-flagged unconditional `SmallRng::from_os_rng()`. When set, training over the same data is reproducible — required by integration tests of IVF-based vector indexes that assert on top-K recall (without a stable seed, those tests are inherently flaky because KMeans converges to different local minima per OS-entropy seed).

KMeans::to_kmeans gains a `&mut dyn rand::RngCore` parameter so the caller's seeded RNG threads through to split_clusters (the only internal randomness consumer beyond init). Default behavior (seed == None) is unchanged: every call reseeds from OS entropy as before.

IvfBuildParams gains a parallel `seed: Option<u64>` field that propagates into the KMeansParams constructed by do_train_ivf_model. derive_ivf_params (delta-index path) sets it to None — delta indexes inherit centroids and don't retrain.

Also fixes a pre-existing rebase miss on compound_btree.rs (4 test-binary call sites of `index.update` missing the third `old_data_filter` argument added by upstream PR lance-format#6117 after our compound-index work forked).

Tests:
- vector::kmeans::tests::test_seed_pins_kmeans_output (new) — two trainings with the same seed produce byte-identical centroids; different seeds diverge (a sanity check that the seed actually drives the RNG).
- All 8 existing vector::kmeans tests still pass.
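The seed-or-OS-entropy pattern described above can be sketched as follows. To keep the sketch dependency-free, a tiny splitmix64 generator stands in for `rand::rngs::SmallRng`, and a timestamp stands in for OS entropy — both are assumptions for illustration only.

```rust
// Stand-in RNG (splitmix64); the real code uses rand::rngs::SmallRng.
struct Rng(u64);

impl Rng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

// Mirror of the described API: Some(seed) pins the RNG for reproducible
// training; None keeps the old behavior of reseeding per call.
fn make_rng(seed: Option<u64>) -> Rng {
    match seed {
        Some(s) => Rng(s),
        None => Rng(std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_nanos() as u64), // stand-in for OS entropy
    }
}

fn main() {
    // Same seed -> identical stream (what the new test asserts for centroids).
    let (mut a, mut b) = (make_rng(Some(42)), make_rng(Some(42)));
    assert_eq!(a.next_u64(), b.next_u64());

    // Different seeds diverge.
    let (mut c, mut d) = (make_rng(Some(1)), make_rng(Some(2)));
    assert_ne!(c.next_u64(), d.next_u64());
    println!("ok");
}
```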
Preserve stable row-id entries during scalar index optimize
Fixes #6116
Summary
This PR fixes a bug where `optimize_indices()` could drop valid BTree index entries when the dataset used stable row IDs.

I hit this while building the music module of StaticFlow, my personal project built on top of Lance/LanceDB. The `songs` dataset uses:

- `enable_stable_row_ids = true`
- a BTree index on `id`

After running:

- `compact_files()`
- `optimize_indices()`

full scans still returned the expected rows, but indexed equality lookups such as `id = 'song-42'` returned no rows.

Root Cause
The old optimize path filtered old BTree rows with logic equivalent to: …

That is correct for address-style row IDs: …
But it is incorrect for stable row IDs, because stable row IDs are opaque logical IDs and do not encode fragment ownership in their upper bits.
As a result, valid old index rows could be removed during optimize even though the underlying rows were still present after compaction.
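The failure mode above can be made concrete with a small sketch. It assumes the conventional address layout (fragment id in the upper 32 bits, row offset in the lower 32); the takeaway is that applying that interpretation to an opaque stable id produces a meaningless fragment number.

```rust
// Address-style decoding (assumed 32/32 layout for illustration).
fn fragment_of(row_addr: u64) -> u32 {
    (row_addr >> 32) as u32
}

fn main() {
    // Address-style: row 7 of fragment 3 really does encode fragment 3.
    let addr = (3u64 << 32) | 7;
    assert_eq!(fragment_of(addr), 3);

    // Stable row id: a small opaque value. Interpreted as an address, it
    // claims to live in "fragment 0". A filter that keeps only entries whose
    // decoded fragment survived compaction will therefore discard valid
    // entries whose rows are still present.
    let stable_id = 12345u64;
    assert_eq!(fragment_of(stable_id), 0);
    println!("ok");
}
```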
What This PR Changes
Implementation Notes
The key change is to stop assuming that every `row_id` can be interpreted as a row address.

For stable-row-ID datasets, the optimize path now: …
This preserves the existing fast path for address-style row IDs and only uses exact row-ID filtering when the dataset actually uses stable row IDs.
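A minimal sketch of what exact row-ID filtering means here. All names and data shapes are hypothetical (the real code streams Arrow batches): collect the stable ids that survive compaction into an allow-list, then keep only old index entries whose `row_id` appears in it.

```rust
use std::collections::HashSet;

// Flatten the stable ids of all surviving fragments into one allow-list.
fn build_allow_list(surviving_sequences: &[Vec<u64>]) -> HashSet<u64> {
    surviving_sequences.iter().flatten().copied().collect()
}

// Keep an old index entry only if its row id is still live.
fn keep_entries(old_index_row_ids: &[u64], allow: &HashSet<u64>) -> Vec<u64> {
    old_index_row_ids
        .iter()
        .copied()
        .filter(|id| allow.contains(id))
        .collect()
}

fn main() {
    // Two fragments survived compaction with these stable ids.
    let allow = build_allow_list(&[vec![1, 2, 3], vec![40, 42]]);

    // The old index also references id 99, whose row was deleted.
    let kept = keep_entries(&[1, 42, 99], &allow);
    assert_eq!(kept, vec![1, 42]);
    println!("ok");
}
```

The trade-off versus the fragment fast path is a membership test per entry rather than a bit-shift, which is why the exact set is only built when the dataset actually uses stable row IDs.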
Additional Context
I also wrote a longer deep dive covering the bug, the stable-row-ID model, and the full repair process: …
Final Note
This PR description and parts of the implementation were prepared with assistance from Codex GPT-5.4. I have personally reviewed the final patch, and it has been running normally in my production StaticFlow environment for one week.