
fix(index): preserve stable row-id entries during scalar index optimize #6117

Merged
Xuanwo merged 5 commits into lance-format:main from acking-you:fix/stable-row-id-index-filter-clean
Mar 9, 2026

Conversation

@acking-you
Contributor

@acking-you acking-you commented Mar 6, 2026

Preserve stable row-id entries during scalar index optimize

Fixes #6116

Summary

This PR fixes a bug where optimize_indices() could drop valid BTree index entries when the dataset used stable row IDs.

I hit this while building the music module of StaticFlow, my personal project built on top of Lance/LanceDB. The songs dataset uses:

  • enable_stable_row_ids = true
  • a BTree scalar index on id

After running:

  1. compact_files()
  2. optimize_indices()

full scans still returned the expected rows, but indexed equality lookups such as id = 'song-42' returned no rows.

Root Cause

The old optimize path filtered old BTree rows with logic equivalent to:

valid_fragments.contains((row_id >> 32) as u32)

That is correct for address-style row IDs:

row_id = (fragment_id << 32) | row_offset

But it is incorrect for stable row IDs, because stable row IDs are opaque logical IDs and do not encode fragment ownership in their upper bits.

As a result, valid old index rows could be removed during optimize even though the underlying rows were still present after compaction.
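The misread above can be demonstrated in a few lines. This is a minimal standalone sketch, not Lance's actual code: `address_row_id` and `passes_fragment_filter` are hypothetical helpers that mirror the address-style encoding and the old filter logic described here.

```rust
use std::collections::HashSet;

/// Address-style row ID: fragment id in the upper 32 bits,
/// row offset in the lower 32 bits.
fn address_row_id(fragment_id: u32, row_offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | row_offset as u64
}

/// The old filter logic, which assumes every row_id is an address.
fn passes_fragment_filter(row_id: u64, valid_fragments: &HashSet<u32>) -> bool {
    valid_fragments.contains(&((row_id >> 32) as u32))
}

fn main() {
    let valid: HashSet<u32> = [7].into_iter().collect();

    // Address-style ID from retained fragment 7: correctly kept.
    assert!(passes_fragment_filter(address_row_id(7, 3), &valid));

    // A stable row ID is an opaque logical ID, typically a small integer.
    // Its upper 32 bits are zero, so the filter reads "fragment 0" and
    // drops the entry even though the row still exists.
    assert!(!passes_fragment_filter(42, &valid));
}
```

For a stable ID like `42`, the shift yields fragment 0, which is rarely in the retained set after compaction, so the entry is silently discarded.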

What This PR Changes

  • adds an explicit old-data filter mode to scalar index update
  • keeps fragment-based filtering for address-style row IDs
  • builds an exact retained row-ID set for stable-row-ID datasets from persisted row-id sequences
  • filters old BTree rows by exact row-ID membership for the stable-row-ID case
  • adds regression coverage for both the BTree update path and the end-to-end compaction plus optimize flow

Implementation Notes

The key change is to stop assuming that every row_id can be interpreted as a row address.

For stable-row-ID datasets, the optimize path now:

  1. computes the retained old fragments
  2. loads their row-ID sequences
  3. builds one exact retained row-ID set
  4. keeps only old index rows whose row IDs are still valid

This preserves the existing fast path for address-style row IDs and only uses exact row-ID filtering when the dataset actually uses stable row IDs.
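The four steps above can be sketched with plain standard-library types. This is a hedged illustration only: `Vec<u64>` stands in for Lance's persisted row-id sequences, and `HashSet<u64>` for the exact retained set; the real implementation uses Lance's own sequence and tree-map types.

```rust
use std::collections::HashSet;

/// Steps 1-3: union the row-ID sequences of every retained fragment
/// into one exact allow-list. `retained_fragment_sequences` is a
/// placeholder for the sequences loaded from the retained fragments.
fn build_retained_row_id_set(retained_fragment_sequences: &[Vec<u64>]) -> HashSet<u64> {
    retained_fragment_sequences
        .iter()
        .flat_map(|seq| seq.iter().copied())
        .collect()
}

/// Step 4: keep only old index rows whose row IDs are still valid.
fn filter_old_index_rows(old_rows: &[u64], retained: &HashSet<u64>) -> Vec<u64> {
    old_rows
        .iter()
        .copied()
        .filter(|id| retained.contains(id))
        .collect()
}

fn main() {
    // Two retained fragments after compaction, holding opaque stable IDs.
    let retained = build_retained_row_id_set(&[vec![1, 2, 3], vec![10, 11]]);

    // Old index rows include stale IDs 4 and 12 from rewritten fragments.
    let kept = filter_old_index_rows(&[1, 4, 10, 12], &retained);
    assert_eq!(kept, vec![1, 10]);
}
```

Exact membership is the only safe test here, since nothing about a stable ID's bit pattern indicates which fragment owns it.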

Additional Context

I also wrote a longer deep dive covering the bug, the stable-row-ID model, and the full repair process: https://ackingliu.top/posts/lance-stable-row-id-deep-dive

Final Note

This PR description and parts of the implementation were prepared with assistance from Codex GPT-5.4. I have personally reviewed the final patch, and it has been running normally in my production StaticFlow environment for one week.

@github-actions
Contributor

github-actions Bot commented Mar 6, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@github-actions
Contributor

github-actions Bot commented Mar 6, 2026

PR Review

Summary: This PR fixes a real bug where optimize_indices() drops valid BTree entries for datasets using stable row IDs, because the existing filter incorrectly interprets opaque stable row IDs as address-encoded (fragment_id in upper 32 bits). The fix is correct and well-structured.

Design

The approach of introducing OldIndexDataFilter as an enum with Fragments vs RowIds variants is clean. It preserves the fast path for address-style row IDs and only activates the exact-membership path when needed. The trait signature change across all scalar index types is mechanical and correct — only BTree actually uses the filter.

One concern: deletion vectors not accounted for

In build_stable_row_id_filter, the retained row-ID set is built from the full row-id sequences of retained fragments, without subtracting rows that have been soft-deleted (deletion vectors). This means the allow-list may include IDs for deleted rows.

However, this matches the existing behavior of the fragment-based filter for address-style row IDs — filter_row_ids_by_fragments also doesn't exclude soft-deleted rows within valid fragments. So this is not a regression, but it may be worth a brief comment noting this parity.
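The deletion-vector concern can be made concrete with a small sketch. This is purely illustrative: the `HashSet<u64>` standing in for a deletion vector and both helper functions are hypothetical, not Lance's real types or APIs.

```rust
use std::collections::HashSet;

/// Current behavior: the allow-list is built from the full row-id
/// sequence, so soft-deleted rows remain in it.
fn allow_list_without_deletes(sequence: &[u64]) -> HashSet<u64> {
    sequence.iter().copied().collect()
}

/// Hypothetical variant that subtracts a deletion vector, here modeled
/// as a plain set of deleted stable row IDs.
fn allow_list_with_deletes(sequence: &[u64], deletion_vector: &HashSet<u64>) -> HashSet<u64> {
    sequence
        .iter()
        .copied()
        .filter(|id| !deletion_vector.contains(id))
        .collect()
}

fn main() {
    let seq = [1u64, 2, 3];
    let deleted: HashSet<u64> = [2].into_iter().collect();

    // Parity with the fragment-based path: soft-deleted ID 2 stays.
    assert!(allow_list_without_deletes(&seq).contains(&2));

    // Subtracting the deletion vector would drop it.
    assert!(!allow_list_with_deletes(&seq, &deleted).contains(&2));
}
```

Since the fragment-based path behaves the same way, keeping the deleted IDs in the allow-list preserves parity rather than introducing a regression.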

Tests

Good coverage: both a unit test (test_update_with_exact_row_id_filter) exercising the BTree filter path directly, and an end-to-end integration test (test_optimize_btree_keeps_rows_with_stable_row_ids_after_compaction) covering the full write → index → compact → optimize → query flow.

Minor nit

In build_stable_row_id_filter, the pattern of building an individual RowAddrTreeMap per fragment and then calling union_all allocates N intermediate maps. You could instead build the union incrementally in a loop:

let mut result = RowAddrTreeMap::new();
for (_, seq) in row_id_sequences {
    let map = RowAddrTreeMap::from(seq.as_ref());
    result = RowAddrTreeMap::union(&result, &map);
}

Not blocking — the current approach is clear and correct, just slightly more allocations than necessary.

Overall this looks good. The bug analysis is solid, the fix is targeted, and the tests are appropriate.

@acking-you acking-you changed the title Preserve stable row-id entries during scalar index optimize fix(index): preserve stable row-id entries during scalar index optimize Mar 6, 2026
@github-actions github-actions Bot added the bug Something isn't working label Mar 6, 2026
@acking-you
Contributor Author

@Xuanwo could you please help review this PR when you have time?

This bug is currently blocking my normal production use of LanceDB. In my case it causes indexed equality lookups to fail after compact_files() + optimize_indices() on datasets with stable row IDs enabled, even though the underlying rows are still present.

I also wrote up the full analysis and fix here:
https://ackingliu.top/posts/lance-stable-row-id-deep-dive

Thanks a lot.

Contributor

@wjones127 wjones127 left a comment


This fix looks correct to me, but I'd like the implementation to be refactored a bit so that each index implementation isn't exposed to the details of old- vs new-style IDs.

Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment on lines +1307 to +1316
// Two filtering strategies:
// - Fragments: fast path for address-style row IDs (fragment id is encoded in row_id)
// - RowIds: exact allow-list for stable row IDs (row_id bits are opaque)
let old_stream = match old_data_filter {
    Some(OldIndexDataFilter::Fragments(valid_frags)) => {
        filter_row_ids_by_fragments(old_stream, valid_frags)
    },
    Some(OldIndexDataFilter::RowIds(valid_row_ids)) => {
        filter_row_ids_by_exact_set(old_stream, valid_row_ids)
    },
Contributor


It would be nice if OldIndexDataFilter encapsulated this distinction within.

Maybe something like:

impl OldIndexDataFilter {
    pub fn filter_row_ids(&self, row_ids: &UInt64Array) -> BooleanArray {
        match self {
            Self::Fragments(frags) => ...,
            Self::RowIds(row_ids) => ...,
        }
    }
}

Contributor Author


Thanks, this is a good suggestion.

@codecov

codecov Bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 97.22222% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/index/append.rs 96.90% 1 Missing and 2 partials ⚠️
rust/lance-index/src/scalar/btree.rs 97.22% 1 Missing and 1 partial ⚠️


@Xuanwo Xuanwo merged commit 6adcee8 into lance-format:main Mar 9, 2026
29 checks passed
tomsanbear added a commit to tomsanbear/lance that referenced this pull request Apr 21, 2026
KMeansParams::seed (Option<u64>) lets callers pin the RNG used for
centroid initialization and empty-cluster splitting, replacing the
//-TODO-flagged unconditional `SmallRng::from_os_rng()`. When set,
training over the same data is reproducible — required by
integration tests of IVF-based vector indexes that assert on top-K
recall (without a stable seed those tests are inherently flaky
because KMeans converges to different local minima per OS-entropy
seed).

KMeans::to_kmeans gains a `&mut dyn rand::RngCore` parameter so the
caller's seeded RNG threads through to split_clusters (the only
internal randomness consumer beyond init). Default behavior (seed
== None) is unchanged: every call reseeds from OS entropy as
before.

IvfBuildParams gains a parallel `seed: Option<u64>` field that
propagates into the KMeansParams constructed by
do_train_ivf_model. derive_ivf_params (delta-index path) sets it
to None — delta indexes inherit centroids and don't retrain.

Also fixes a pre-existing rebase miss on
compound_btree.rs (4 test-binary call sites of `index.update`
missing the third `old_data_filter` argument added by upstream
PR lance-format#6117 after our compound-index work forked).

Tests:
- vector::kmeans::tests::test_seed_pins_kmeans_output (new) —
  two trainings with the same seed produce byte-identical
  centroids; different seeds diverge (sanity check that the seed
  actually drives the RNG).
- All 8 existing vector::kmeans tests still pass.

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] optimize_indices can drop valid BTree entries when stable row IDs are enabled

3 participants