Skip to content

feat: merge-insert with primary key dedupe#5633

Merged
jackye1995 merged 2 commits intolance-format:mainfrom
jackye1995:jack-primary-key-conflict-detection-draft
Jan 7, 2026
Merged

feat: merge-insert with primary key dedupe#5633
jackye1995 merged 2 commits intolance-format:mainfrom
jackye1995:jack-primary-key-conflict-detection-draft

Conversation

@jackye1995
Copy link
Copy Markdown
Contributor

@jackye1995 jackye1995 commented Jan 6, 2026

Based on #4787

Co-authored-by: vinoyang vinoyang@apache.org

@github-actions github-actions Bot added the python label Jan 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 6, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 6, 2026

Code Review: Bloom Filter Merge-Insert Conflict Detection

This PR adds bloom filter-based conflict detection for merge-insert operations, tracking keys of newly inserted rows to detect concurrent insert conflicts.

Summary

The implementation adds:

  • Protobuf definitions for KeyExistenceFilter (exact set or bloom filter)
  • KeyExistenceFilterBuilder and KeyExistenceFilter types for tracking inserted keys
  • Integration with Operation::Update to carry the filter
  • Conflict detection in check_update_txn using filter intersection

P0: Critical - Inconsistent Hashing Between Bloom Filter Helper and Main SBBF

File: rust/lance-index/src/scalar/bloomfilter/sbbf.rs:428-454

The newly added bloom_contains_hash() helper function uses a completely different hashing scheme than the existing Sbbf implementation:

// New helper (lines 428-447) - custom rolling seed hash
pub fn bloom_contains_hash(hash: u64, bitmap: &[u8], num_bits: u32) -> bool {
    let mut seed = 0x9e3779b97f4a7c15u64;
    for _i in 0..SBBF_NUM_HASHES {
        let pos = ((hash.wrapping_add(seed)) % m) as usize;
        // ... bit_test on raw bytes
        seed = seed.rotate_left(13) ^ 0x517cc1b727220a95u64;
    }
}

The existing Sbbf uses XxHash64 and the Parquet spec's block-based approach with specific salt values. These two approaches are not compatible - you cannot build a filter with Sbbf::insert() and then query it with bloom_contains_hash().

Looking at the actual usage in KeyExistenceFilter::intersects(), the code calls bloom_contains_hash() on hashes from the ExactSet against a Bloom bitmap that was built using the Sbbf implementation. This will produce incorrect results (both false negatives and random behavior).

Recommendation: Either:

  1. Add a proper contains_hash(hash: u64) method to the Sbbf struct that works with its internal representation, or
  2. Store key hashes computed the same way Sbbf computes them in the ExactSet

P1: Test Coverage Gap for Cross-Filter Type Intersection

File: rust/lance/src/dataset/write/merge_insert.rs (tests section)

The tests cover Bloom-to-Bloom intersection via concurrent merge inserts, but don't explicitly test:

  • ExactSet vs Bloom intersection paths (used when one transaction has few keys)
  • Serialization/deserialization roundtrip with ExactSet variant

Given the P0 issue above, these paths would be broken. Consider adding explicit unit tests for KeyExistenceFilter::intersects() covering all four type combinations.


Minor Observations (non-blocking)

  1. Dead code path: FilterType::ExactSet is defined but never constructed in the production code path - only KeyExistenceFilterBuilder::build() is used, which always produces a Bloom filter.

  2. Constant sizing: DEFAULT_NUMBER_OF_ITEMS = 8192 and DEFAULT_PROBABILITY = 0.00057 are fixed. For very small inserts (e.g., 1-10 rows), this over-allocates. For very large inserts, false positive rate may be higher than expected. Consider if dynamic sizing based on actual row count is desirable.

  3. Module visibility: The diff changes mod write to pub(crate) mod write in dataset.rs. Ensure this broader visibility is intentional and won't expose internal APIs unintentionally.


The P0 hashing inconsistency should be addressed before merge as it could cause incorrect conflict detection behavior (both missed conflicts and spurious retries).

Comment thread protos/transaction.proto
Comment thread rust/lance/src/dataset/write/merge_insert.rs
@jackye1995 jackye1995 force-pushed the jack-primary-key-conflict-detection-draft branch from 2ca9e9a to 0882e74 Compare January 6, 2026 07:45
Comment thread rust/lance/src/io/commit/conflict_resolver.rs Outdated
@jackye1995 jackye1995 force-pushed the jack-primary-key-conflict-detection-draft branch 3 times, most recently from 14fde36 to 70d8816 Compare January 7, 2026 02:15
@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 7, 2026

@jackye1995 jackye1995 force-pushed the jack-primary-key-conflict-detection-draft branch from 70d8816 to bfa5dc0 Compare January 7, 2026 05:19
@github-actions github-actions Bot added the java label Jan 7, 2026
@jackye1995 jackye1995 changed the title bloom filter merge-insert draft feat: merge-insert with primary key dedupe Jan 7, 2026
@github-actions github-actions Bot added the enhancement New feature or request label Jan 7, 2026
@jackye1995 jackye1995 force-pushed the jack-primary-key-conflict-detection-draft branch from bfa5dc0 to dd6bb27 Compare January 7, 2026 05:45
Copy link
Copy Markdown
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@jackye1995 jackye1995 merged commit 8dbcfd2 into lance-format:main Jan 7, 2026
28 of 31 checks passed
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Based on lance-format#4787

Co-authored-by: vinoyang <vinoyang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants