feat: implement Bloom Filter concurrent conflict detection for merge insert operations #4787

Closed
yanghua wants to merge 1 commit into lance-format:main from
yanghua:primary-key-conflict-detection

Conversation

@yanghua
Collaborator

@yanghua yanghua commented Sep 20, 2025

closes #4585

@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@yanghua
Collaborator Author

yanghua commented Sep 20, 2025

This PR is only for CI. Still WIP, not ready for review.

@yanghua yanghua force-pushed the primary-key-conflict-detection branch 2 times, most recently from a007dc9 to 5de6d10 Compare September 20, 2025 07:17
@yanghua yanghua force-pushed the primary-key-conflict-detection branch from c45c79c to 8d89628 Compare September 30, 2025 08:32
@yanghua yanghua force-pushed the primary-key-conflict-detection branch from 8d89628 to 0c3d5a8 Compare October 30, 2025 07:27
@codecov-commenter

codecov-commenter commented Oct 31, 2025

@yanghua yanghua force-pushed the primary-key-conflict-detection branch 5 times, most recently from 0090e35 to 2b50ea9 Compare November 7, 2025 06:55
@github-actions github-actions Bot added the java label Nov 7, 2025
@yanghua yanghua force-pushed the primary-key-conflict-detection branch 8 times, most recently from 9cb9802 to 438ace8 Compare November 12, 2025 06:17
@yanghua yanghua force-pushed the primary-key-conflict-detection branch 3 times, most recently from 26f66f8 to 53f62af Compare November 27, 2025 09:06
@yanghua yanghua marked this pull request as ready for review November 28, 2025 13:07
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@yanghua yanghua force-pushed the primary-key-conflict-detection branch 2 times, most recently from c3b92f6 to ed905f0 Compare December 13, 2025 09:10
@yanghua yanghua force-pushed the primary-key-conflict-detection branch from ed905f0 to 847e406 Compare December 13, 2025 09:13
@yanghua
Collaborator Author

yanghua commented Dec 26, 2025

cc @jackye1995 please take a look when you have time.

@yanghua yanghua changed the title from "[WIP] feat: implement Bloom Filter concurrent conflict detection for merge insert operations" to "feat: implement Bloom Filter concurrent conflict detection for merge insert operations" Jan 4, 2026
@github-actions github-actions Bot added the enhancement (New feature or request) label Jan 4, 2026
Comment thread protos/join_key.proto
// Join key metadata attached to a Transaction for conflict detection.
message JoinKeyMetadata {
  // Names of columns participating in the join key.
  repeated string columns = 1;
Contributor

This should be the column field ID rather than the column name.

Comment thread protos/join_key.proto
  // Total number of bits in the bitmap.
  uint32 bitmap_bits = 3;
  // Reserved for future fields to avoid reuse.
  reserved 4, 5;
Contributor

Why have all these reserved fields? I don't think they are needed until we actually want to add fields in the future.

Comment thread protos/join_key.proto

package lance.table;

// Value of the join key representation (reserved for future use)
Contributor

This is unused and should just be removed.

Comment thread protos/join_key.proto
}

// Join key metadata attached to a Transaction for conflict detection.
message JoinKeyMetadata {
Contributor

This name does not feel right to me. If we use this for an INSERT, there is not really a join key. What about just keyMetadata, or something like KeySet or RowKeySet?

Comment thread protos/transaction.proto
  map<string, string> transaction_properties = 4;

  // Join key metadata using typed protobuf message. This is the sole carrier.
  optional JoinKeyMetadata join_key_metadata = 6;
Contributor

I think this should not be in the transaction itself. It should be specific to the few write operations that need this detection, such as insert or merge_insert, which can add new data to the table.

    .expect("source stream exhausted while computing join key filter");

let join_key_metadata =
    compute_join_key_metadata_from_stream(first, &self.params.on).await?;
Contributor

This is not how I was expecting the feature to be implemented; please let me know what you think. In my mind, the Bloom filter or exact set should be computed similarly to how we compute affected_rows. The difference is that it covers newly added rows, while affected_rows covers rows whose join key (primary key) already exists in the dataset.
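
To make the suggestion above concrete, here is a minimal sketch of splitting source keys into those that match existing rows (the analogue of affected_rows) and a Bloom filter over keys that would be inserted as new rows. Everything in it is an assumption for illustration: `KeyBloom`, `split_keys`, the string key type, and the filter parameters are hypothetical and are not taken from the PR or from Lance's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical fixed-size Bloom filter over string keys (not Lance's implementation).
pub struct KeyBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl KeyBloom {
    pub fn new(num_bits: usize, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        Self { bits: vec![false; num_bits], num_hashes }
    }

    /// Derive the i-th probe position by hashing (i, key) together.
    fn position(&self, key: &str, i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        i.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    pub fn insert(&mut self, key: &str) {
        for i in 0..self.num_hashes {
            let p = self.position(key, i);
            self.bits[p] = true;
        }
    }

    /// Never returns false for an inserted key; may return true for a key
    /// that was never inserted (false positive).
    pub fn might_contain(&self, key: &str) -> bool {
        (0..self.num_hashes).all(|i| self.bits[self.position(key, i)])
    }
}

/// Split source keys into keys that already exist in the dataset and a
/// Bloom filter summarizing keys that would be written as new rows.
pub fn split_keys(existing: &HashSet<String>, source: &[&str]) -> (Vec<String>, KeyBloom) {
    let mut matched = Vec::new();
    // Filter size and hash count are arbitrary illustrative values.
    let mut new_keys = KeyBloom::new(1024, 3);
    for &key in source {
        if existing.contains(key) {
            matched.push(key.to_string());
        } else {
            new_keys.insert(key);
        }
    }
    (matched, new_keys)
}
```

In this sketch the filter is filled in the same pass that classifies each source row, which mirrors the idea of computing it alongside affected_rows rather than in a separate pass over the source stream.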

Contributor

I think that instead of adding a new conflict detector, we should just update the existing logic: in addition to affected_rows, the Update transaction model gets a new field, new_rows, that tracks the Bloom filter or exact set. The existing conflict resolution module should then be able to consume it the same way it consumes affected_rows.
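
A rough sketch of what this could look like, assuming a simplified Update-style operation that carries both affected_rows and new_rows. The field names come from the comment above; the types, `check_conflict`, and the `Conflict` enum are hypothetical and do not reflect Lance's actual transaction or conflict resolution code. An exact key set stands in for the Bloom filter here; a Bloom filter would answer the same "may overlap" question but could report false positives.

```rust
use std::collections::HashSet;

/// Hypothetical summary of the keys of newly inserted rows (exact-set variant).
#[derive(Debug, Default, Clone)]
pub struct NewRows {
    keys: HashSet<String>,
}

impl NewRows {
    pub fn from_keys<'a>(keys: impl IntoIterator<Item = &'a str>) -> Self {
        Self { keys: keys.into_iter().map(String::from).collect() }
    }

    /// True if both operations may have inserted the same key.
    pub fn may_overlap(&self, other: &NewRows) -> bool {
        !self.keys.is_disjoint(&other.keys)
    }
}

/// Hypothetical simplification of an Update-style operation.
pub struct UpdateOp {
    /// Identifiers of existing rows rewritten by this operation.
    pub affected_rows: HashSet<u64>,
    /// Keys of rows this operation inserts as new.
    pub new_rows: NewRows,
}

#[derive(Debug, PartialEq)]
pub enum Conflict {
    None,
    OverlappingAffectedRows,
    DuplicateNewKey,
}

/// Compare our pending operation against one that committed concurrently.
pub fn check_conflict(ours: &UpdateOp, committed: &UpdateOp) -> Conflict {
    // Existing check: both operations rewrote the same existing row.
    if !ours.affected_rows.is_disjoint(&committed.affected_rows) {
        return Conflict::OverlappingAffectedRows;
    }
    // Added check: both operations inserted a row with the same key.
    if ours.new_rows.may_overlap(&committed.new_rows) {
        return Conflict::DuplicateNewKey;
    }
    Conflict::None
}
```

The point of this shape is that the new_rows check sits next to the affected_rows check in a single code path, so no separate conflict detector is needed.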

jackye1995 added a commit that referenced this pull request Jan 7, 2026
Based on #4787

Co-authored-by: vinoyang <vinoyang@apache.org>
@jackye1995 jackye1995 closed this Jan 7, 2026
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Based on lance-format#4787

Co-authored-by: vinoyang <vinoyang@apache.org>

Labels

enhancement (New feature or request), java, python

Development

Successfully merging this pull request may close these issues.

MergeInsert produces duplicated rows

3 participants