feat: implement Bloom Filter concurrent conflict detection for merge insert operations#4787
feat: implement Bloom Filter concurrent conflict detection for merge insert operations#4787yanghua wants to merge 1 commit intolance-format:mainfrom
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
|
This PR is only for CI. Still WIP, not ready for review. |
a007dc9 to
5de6d10
Compare
c45c79c to
8d89628
Compare
8d89628 to
0c3d5a8
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
0090e35 to
2b50ea9
Compare
9cb9802 to
438ace8
Compare
26f66f8 to
53f62af
Compare
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
c3b92f6 to
ed905f0
Compare
…insert operations
ed905f0 to
847e406
Compare
|
cc @jackye1995 please take a look when you have time. |
| // Join key metadata attached to a Transaction for conflict detection. | ||
| message JoinKeyMetadata { | ||
| // Names of columns participating in the join key. | ||
| repeated string columns = 1; |
There was a problem hiding this comment.
this should be column field ID
| // Total number of bits in the bitmap. | ||
| uint32 bitmap_bits = 3; | ||
| // Reserved for future fields to avoid reuse. | ||
| reserved 4, 5; |
There was a problem hiding this comment.
why have all these reserved fields? I think they are not really needed until we want to add them in the future?
|
|
||
| package lance.table; | ||
|
|
||
| // Value of the join key representation (reserved for future use) |
There was a problem hiding this comment.
this is unused, should just remove
| } | ||
|
|
||
| // Join key metadata attached to a Transaction for conflict detection. | ||
| message JoinKeyMetadata { |
There was a problem hiding this comment.
this name does not feel right to me. Consider we use this for an INSERT, then there is not really a join key. What about just keyMetadata or something like KeySet or RowKeySet?
| map<string, string> transaction_properties = 4; | ||
|
|
||
| // Join key metadata using typed protobuf message. This is the sole carrier. | ||
| optional JoinKeyMetadata join_key_metadata = 6; |
There was a problem hiding this comment.
I think this should not be in the transaction itself, it should be unique to only a few write transactions to do detection for operations like insert or merge_insert that could add new data to the table?
| .expect("source stream exhausted while computing join key filter"); | ||
|
|
||
| let join_key_metadata = | ||
| compute_join_key_metadata_from_stream(first, &self.params.on).await?; |
There was a problem hiding this comment.
this is not what I was expecting to implement the feature, please let me know what you think. In my mind, the bloom filter or exact set should be computed similar to how we compute the affected_rows, it's just this is for newly added rows, and affected_rows are for rows where the join key (primary key) already exists in the dataset.
There was a problem hiding this comment.
I think instead of having a new conflict detector, we should just update the existing logic, so that instead of just affected_rows, we have a new field new_rows in the Update transaction model that tracks the bloom filter or exact set, and then the existing conflict resolution module should be able to just consume it like how it consumes affected_rows
Based on #4787 Co-authored-by: vinoyang <vinoyang@apache.org>
Based on lance-format#4787 Co-authored-by: vinoyang <vinoyang@apache.org>
closes #4585