feat: support merge_insert with source dedupe on first seen value (#5603)
Conversation
Code Review

Summary: This PR adds source-side deduplication for merge_insert, keeping the first value seen for duplicated `on` keys.

P1: Performance Issue in Duplicate Filtering

```rust
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_indices.contains(&i)))
    .collect();
```

This is O(n*m), where n is the batch size and m is the number of duplicates. For batches with many duplicates, this becomes expensive. Consider using a `HashSet`:

```rust
let duplicate_set: std::collections::HashSet<_> = duplicate_indices.iter().collect();
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_set.contains(&i)))
    .collect();
```

Other Observations

Overall this is a straightforward and well-tested feature addition. The main actionable item is the performance concern above.
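The complexity argument above can be demonstrated standalone. The sketch below uses a plain `Vec<bool>` in place of Arrow's `BooleanArray`, and the function name `build_keep_mask` is hypothetical; it only illustrates the suggested set-based lookup, not the actual PR code.

```rust
use std::collections::HashSet;

// Hypothetical standalone version of the reviewer's suggestion:
// build the set once (O(m)), then each membership check is O(1) average,
// instead of scanning the duplicate list per row (O(n*m) overall).
fn build_keep_mask(num_rows: usize, duplicate_indices: &[usize]) -> Vec<bool> {
    let duplicate_set: HashSet<usize> = duplicate_indices.iter().copied().collect();
    (0..num_rows).map(|i| !duplicate_set.contains(&i)).collect()
}

fn main() {
    // Rows 1 and 3 are duplicates and should be filtered out.
    let mask = build_keep_mask(5, &[1, 3]);
    assert_eq!(mask, vec![true, false, true, false, true]);
    println!("{:?}", mask);
}
```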
Codecov Report: ❌ Patch coverage is
Force-pushed 4e97413 to 8561b43
```rust
let row_ids = matched.column(row_id_col).as_primitive::<UInt64Type>();

let mut processed_row_ids = self.processed_row_ids.lock().unwrap();
let mut duplicate_indices = Vec::new();
```
It would be more efficient to just track `keep_indices`; then later you can directly call `take` on the record batch with those indices. Then you don't need to construct the hash set or boolean mask.
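The `keep_indices` approach can be sketched in isolation. The names `first_seen_keep_indices` and the plain `HashSet<u64>` standing in for the mutex-guarded `processed_row_ids` are assumptions for illustration; the real code would pass the resulting indices to Arrow's `take` kernel on the `RecordBatch`.

```rust
use std::collections::HashSet;

// Hypothetical sketch: collect the indices of rows whose key is seen for
// the first time, so the batch can be filtered with a single `take`-style
// selection instead of building a boolean mask.
fn first_seen_keep_indices(keys: &[u64], seen: &mut HashSet<u64>) -> Vec<usize> {
    let mut keep = Vec::new();
    for (i, k) in keys.iter().enumerate() {
        // HashSet::insert returns true only for the first occurrence.
        if seen.insert(*k) {
            keep.push(i);
        }
    }
    keep
}

fn main() {
    let mut seen = HashSet::new();
    // First batch: the second 10 and second 20 are duplicates.
    let keep = first_seen_keep_indices(&[10, 20, 10, 30, 20], &mut seen);
    assert_eq!(keep, vec![0, 1, 3]);
    // Second batch: 20 was already seen in the previous batch.
    let keep2 = first_seen_keep_indices(&[20, 40], &mut seen);
    assert_eq!(keep2, vec![1]);
}
```

Because `seen` outlives a single call, the same state carries across batches, mirroring how a shared `processed_row_ids` set would behave in a streaming merge.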
Based on feedback in #5582

Simplified implementation that just keeps the first value in case of duplicated `on` rows in the source during merge insert. Users are expected to sort the source properly before using merge-insert.
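The "first seen wins" semantics described above can be illustrated with a minimal sketch. The function `dedupe_first_seen` and the `(key, value)` tuple representation are assumptions for illustration only; the actual feature operates on Arrow record batches keyed by the `on` columns.

```rust
use std::collections::HashSet;

// Hypothetical illustration of the dedupe rule: for duplicated `on` keys
// in the source, only the first row encountered survives.
fn dedupe_first_seen<'a>(rows: &[(u64, &'a str)]) -> Vec<(u64, &'a str)> {
    let mut seen = HashSet::new();
    rows.iter()
        // insert() returns true on first occurrence, so later
        // duplicates of the same key are dropped.
        .filter(|&&(key, _)| seen.insert(key))
        .copied()
        .collect()
}

fn main() {
    // Key 1 appears twice; the first value "a" is kept, "c" is dropped.
    let src = [(1u64, "a"), (2, "b"), (1, "c")];
    assert_eq!(dedupe_first_seen(&src), vec![(1, "a"), (2, "b")]);
}
```

This is also why the source ordering matters: since the first occurrence wins, users who care about which duplicate survives must sort the source before running merge-insert.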