Closed
Description
Performance of joins slows down dramatically with smaller batches.
The issue is related to slow performance of MutableArrayData::new() when it is passed a large number of arrays. The join passes in all of the batches from the build side, and it does this once per build-side join key for each probe-side batch.
It gets much slower as the number of arrays increases, even though the total number of rows is the same: in the timings below, the cost of each call grows at least linearly with the array count.
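To make the call pattern concrete, here is a minimal sketch using only the standard library. `ToyMutable`, `split_into_batches`, `total_setup_steps` and `copy_all` are hypothetical names for illustration, not Arrow or DataFusion APIs, and the sketch assumes the constructor does a fixed amount of setup per input array, which is consistent with the timings in this issue but is not taken from the Arrow source.

```rust
/// Toy stand-in for a builder that does setup work per input array.
/// Hypothetical: this is NOT the Arrow `MutableArrayData` type.
struct ToyMutable<'a> {
    /// One entry per source array, created eagerly in `new`.
    sources: Vec<&'a [i32]>,
    out: Vec<i32>,
}

impl<'a> ToyMutable<'a> {
    /// Setup cost is proportional to `arrays.len()`, not to the row count.
    fn new(arrays: &'a [Vec<i32>], capacity: usize) -> Self {
        let sources = arrays.iter().map(|a| a.as_slice()).collect();
        ToyMutable { sources, out: Vec::with_capacity(capacity) }
    }

    /// Copy `len` rows starting at `start` from source array `index`.
    fn extend(&mut self, index: usize, start: usize, len: usize) {
        self.out.extend_from_slice(&self.sources[index][start..start + len]);
    }

    fn setup_steps(&self) -> usize {
        self.sources.len()
    }
}

/// Split the same logical rows into batches of at most `batch_size` rows.
fn split_into_batches(total_rows: usize, batch_size: usize) -> Vec<Vec<i32>> {
    let rows: Vec<i32> = (0..total_rows as i32).collect();
    rows.chunks(batch_size).map(|c| c.to_vec()).collect()
}

/// Count setup steps when the builder is constructed once per build-side
/// join key for each probe-side batch, as described in this issue.
fn total_setup_steps(build: &[Vec<i32>], build_keys: usize, probe_batches: usize) -> usize {
    let mut steps = 0;
    for _ in 0..probe_batches {
        for _ in 0..build_keys {
            let mutable = ToyMutable::new(build, 0);
            steps += mutable.setup_steps();
        }
    }
    steps
}

/// Copy every row through the builder; the result does not depend on chunking.
fn copy_all(build: &[Vec<i32>]) -> Vec<i32> {
    let rows: usize = build.iter().map(|b| b.len()).sum();
    let mut mutable = ToyMutable::new(build, rows);
    for (i, batch) in build.iter().enumerate() {
        mutable.extend(i, 0, batch.len());
    }
    mutable.out
}

fn main() {
    let total_rows = 1_000;
    for &batch_size in &[500usize, 100, 10] {
        let build = split_into_batches(total_rows, batch_size);
        let steps = total_setup_steps(&build, 2, 3);
        println!(
            "batch size {:>3}: {:>3} arrays, {} rows copied, {} setup steps",
            batch_size,
            build.len(),
            copy_all(&build).len(),
            steps
        );
    }
}
```

In this model the rows copied stay constant while the setup work scales with (number of build-side arrays) x (join keys) x (probe-side batches), so shrinking the batch size multiplies the overhead without changing the useful work.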
I modified hash_join.rs to have this debug code:
let start = Instant::now();
let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
let num_arrays = arrays.len();
let mut mutable = MutableArrayData::new(arrays, true, capacity);
if num_arrays > 0 {
    debug!(
        "MutableArrayData::new() with {} arrays containing {} rows took {} ms",
        num_arrays, row_count, start.elapsed().as_millis()
    );
}

Batch size 131072:
MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms

Batch size 16384:
MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms

Batch size 4096:
MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
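
As a rough check on how the cost scales, the logged numbers can be turned into a per-array cost and into growth ratios between runs. This sketch takes the median of the three logged timings for each batch size (1, 17 and 88 ms); `RUNS`, `ns_per_array` and `growth` are names made up here. Because `as_millis()` reports whole milliseconds, the 1 ms figure is coarse and any ratio involving it should be read loosely.

```rust
/// (batch size, number of arrays, median logged time in ms), taken from
/// the log output in this issue.
const RUNS: [(usize, usize, f64); 3] = [
    (131072, 4584, 1.0),
    (16384, 36624, 17.0),
    (4096, 146496, 88.0),
];

/// Average constructor cost per input array, in nanoseconds.
fn ns_per_array(arrays: usize, ms: f64) -> f64 {
    ms * 1_000_000.0 / arrays as f64
}

/// Growth between two runs: (array-count ratio, time ratio).
fn growth(prev: (usize, usize, f64), next: (usize, usize, f64)) -> (f64, f64) {
    (next.1 as f64 / prev.1 as f64, next.2 / prev.2)
}

fn main() {
    for run in RUNS.iter() {
        println!(
            "batch size {:>6}: {:>6} arrays, {:>4} ms, ~{:.0} ns per array",
            run.0,
            run.1,
            run.2,
            ns_per_array(run.1, run.2)
        );
    }
    for pair in RUNS.windows(2) {
        let (arrays_ratio, time_ratio) = growth(pair[0], pair[1]);
        println!(
            "{} -> {}: {:.1}x arrays, {:.1}x time",
            pair[0].0, pair[1].0, arrays_ratio, time_ratio
        );
    }
}
```

Between batch sizes 16384 and 4096 the array count grows 4x and the time about 5.2x, so each call costs somewhat more than linearly in the number of arrays, and the join pays that cost once per build-side join key for every probe-side batch.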
Reporter: Andy Grove / @andygrove
Assignee: Daniël Heres / @Dandandan
PRs and other links:
Note: This issue was originally created as ARROW-11030. Please see the migration documentation for further details.