[Rust] [DataFusion] HashJoinExec slow with many batches #26949

Performance of joins slows down dramatically with smaller batches.

The issue is related to the slow performance of MutableArrayData::new() when it is passed a large number of batches. All of the batches from the build side of the join are passed in, and this happens once per build-side join key for each probe-side batch.
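To make the scaling concrete, here is a minimal sketch of a toy cost model. It is not Arrow's implementation: it simply assumes each constructor call does one unit of setup work per build-side array, and that the call is repeated for every key column and every probe-side batch. The function names and the row counts in the usage note are hypothetical.

```rust
/// Number of batches needed to hold `rows` rows at `batch_size` rows per batch.
/// `batch_size` must be greater than zero.
fn num_batches(rows: usize, batch_size: usize) -> usize {
    (rows + batch_size - 1) / batch_size
}

/// Toy cost model (an assumption, not Arrow's actual implementation):
/// each constructor call does one unit of setup work per build-side array,
/// and the constructor runs once per key column for every probe-side batch.
fn total_setup_units(build_arrays: usize, key_columns: usize, probe_batches: usize) -> usize {
    build_arrays * key_columns * probe_batches
}

/// Setup units for a join where both sides are split at the same batch size.
fn setup_units_for(
    build_rows: usize,
    probe_rows: usize,
    key_columns: usize,
    batch_size: usize,
) -> usize {
    let build_arrays = num_batches(build_rows, batch_size);
    let probe_batches = num_batches(probe_rows, batch_size);
    total_setup_units(build_arrays, key_columns, probe_batches)
}
```

Under this model, with a hypothetical 1,000,000 rows on each side and one key column, setup_units_for(1_000_000, 1_000_000, 1, 100_000) gives 100 units, while a batch size of 10_000 gives 10,000. Shrinking the batch size by 10x multiplies the setup work by 100x, because both the number of build-side arrays and the number of probe-side batches grow.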

It seems to get disproportionately slower as the number of arrays increases, even though the total number of rows is the same.

I modified hash_join.rs to have this debug code:

use std::time::Instant;

// Time the construction of MutableArrayData over all build-side arrays.
let start = Instant::now();
let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
let num_arrays = arrays.len();
let mut mutable = MutableArrayData::new(arrays, true, capacity);
if num_arrays > 0 {
    debug!(
        "MutableArrayData::new() with {} arrays containing {} rows took {} ms",
        num_arrays, row_count, start.elapsed().as_millis()
    );
}

Batch size 131072:

MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 

Batch size 16384:

MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 

Batch size 4096:

MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
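
As a rough reading of the numbers above, the following sketch computes the average time per array for each batch size. It uses the array counts from the log lines and the middle of the three timings in each group. The names are hypothetical, and the 1 ms figure is at the resolution limit of the measurement, so the first result is only a coarse estimate.

```rust
/// (batch size, number of arrays, typical elapsed ms) taken from the log
/// lines in this report; the ms value is the middle of the three samples.
const REPORTED: [(usize, usize, u64); 3] = [
    (131_072, 4_584, 1),
    (16_384, 36_624, 17),
    (4_096, 146_496, 88),
];

/// Average nanoseconds spent per array in one constructor call
/// (integer division, so the result is rounded down).
fn nanos_per_array(num_arrays: usize, elapsed_ms: u64) -> u64 {
    elapsed_ms * 1_000_000 / num_arrays as u64
}

/// Pairs of (batch size, average nanoseconds per array) for each reported run.
fn per_array_costs() -> Vec<(usize, u64)> {
    REPORTED
        .iter()
        .map(|&(batch_size, arrays, ms)| (batch_size, nanos_per_array(arrays, ms)))
        .collect()
}
```

This works out to roughly 218 ns, 464 ns, and 600 ns per array for batch sizes 131072, 16384, and 4096 respectively. The cost per call therefore grows somewhat faster than the number of arrays, while the number of rows stays constant.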

Reporter: Andy Grove / @andygrove
Assignee: Daniël Heres / @Dandandan

PRs and other links:

Note: This issue was originally created as ARROW-11030. Please see the migration documentation for further details.
