Change back SmallVec to Vec for JoinHashMap - Issue 5940 #5941

yahoNanJing · 2023-04-10T11:23:12Z

Which issue does this PR close?

Closes #5940.

Rationale for this change

After applying the patches from #5866 and #5904, we generated a flame graph for q17 with

    sudo CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --freq 500 --bin tpch -- benchmark datafusion --path ./data-parquet/ --format parquet --partitions 1 -q 17 --iterations 10

and found that the bottleneck is in `physical_plan::joins::hash_join::update_hash`.

What changes are included in this PR?

Change back SmallVec to Vec for JoinHashMap.

After applying the patch proposed in this PR, the flame graph shows significantly fewer samples for vector capacity changes. This is expected, especially when joining one big table with a relatively small one.

The end-to-end query time for q17 improves from 7000ms to 6500ms.
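
As a rough illustration of the structure being discussed, the sketch below maps each hash value to a `Vec<u64>` of row indices and counts how often a growing `Vec` changes capacity. It is a minimal sketch under assumptions: `build_row_index` and `capacity_changes` are hypothetical names, and std's `HashMap` stands in for the table that `JoinHashMap` actually wraps.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the build side of a hash join: maps each hash
/// value to the indices of the rows that produced it. This is not
/// DataFusion's `update_hash`; it only mirrors its general shape.
pub fn build_row_index(hash_values: &[u64], offset: usize) -> HashMap<u64, Vec<u64>> {
    let mut map: HashMap<u64, Vec<u64>> = HashMap::new();
    let row_end = offset + hash_values.len();
    for (row, hash_value) in (offset..row_end).zip(hash_values.iter()) {
        // Rows sharing a hash value are appended to the same vector.
        map.entry(*hash_value).or_default().push(row as u64);
    }
    map
}

/// Counts how many times a `Vec<u64>` changes capacity while `n` values
/// are pushed one at a time. Each change corresponds to a (re)allocation.
pub fn capacity_changes(n: usize) -> usize {
    let mut v: Vec<u64> = Vec::new();
    let mut cap = v.capacity();
    let mut changes = 0;
    for i in 0..n {
        v.push(i as u64);
        if v.capacity() != cap {
            cap = v.capacity();
            changes += 1;
        }
    }
    changes
}
```

Keys with many matching rows make the per-key vector grow repeatedly, which is the kind of work that shows up as capacity-change samples in a flame graph. The exact growth pattern depends on the container and is not something this sketch measures for `SmallVec`.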

Are these changes tested?

Are there any user-facing changes?

yahoNanJing · 2023-04-10T11:34:30Z

Hi @Dandandan and @alamb, could you help review this PR?

Dandandan · 2023-04-10T15:09:52Z

datafusion/core/src/physical_plan/joins/hash_join.rs

    })? / 7)
        .next_power_of_two();
-    // 32 bytes per `(u64, SmallVec<[u64; 1]>)`
+    // 32 bytes per `(u64, Vec<[u64; 1]>)`


This is probably a trade-off: joins with unique keys will probably get slower.
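
For context on the size comment in the diff, the sketch below checks the size of one map entry and shows one way a bucket count could be estimated. It assumes the intended entry type is `(u64, Vec<u64>)`. The names `entry_size` and `estimated_buckets` are hypothetical, and the `* 8` factor is an assumption, since only the `/ 7` and `.next_power_of_two()` parts are visible in the diff.

```rust
use std::mem::size_of;

/// Size in bytes of one hypothetical map entry: a `u64` hash plus a
/// `Vec<u64>` of row indices. On typical 64-bit targets a `Vec` is three
/// machine words (24 bytes), so the tuple comes to 32 bytes.
pub fn entry_size() -> usize {
    size_of::<(u64, Vec<u64>)>()
}

/// Rough bucket estimate for `num_rows` entries.
/// Assumption: rows are scaled by 8/7 and rounded up to a power of two.
/// Returns `None` if the multiplication overflows.
pub fn estimated_buckets(num_rows: usize) -> Option<usize> {
    Some((num_rows.checked_mul(8)? / 7).next_power_of_two())
}
```

With these assumptions, 100 rows would be estimated at 128 buckets. When every key is unique, each entry holds exactly one row index, and that is the case the trade-off comment is concerned about.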

alamb · 2023-04-10T16:58:39Z

There is a CI check that appears to be failing: https://github.com/apache/arrow-datafusion/actions/runs/4657156069/jobs/8241558786?pr=5941

Given that we have a benchmark that improves (q17) I think we should merge this PR, assuming we can resolve the CI failure satisfactorily.

Thank you for this @yahoNanJing

Dandandan · 2023-04-10T18:54:14Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+    let row_end = offset + hash_values.len();
+    for (row, hash_value) in (row_start..row_end).zip(hash_values.iter()) {
+        // the hash value is the key, always true
+        let item = hash_map.0.get_mut(*hash_value, |_| true);


The equality check might still be necessary.

yahoNanJing · 2023-04-11

Thanks @Dandandan for pointing it out. You are right, my bad. The `RawTable` uses open addressing when looking for matching entries. I'll revert this commit.
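
The point about open addressing can be illustrated with a toy table. This is a simplified sketch, not hashbrown's `RawTable`: `ToyTable` is a hypothetical type that uses plain linear probing, whereas the real table probes differently. The idea it shows is the same: a lookup walks a probe sequence that can contain entries with other hash values, so an `eq` closure that always returns `true` may accept the wrong entry.

```rust
/// Toy open-addressing table with linear probing (hypothetical, for
/// illustration only). Each slot stores a hash and its row indices.
pub struct ToyTable {
    slots: Vec<Option<(u64, Vec<u64>)>>,
}

impl ToyTable {
    pub fn with_slots(n: usize) -> Self {
        assert!(n > 0, "table needs at least one slot");
        ToyTable { slots: vec![None; n] }
    }

    /// Inserts into the first free slot of the probe sequence.
    /// Returns `false` if the table is full.
    pub fn insert(&mut self, hash: u64, rows: Vec<u64>) -> bool {
        let n = self.slots.len();
        let start = (hash % n as u64) as usize;
        for i in 0..n {
            let idx = (start + i) % n;
            if self.slots[idx].is_none() {
                self.slots[idx] = Some((hash, rows));
                return true;
            }
        }
        false
    }

    /// Walks the probe sequence for `hash` and returns the first occupied
    /// entry accepted by `eq`, stopping at the first empty slot.
    pub fn get(
        &self,
        hash: u64,
        eq: impl Fn(&(u64, Vec<u64>)) -> bool,
    ) -> Option<&(u64, Vec<u64>)> {
        let n = self.slots.len();
        let start = (hash % n as u64) as usize;
        for i in 0..n {
            match &self.slots[(start + i) % n] {
                None => return None,
                Some(entry) if eq(entry) => return Some(entry),
                Some(_) => {}
            }
        }
        None
    }
}
```

With 8 slots, hashes 1 and 9 start probing at the same slot. After inserting both, `get(9, |_| true)` returns the entry stored for hash 1, while `get(9, |e| e.0 == 9)` returns the entry for hash 9.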

Dandandan · 2023-04-10T18:55:02Z

It would be nice to show that we are not getting regressions for other queries.

yahoNanJing · 2023-04-11T06:14:44Z

Thanks @Dandandan and @alamb. I tested this PR with other TPC-H queries, and it does not always bring a performance benefit. I think we can put this PR on hold.

alamb · 2023-04-11T10:42:23Z

Marking as draft so we don't accidentally merge it.

Dandandan · 2024-01-19T13:15:15Z

We have already improved the join data structure in other ways, so I am closing this PR to clean up.

yangzhong added 2 commits April 10, 2023 19:14

Remove unnecessary equality check for JoinHashMap

fd9f2e5

Change back SmallVec to Vec for JoinHashMap

1ca9546

github-actions bot added the core Core DataFusion crate label Apr 10, 2023

Fix Cargo check

6e9d776

Dandandan reviewed Apr 10, 2023

alamb changed the title ~~Issue 5940~~ Remove unnecessary equality check for JoinHashMap - Issue 5940 Apr 10, 2023

Dandandan reviewed Apr 10, 2023

Add back equality check for JoinHashMap

0ea3c31

yahoNanJing changed the title ~~Remove unnecessary equality check for JoinHashMap - Issue 5940~~ Change back SmallVec to Vec for JoinHashMap - Issue 5940 Apr 11, 2023

alamb marked this pull request as draft April 11, 2023 10:42

Dandandan closed this Jan 19, 2024
