Skip to content

Conversation

@yahoNanJing
Copy link
Contributor

@yahoNanJing yahoNanJing commented Apr 10, 2023

Which issue does this PR close?

Closes #5940.

Rationale for this change

After applying patches based on #5866 and #5904, based on the flame graph for q17 generated by
sudo CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --freq 500 --bin tpch -- benchmark datafusion --path ./data-parquet/ --format parquet --partitions 1 -q 17 --iterations 10

flamegraph-5941-before

we found the bottleneck happens at ### physical_plan::joins::hash_join::update_hash.

What changes are included in this PR?

  1. Change back SmallVec to Vec for JoinHashMap.

After applying the patch proposed from this PR, the flame graph becomes
flamegraph-5941-after

We can see the samples for the vector capacity changing significantly decreased. It's expected especially when doing joining with one big table and the other is relatively small.

To end to end query performance for q17 is improved from 7000ms to 6500ms.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Apr 10, 2023
@yahoNanJing
Copy link
Contributor Author

Hi @Dandandan and @alamb, could you help review this PR?

})? / 7)
.next_power_of_two();
// 32 bytes per `(u64, SmallVec<[u64; 1]>)`
// 32 bytes per `(u64, Vec<[u64; 1]>)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably is a trade-off, joins with unique keys will probably get slower.

@alamb alamb changed the title Issue 5940 Remove unnecessary equality check for JoinHashMap - Issue 5940 Apr 10, 2023
@alamb
Copy link
Contributor

alamb commented Apr 10, 2023

There is a CI check https://github.com/apache/arrow-datafusion/actions/runs/4657156069/jobs/8241558786?pr=5941 appears to be failing

Given that we have a benchmark that improves (q17) I think we should merge this PR, assuming we can resolve the CI failure satisfactorily.

Thank you for this @yahoNanJing

let row_end = offset + hash_values.len();
for (row, hash_value) in (row_start..row_end).zip(hash_values.iter()) {
// the hash value is the key, always true
let item = hash_map.0.get_mut(*hash_value, |_| true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equality check might still be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dandandan for pointing it out. You are right. It's my bad. The RawTable utilizes the open addressing policy for looking for matched entries. I'll revert this commit.

@Dandandan
Copy link
Contributor

It would be nice to show whether we are not getting regressions for other queries.

@yahoNanJing yahoNanJing changed the title Remove unnecessary equality check for JoinHashMap - Issue 5940 Change back SmallVec to Vec for JoinHashMap - Issue 5940 Apr 11, 2023
@yahoNanJing
Copy link
Contributor Author

Thanks @Dandandan and @alamb. I tested this PR with other sqls from TPCH. It seems it's not always true of bringing performance benefits. I think we can hold on this PR.

@alamb alamb marked this pull request as draft April 11, 2023 10:42
@alamb
Copy link
Contributor

alamb commented Apr 11, 2023

Marking to draft so we don't accidentally merge it

@Dandandan
Copy link
Contributor

We have improved the datastructure of join in other ways already, so closing the PR in order to clean it up

@Dandandan Dandandan closed this Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Change back SmallVec to Vec for JoinHashMap

3 participants