-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Change back SmallVec to Vec for JoinHashMap - Issue 5940 #5941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @Dandandan and @alamb, could you help review this PR? |
| })? / 7) | ||
| .next_power_of_two(); | ||
| // 32 bytes per `(u64, SmallVec<[u64; 1]>)` | ||
| // 32 bytes per `(u64, Vec<[u64; 1]>)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably is a trade-off, joins with unique keys will probably get slower.
|
There is a CI check https://github.com/apache/arrow-datafusion/actions/runs/4657156069/jobs/8241558786?pr=5941 appears to be failing Given that we have a benchmark that improves (q17) I think we should merge this PR, assuming we can resolve the CI failure satisfactorily. Thank you for this @yahoNanJing |
| let row_end = offset + hash_values.len(); | ||
| for (row, hash_value) in (row_start..row_end).zip(hash_values.iter()) { | ||
| // the hash value is the key, always true | ||
| let item = hash_map.0.get_mut(*hash_value, |_| true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The equality check might still be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Dandandan for pointing it out. You are right. It's my bad. The RawTable utilizes the open addressing policy for looking for matched entries. I'll revert this commit.
|
It would be nice to show whether we are not getting regressions for other queries. |
|
Thanks @Dandandan and @alamb. I tested this PR with other sqls from TPCH. It seems it's not always true of bringing performance benefits. I think we can hold on this PR. |
|
Marking to draft so we don't accidentally merge it |
|
We have improved the datastructure of join in other ways already, so closing the PR in order to clean it up |
Which issue does this PR close?
Closes #5940.
Rationale for this change
After applying patches based on #5866 and #5904, based on the flame graph for q17 generated by
sudo CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --freq 500 --bin tpch -- benchmark datafusion --path ./data-parquet/ --format parquet --partitions 1 -q 17 --iterations 10we found the bottleneck happens at ### physical_plan::joins::hash_join::update_hash.
What changes are included in this PR?
After applying the patch proposed from this PR, the flame graph becomes

We can see the samples for the vector capacity changing significantly decreased. It's expected especially when doing joining with one big table and the other is relatively small.
To end to end query performance for q17 is improved from 7000ms to 6500ms.
Are these changes tested?
Are there any user-facing changes?