-
-
Notifications
You must be signed in to change notification settings - Fork 27
Implement HashJoinP2P for merge #272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
rjzamora
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @phofl !
| from dask_expr._shuffle import Shuffle, _contains_index_name | ||
| from dask_expr._shuffle import AssignPartitioningIndex, Shuffle, _contains_index_name | ||
|
|
||
| _HASH_COLUMN_NAME = "__hash_partition" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it this is a "magic" string used by distributed's p2p merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
| self.how, | ||
| self.left_on, | ||
| self.right_on, | ||
| self.left._meta.drop(columns=_HASH_COLUMN_NAME), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love how the p2p merge depends on _HASH_COLUMN_NAME and drops this column, but isn't responsible for generating this column. Unfortunately, the main issue is that merge_unpack drops the shuffle column, so there is not much we can do here to make the algorithm more intuitive.
To clarify, it would be much clearer if we could just lower the merge to "add partitioning columns" + "p2p merge" + "drop partitioning column"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I mostly agree but I fixed something else there today. This is on my medium term todo list
|
Merging this |
The memory footprint of the blockwise version was terrible