-
-
Notifications
You must be signed in to change notification settings - Fork 26
Reuse identical shuffles in P2P hash join #361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dask_expr/_merge.py
Outdated
| token_left = token + "-left" | ||
| token_right = token + "-right" | ||
| token_left = _tokenize_deterministic( | ||
| self.left._name, self.shuffle_left_on, self.npartitions, self._partitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume how doesn't matter in the shuffle algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that will only become relevant in the actual merge, which happens after we shuffled all the data.
| token_left = token + "-left" | ||
| token_right = token + "-right" | ||
| token_left = _tokenize_deterministic( | ||
| "hash-join", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this to ensure that we avoid any unwanted sharing between merges and shuffles. I don't have a good-enough overview right now to make sharing between them work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can rely on name_input_{left/right} to truly be unique and identify distinct input dataframes, I believe we can even share between merges and ordinary shuffles. I wonder where this use case would come up, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern here is that shuffle_transfer and merge_transfer do different things. We'd have to refactor this so that they are in fact interchangeable and shareable. I guess there are only few cases where this would come in handy though. Maybe something like joining a dataframe on x and aggregating that same dataframe on x within the same graph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they do slightly different things at the moment, because the merge layer caused all kinds of troubles before that. This is on my todo list to align more closely.
|
thx @hendrikmakait |
IFF a shuffle has the exact same input and parameters, P2P hash join should reuse it.
Follow-up to #360