Reuse identical shuffles in P2P hash join #361

hendrikmakait · 2023-10-26T09:54:04Z

IFF a shuffle has the exact same input and parameters, P2P hash join should reuse it.

Follow-up to #360

phofl · 2023-10-26T09:55:38Z

dask_expr/_merge.py

-        token_left = token + "-left"
-        token_right = token + "-right"
+        token_left = _tokenize_deterministic(
+            self.left._name, self.shuffle_left_on, self.npartitions, self._partitions


I assume the how argument doesn't matter in the shuffle algorithm?

Yes, that only becomes relevant in the actual merge, which happens after we have shuffled all the data.
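
To illustrate the point above, here is a minimal sketch of how a shuffle token that ignores how lets two joins share one shuffle. It does not use the real _tokenize_deterministic; sketch_token and shuffle_token are hypothetical helpers built on hashlib, and the input name "read_parquet-abc" is made up for the example.

```python
import hashlib
import json


def sketch_token(*parts):
    """Hypothetical stand-in for a deterministic tokenizer (not the dask one)."""
    payload = json.dumps(parts, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode()).hexdigest()[:16]


def shuffle_token(input_name, shuffle_on, npartitions, partitions=None):
    # `how` is deliberately not an argument: it only matters in the merge
    # step that runs after the data has been shuffled.
    return sketch_token(input_name, shuffle_on, npartitions, partitions)


# Same input and same shuffle parameters, as they would appear in an inner
# and an outer join: both calls yield the same token, so the shuffle is shared.
inner_left = shuffle_token("read_parquet-abc", ["x"], 8)
outer_left = shuffle_token("read_parquet-abc", ["x"], 8)

# Shuffling the same input on a different column yields a different token.
other_left = shuffle_token("read_parquet-abc", ["y"], 8)

print(inner_left == outer_left)
print(inner_left == other_left)
```

Any change to the input name, the shuffle column, or the partition count changes the token, which is what keeps reuse limited to exactly identical shuffles.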

hendrikmakait · 2023-10-26T11:13:47Z

dask_expr/_merge.py

-        token_left = token + "-left"
-        token_right = token + "-right"
+        token_left = _tokenize_deterministic(
+            "hash-join",


Added this to ensure that we avoid any unwanted sharing between merges and shuffles. I don't have a good-enough overview right now to make sharing between them work.
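
A rough sketch of why the extra "hash-join" element keeps merges and ordinary shuffles apart, even when everything else matches. The tokenizer here is a hypothetical hashlib-based stand-in, not _tokenize_deterministic, and plain_shuffle_token and hash_join_token are names invented for this example.

```python
import hashlib
import json


def sketch_token(*parts):
    """Hypothetical stand-in for a deterministic tokenizer (not the dask one)."""
    payload = json.dumps(parts, default=str)
    return hashlib.sha1(payload.encode()).hexdigest()[:16]


def plain_shuffle_token(input_name, on, npartitions):
    # Token an ordinary shuffle might use (assumed shape, for illustration).
    return sketch_token(input_name, on, npartitions)


def hash_join_token(input_name, on, npartitions):
    # Prefixing a fixed marker puts merge shuffles in their own namespace.
    return sketch_token("hash-join", input_name, on, npartitions)


shuffle_id = plain_shuffle_token("df-1", ["x"], 4)
merge_id = hash_join_token("df-1", ["x"], 4)

# Identical inputs, but the marker keeps the two tokens distinct.
print(shuffle_id != merge_id)
# Two identical merges still agree with each other.
print(merge_id == hash_join_token("df-1", ["x"], 4))
```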

If we can rely on name_input_{left/right} to truly be unique and identify distinct input dataframes, I believe we can even share between merges and ordinary shuffles. I wonder where this use case would come up, though.

My main concern here is that shuffle_transfer and merge_transfer do different things. We'd have to refactor this so that they are in fact interchangeable and shareable. I guess there are only a few cases where this would come in handy, though. Maybe something like joining a dataframe on x and aggregating that same dataframe on x within the same graph?

Yes, they do slightly different things at the moment, because the merge layer caused all kinds of trouble before that. Aligning them more closely is on my todo list.

phofl · 2023-10-27T16:00:49Z

thx @hendrikmakait

Reuse shuffle in P2P hash join if it's exactly the same

b21f094

phofl reviewed Oct 26, 2023

hendrikmakait mentioned this pull request Oct 26, 2023

Reuse identical shuffles in P2P hash join dask/distributed#8306

Merged

2 tasks

Better safe than sorry

7def6a4

hendrikmakait commented Oct 26, 2023

hendrikmakait and others added 2 commits October 26, 2023 15:17

Comment

149fc46

Update test_distributed.py

2475f96

phofl approved these changes Oct 27, 2023

phofl merged commit bf8dcbe into dask:main Oct 27, 2023

hendrikmakait mentioned this pull request Dec 15, 2023

Do not reuse shuffles between merges #557

Merged
