Reuse identical shuffles in P2P hash join #8306
Conversation
Unit Test Results. See the test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 27 files ±0, 27 suites ±0, 16h 41m 21s ⏱️ +1h 13m 39s. For more details on these failures, see this check. Results for commit 596adff. ± Comparison against base commit 7d0ba9c. ♻️ This comment has been updated with latest results.
fjetter
left a comment
Changes LGTM. I'm just having a little difficulty wrapping my head around what exactly we're now caching.
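To make the caching question concrete, here is a minimal, purely illustrative sketch of the idea: a shuffle gets a deterministic token derived from its input frame and all of its parameters, and a cache keyed by that token hands back the existing shuffle when an identical one is requested. The helper names (`shuffle_token`, `get_or_create_shuffle`) and the frame names are hypothetical, not the actual distributed/dask-expr API.

```python
import hashlib
import json

# Illustrative cache of already-planned shuffles, keyed by token.
_shuffle_cache: dict = {}

def shuffle_token(input_name, npartitions, on):
    # Identical input frame and identical parameters hash to the same token.
    payload = json.dumps([input_name, npartitions, sorted(on)])
    return hashlib.sha1(payload.encode()).hexdigest()[:16]

def get_or_create_shuffle(input_name, npartitions, on):
    # Reuse an existing shuffle when one with the same token was already set up.
    token = shuffle_token(input_name, npartitions, on)
    if token not in _shuffle_cache:
        _shuffle_cache[token] = object()  # stand-in for the real shuffle state
    return _shuffle_cache[token]

s1 = get_or_create_shuffle("ddf1", 4, ["x"])
s2 = get_or_create_shuffle("ddf1", 4, ["x"])  # same input and parameters: reused
s3 = get_or_create_shuffle("ddf1", 8, ["x"])  # different npartitions: a new shuffle
assert s1 is s2
assert s1 is not s3
```

Only the pairing of input plus parameters is cached; any deviation in either produces a fresh shuffle.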
```python
@gen_cluster(client=True)
async def test_merge_p2p_shuffle_reused_dataframe_with_different_parameters(c, s, a, b):
```
This test passes for me on main. From what I can tell, this is expected, isn't it?
Yes, there's no reuse on main. This test makes sure that we don't reuse too much (which is what originally happened in dask-expr).
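The "don't reuse too much" property can be sketched with a deterministic token: the ID must change whenever any parameter changes, so two logically different shuffles are never collapsed into one. This is an illustrative stdlib-only sketch; the helper and parameter names are hypothetical, not the real implementation.

```python
import hashlib
import json

def shuffle_token(input_name, **params):
    # Any difference in the input frame or in any parameter changes the token.
    payload = json.dumps([input_name, sorted(params.items())])
    return hashlib.sha1(payload.encode()).hexdigest()

base = shuffle_token("ddf1", on="x", npartitions=4)
assert shuffle_token("ddf1", on="x", npartitions=4) == base   # identical: may be reused
assert shuffle_token("ddf1", on="x", npartitions=8) != base   # parameter changed: no reuse
assert shuffle_token("ddf2", on="x", npartitions=4) != base   # different input frame: no reuse
```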
```python
# Generate the same shuffle IDs if the input frame is the same and all its parameters match
assert sum(id_from_key(k) is not None for k in out.dask) == 3
```
This took me a little while to wrap my head around. Maybe an additional comment is helpful.
What's happening here is
- Shuffle1: shuffle ddf1
- Shuffle2: shuffle ddf2
- Those two shuffles will perform the first merge
- Shuffle 3: ?? Which DF is now shuffled and what can be reused?
Maybe a comment in the test would be helpful.
Shuffle 3 is again ddf2? I'm currently wondering why there is a third shuffle. Shouldn't this all be doable with just two?
I've adjusted the test a bit; I mixed something up when moving it from dask-expr. The idea behind using three shuffles is to guarantee that sharing works between multiple merges irrespective of which side of the merge the dataframe is assigned to.
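A hypothetical sketch of that intent: two merges share one dataframe, once as the left input and once as the right, so its shuffle should be planned only once. The frame names and the token helper below are illustrative, not the real test or implementation.

```python
import hashlib
import json

def shuffle_token(frame_name, on):
    # Deterministic ID from the input frame's identity and the join column.
    return hashlib.sha1(json.dumps([frame_name, on]).encode()).hexdigest()[:12]

tokens = [
    shuffle_token("ddf1", "x"),  # left side of merge 1
    shuffle_token("ddf2", "x"),  # right side of merge 1
    shuffle_token("ddf2", "x"),  # left side of merge 2: identical, reused
    shuffle_token("ddf3", "x"),  # right side of merge 2
]
# Four shuffle inputs across both merges, but only three distinct shuffles.
assert len(set(tokens)) == 3
```

Reuse is independent of merge side because the token depends only on the input frame and its parameters, not on where the frame appears in the join.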
fjetter
left a comment
good find!
IFF a shuffle has the exact same input and parameters, P2P hash joins should reuse it.
Identical optimization as in dask/dask-expr#361
- Passes `pre-commit run --all-files`