
Conversation

@hendrikmakait
Member

IFF a shuffle has the exact same input and parameters, P2P hash joins should reuse it.

This is the same optimization as in dask/dask-expr#361.

  • Tests added / passed
  • Passes pre-commit run --all-files
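
To illustrate the reuse condition, here is a hypothetical sketch (not the actual implementation; the helper name and parameters are made up): if the shuffle ID is derived deterministically from the input frame and all shuffle parameters, two merges that shuffle the same frame in the same way end up with the same ID and can share a single P2P shuffle.

```python
from dask.base import tokenize

def hypothetical_shuffle_id(input_token: str, on: tuple, npartitions: int) -> str:
    # Identical input + identical parameters -> identical token, so the second
    # merge can point at the already-scheduled shuffle instead of a new one.
    return "p2p-shuffle-" + tokenize(input_token, on, npartitions)

# Same input and parameters -> same ID (reuse); different parameters -> new ID.
assert hypothetical_shuffle_id("ddf2", ("k",), 10) == hypothetical_shuffle_id("ddf2", ("k",), 10)
assert hypothetical_shuffle_id("ddf2", ("k",), 10) != hypothetical_shuffle_id("ddf2", ("b",), 10)
```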

@github-actions
Contributor

github-actions bot commented Oct 26, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0    27 suites ±0    16h 41m 21s ⏱️ +1h 13m 39s

 3 940 tests  +2      3 813 passed ✔️  −6     117 skipped 💤 ±0    10 failed  +8
49 487 runs  +40     47 109 passed ✔️ +36    2 368 skipped 💤 −4    10 failed  +8

For more details on these failures, see this check.

Results for commit 596adff. ± Comparison against base commit 7d0ba9c.

♻️ This comment has been updated with latest results.

@fjetter fjetter left a comment
Member

Changes LGTM. I'm just having a little difficulty wrapping my head around what exactly we're now caching.



@gen_cluster(client=True)
async def test_merge_p2p_shuffle_reused_dataframe_with_different_parameters(c, s, a, b):
Member

This test passes for me on main. From what I can tell, this is expected, isn't it?

Member Author

Yes, there's no reuse on main. This test makes sure that we don't reuse too much (which is what originally happened in dask-expr).
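
As a hedged sketch of the "don't reuse too much" direction (not the test itself; it assumes the legacy dask.dataframe API where merge accepts shuffle="p2p"): the same frame shuffled with different parameters, here a different join column, must keep separate shuffles.

```python
import pandas as pd
import dask.dataframe as dd

ddf1 = dd.from_pandas(pd.DataFrame({"k": range(100), "a": range(100)}), npartitions=10)
ddf2 = dd.from_pandas(pd.DataFrame({"k": range(100), "b": range(100)}), npartitions=10)

# Same right-hand frame, but shuffled on different columns (different
# parameters): these two merges must NOT share ddf2's shuffle.
out_a = ddf1.merge(ddf2, on="k", shuffle="p2p")                     # ddf2 shuffled on "k"
out_b = ddf1.merge(ddf2, left_on="a", right_on="b", shuffle="p2p")  # ddf2 shuffled on "b"
```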

Comment on lines +144 to +145
# Generate the same shuffle IDs if the input frame is the same and all its parameters match
assert sum(id_from_key(k) is not None for k in out.dask) == 3
Member

This took me a little while to wrap my head around. Maybe an additional comment is helpful.

What's happening here is:

  • Shuffle 1: shuffle ddf1
  • Shuffle 2: shuffle ddf2
  • Those two shuffles will perform the first merge
  • Shuffle 3: ?? Which DF is now shuffled and what can be reused?

Member

Maybe a comment in the test would be helpful.

Member

Shuffle 3 is again ddf2? I'm currently wondering why there is a third shuffle. Shouldn't this all be doable with just two?

Member Author

I've adjusted the test a bit; I mixed something up when moving it from dask-expr. The idea behind using three shuffles is to guarantee that sharing works between multiple merges, irrespective of which side the dataframe is assigned to.
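
For illustration, here is a minimal sketch of that sharing property (not the exact test from this PR). It assumes the legacy dask.dataframe API where merge accepts shuffle="p2p" and that the id_from_key helper referenced in the diff above can be imported from distributed.shuffle._core; each P2P shuffle is expected to contribute exactly one key recognized by it (its barrier), matching the == 3 assertion discussed above.

```python
import pandas as pd
import dask.dataframe as dd

from distributed.shuffle._core import id_from_key  # assumed import location

ddf1 = dd.from_pandas(pd.DataFrame({"k": range(100), "a": range(100)}), npartitions=10)
ddf2 = dd.from_pandas(pd.DataFrame({"k": range(100), "b": range(100)}), npartitions=10)
ddf3 = dd.from_pandas(pd.DataFrame({"k": range(100), "c": range(100)}), npartitions=10)

# First merge shuffles ddf1 and ddf2; the second puts ddf2 on the *left*
# and should reuse its shuffle while adding a new one for ddf3.
out1 = ddf1.merge(ddf2, on="k", shuffle="p2p")
out2 = ddf2.merge(ddf3, on="k", shuffle="p2p")

# Three distinct shuffles in total: ddf1, ddf2 (shared across both merges), ddf3.
keys = set(out1.__dask_graph__()) | set(out2.__dask_graph__())
assert sum(id_from_key(k) is not None for k in keys) == 3
```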

@fjetter fjetter left a comment
Member

good find!

@hendrikmakait hendrikmakait merged commit 1fe2206 into dask:main Oct 30, 2023