Ensure unique shuffle IDs in hash join #360

hendrikmakait · 2023-10-25T18:17:10Z

P2P shuffling requires shuffle IDs to be unique. This PR makes sure that they are unique even if dataframes get reused in the same or several subsequent shuffles with different parameters.
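
To illustrate the requirement, here is a minimal sketch of one way an ID can be made unique per shuffle: hash the input's name together with the shuffle parameters. The helper `shuffle_id`, the input name `"df1-abc123"`, and the parameter names `on` and `npartitions` are hypothetical and chosen for illustration only; this is not the code changed in this PR.

```python
import hashlib
import json


def shuffle_id(input_name: str, **params) -> str:
    """Hypothetical helper: derive an ID from the input name AND the parameters.

    Hashing only the input name would give the same ID to two different
    shuffles of the same dataframe; including the parameters avoids that.
    """
    payload = json.dumps(
        {"input": input_name, "params": params}, sort_keys=True, default=str
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


# The same (hypothetical) dataframe shuffled on different columns.
id_a = shuffle_id("df1-abc123", on=["x"], npartitions=4)
id_b = shuffle_id("df1-abc123", on=["y"], npartitions=4)
# The same dataframe shuffled again with identical parameters.
id_c = shuffle_id("df1-abc123", on=["x"], npartitions=4)

print(id_a != id_b)  # different parameters give different IDs
print(id_a == id_c)  # identical inputs and parameters give the same ID
```

Under this scheme a reused dataframe only shares an ID with another shuffle when the parameters match as well.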

Fixes dask/distributed#8301

Naming ... Minor ... Fix

rjzamora

Change looks good to me. Any ideas for a simple test? Perhaps you can just generate the graph twice (e.g. df1.merge(df2, ...).dask) and compare keys?
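
A toy sketch of the test idea above, assuming a stand-in graph builder rather than the real one: build the graph twice and compare the key sets, then change one join parameter and check that the affected keys differ. `build_join_graph` and its key layout are hypothetical; an actual test would compare the keys of `df1.merge(df2, ...).dask` instead.

```python
import hashlib


def _token(*parts) -> str:
    # Deterministic short token from the given parts.
    return hashlib.sha256("|".join(map(str, parts)).encode()).hexdigest()[:8]


def build_join_graph(left, right, left_on, right_on, npartitions):
    """Hypothetical stand-in for a hash-join graph keyed by (name, partition)."""
    left_shuffle = f"shuffle-{_token(left, left_on, npartitions)}"
    right_shuffle = f"shuffle-{_token(right, right_on, npartitions)}"
    join = f"hashjoin-{_token(left_shuffle, right_shuffle)}"
    graph = {}
    for i in range(npartitions):
        graph[(left_shuffle, i)] = ("shuffle", left, left_on, i)
        graph[(right_shuffle, i)] = ("shuffle", right, right_on, i)
        graph[(join, i)] = ("merge", (left_shuffle, i), (right_shuffle, i))
    return graph


first = build_join_graph("df1", "df2", "a", "b", 2)
second = build_join_graph("df1", "df2", "a", "b", 2)
# Same inputs, but the right-hand side is shuffled on a different column.
other = build_join_graph("df1", "df2", "a", "c", 2)

same_keys = set(first) == set(second)  # building twice is deterministic
shared = set(first) & set(other)       # only the unchanged left shuffle overlaps

print(same_keys, len(shared))
```

In this toy version the right-hand shuffle and the join get new keys when `right_on` changes, while the left-hand shuffle, whose inputs and parameters are unchanged, keeps its keys.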

phofl · 2023-10-26T09:33:50Z

Some context:

@hendrikmakait and I looked into this. The fix is valid and something I missed when initially implementing the HashJoin layer, but it should not have been necessary where this started failing (query 7). The optimization messes something up when we have more than one row group per file. I will look into this.

This will also get a follow-up that is more precise about the left and right keys (similar to what is currently in distributed).

phofl · 2023-10-26T09:34:09Z

thx @hendrikmakait

Fix

2a2cf74

rjzamora reviewed Oct 25, 2023

hendrikmakait added 2 commits October 26, 2023 10:14

Add test

ca8b4f3

Add test

4df130e

hendrikmakait marked this pull request as ready for review October 26, 2023 08:27

phofl approved these changes Oct 26, 2023

phofl merged commit ee6d0a5 into dask:main Oct 26, 2023

hendrikmakait deleted the fix-hashjoin-p2p branch October 26, 2023 09:34

hendrikmakait mentioned this pull request Oct 26, 2023

Reuse identical shuffles in P2P hash join #361

Merged
