Skip to content

Conversation

@hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Oct 25, 2023

P2P shuffling requires shuffle IDs to be unique. This PR makes sure that they are unique even if dataframes get reused in the same or several subsequent shuffles with different parameters.

Fixes dask/distributed#8301

Naming

...

Minor

...

Fix
Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me. Any ideas for a simple test? Perhaps you can just generate the graph twice (e.g. df1.merge(df2, ...).dask) and compare keys?

@hendrikmakait hendrikmakait marked this pull request as ready for review October 26, 2023 08:27
@phofl
Copy link
Collaborator

phofl commented Oct 26, 2023

Some context:

@hendrikmakait and I looked into this. The fix is valid and something I missed when initially implementing the HashJoin layer, but it should not have been necessary where this started failing (query 7). The optimization messes something up when we have more than one row group per file. I will look into this.

This will also get a follow up that is more precise about the left and right keys (similar to what is currently in distributed)

@phofl phofl merged commit ee6d0a5 into dask:main Oct 26, 2023
@phofl
Copy link
Collaborator

phofl commented Oct 26, 2023

thx @hendrikmakait

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exception while computing hashjoin-p2p

3 participants