
DataFrame merge should use HashJoinLayer instead of two separate P2Ps #266

@fjetter

Description

import dask_expr
from distributed import Client
c = Client()
df = dask_expr.datasets.timeseries()
df2 = dask_expr.datasets.timeseries()
sum_x = (
    df
    .merge(df2, on=["name", "id"])
    .x_x.sum()
)
print(sum_x.optimize().tree_repr())
Sum(TreeReduce): split_every=0
  Fused(8cf79):
  | Sum(Chunk):
  |   Projection: columns='x_x'
  |     BlockwiseMerge: left_on=['name', 'id'] right_on=['name', 'id']
  |       Projection: columns=['x', 'name', 'id']
            P2PShuffle: partitioning_index='_partitions' npartitions_out=30 ignore_index=False options=None
              Fused(d0cfd):
              | AssignPartitioningIndex: partitioning_index=['name', 'id'] index_name='_partitions' npartitions_out=30
              |   Timeseries: dtypes={'x': <class 'float'>, 'name': 'string', 'id': <class 'int'>} seed=1132422555
  |       Projection: columns=['name', 'id', 'x']
            P2PShuffle: partitioning_index='_partitions' npartitions_out=30 ignore_index=False options=None
              Fused(f9ad4):
              | AssignPartitioningIndex: partitioning_index=['name', 'id'] index_name='_partitions' npartitions_out=30
              |   Timeseries: dtypes={'name': 'string', 'id': <class 'int'>, 'x': <class 'float'>} seed=946678847

The optimized plan uses a BlockwiseMerge that acts on two separate P2PShuffle operations, but it should instead use a hash join; see dask/distributed#7514 and dask/distributed#7496. Due to a shortcoming in the scheduler heuristics, two concurrent P2P shuffles are not scheduled efficiently; other binary operations can be affected in the same way.
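For comparison, the legacy dask.dataframe API already lowers a P2P-backed merge into a single hash-join layer rather than two independent shuffles (this is what dask/distributed#7514 added). Below is a minimal sketch of how one might inspect that plan, assuming a distributed version that ships the P2P hash join; the shuffle= keyword and the exact layer name may differ between versions:

import dask.datasets
import dask.dataframe  # legacy (non-expr) DataFrame API
from distributed import Client

client = Client()
df = dask.datasets.timeseries(seed=1)
df2 = dask.datasets.timeseries(seed=2)

# Requesting the p2p shuffle method lowers the merge into a single
# hash-join layer instead of two independent P2P shuffles.
joined = df.merge(df2, on=["name", "id"], shuffle="p2p")

# Inspect the high-level graph; with the P2P hash join available, a
# hash-join layer (e.g. HashJoinP2PLayer) should appear here.
for name, layer in joined.__dask_graph__().layers.items():
    print(name, type(layer).__name__)

dask-expr should lower a BlockwiseMerge over two P2PShuffle inputs to the same single-layer hash join when it generates the task graph.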
