
DataFrame merge should use HashJoinLayer instead of two separate P2Ps #266

@fjetter

Description

import dask_expr
from distributed import Client
c = Client()
df = dask_expr.datasets.timeseries()
df2 = dask_expr.datasets.timeseries()
sum_x = (
    df
    .merge(df2, on=["name", "id"])
    .x_x.sum()
)
print(sum_x.optimize().tree_repr())
Sum(TreeReduce): split_every=0
  Fused(8cf79):
  | Sum(Chunk):
  |   Projection: columns='x_x'
  |     BlockwiseMerge: left_on=['name', 'id'] right_on=['name', 'id']
  |       Projection: columns=['x', 'name', 'id']
            P2PShuffle: partitioning_index='_partitions' npartitions_out=30 ignore_index=False options=None
              Fused(d0cfd):
              | AssignPartitioningIndex: partitioning_index=['name', 'id'] index_name='_partitions' npartitions_out=30
              |   Timeseries: dtypes={'x': <class 'float'>, 'name': 'string', 'id': <class 'int'>} seed=1132422555
  |       Projection: columns=['name', 'id', 'x']
            P2PShuffle: partitioning_index='_partitions' npartitions_out=30 ignore_index=False options=None
              Fused(f9ad4):
              | AssignPartitioningIndex: partitioning_index=['name', 'id'] index_name='_partitions' npartitions_out=30
              |   Timeseries: dtypes={'name': 'string', 'id': <class 'int'>, 'x': <class 'float'>} seed=946678847

The optimized plan uses a BlockwiseMerge that acts on two separate P2PShuffle operations, but it should instead use a hash join; see dask/distributed#7514 and dask/distributed#7496. Due to a shortcoming in the scheduler heuristics, two concurrent P2P shuffles are not scheduled efficiently; other binary operations can be affected in the same way.
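For comparison, the legacy dask.dataframe API already lowers a P2P-backed merge into a single hash-join layer rather than two independent shuffles (this is what dask/distributed#7514 added). Below is a minimal sketch of how one might inspect that plan, assuming a distributed version that ships the P2P hash join; the shuffle= keyword and the exact layer name may differ between versions:

import dask.datasets
import dask.dataframe  # legacy (non-expr) DataFrame API
from distributed import Client

client = Client()
df = dask.datasets.timeseries(seed=1)
df2 = dask.datasets.timeseries(seed=2)

# Requesting the p2p shuffle method lowers the merge into a single
# hash-join layer instead of two independent P2P shuffles.
joined = df.merge(df2, on=["name", "id"], shuffle="p2p")

# Inspect the high-level graph; with the P2P hash join available, a
# hash-join layer (e.g. HashJoinP2PLayer) should appear here.
for name, layer in joined.__dask_graph__().layers.items():
    print(name, type(layer).__name__)

dask-expr should lower a BlockwiseMerge over two P2PShuffle inputs to the same single-layer hash join when it generates the task graph.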
