Closed
Description
import dask_expr
from distributed import Client
c = Client()
df = dask_expr.datasets.timeseries()
df2 = dask_expr.datasets.timeseries()
sum_x = (
    df
    .merge(df2, on=["name", "id"])
    .x_x.sum()
)
print(sum_x.optimize().tree_repr())

Sum(TreeReduce): split_every=0
Fused(8cf79):
| Sum(Chunk):
| Projection: columns='x_x'
| BlockwiseMerge: left_on=['name', 'id'] right_on=['name', 'id']
| Projection: columns=['x', 'name', 'id']
P2PShuffle: partitioning_index='_partitions' npartitions_out=30 ignore_index=False options=None
Fused(d0cfd):
| AssignPartitioningIndex: partitioning_index=['name', 'id'] index_name='_partitions' npartitions_out=30
| Timeseries: dtypes={'x': <class 'float'>, 'name': 'string', 'id': <class 'int'>} seed=1132422555
| Projection: columns=['name', 'id', 'x']
P2PShuffle: partitioning_index='_partitions' npartitions_out=30 ignore_index=False options=None
Fused(f9ad4):
| AssignPartitioningIndex: partitioning_index=['name', 'id'] index_name='_partitions' npartitions_out=30
| Timeseries: dtypes={'name': 'string', 'id': <class 'int'>, 'x': <class 'float'>} seed=946678847
The optimized plan above uses a BlockwiseMerge that acts on two P2PShuffle layers, but it should use a HashJoin instead; see dask/distributed#7514 and dask/distributed#7496.
(Because of a shortcoming in the scheduler heuristics, two P2P shuffles are not scheduled efficiently; other binary ops can be affected similarly.)
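
For context, here is a rough pure-Python sketch of what the plan above does logically: each input is hash-partitioned on the join keys (the AssignPartitioningIndex + P2PShuffle steps), and matching partitions are then merged independently (the BlockwiseMerge step). This is only an illustration under simplifying assumptions, not how dask-expr or distributed implement it; the helper names `partition_by_key`, `merge_partition`, and `shuffled_join_sum` are hypothetical.

```python
from collections import defaultdict


def partition_by_key(rows, keys, npartitions):
    """Assign each row to an output partition by hashing its join keys."""
    parts = defaultdict(list)
    for row in rows:
        idx = hash(tuple(row[k] for k in keys)) % npartitions
        parts[idx].append(row)
    return parts


def merge_partition(left_rows, right_rows, keys):
    """Inner-join two co-partitioned lists of rows on `keys`."""
    lookup = defaultdict(list)
    for r in right_rows:
        lookup[tuple(r[k] for k in keys)].append(r)
    out = []
    for l in left_rows:
        for r in lookup.get(tuple(l[k] for k in keys), []):
            # Mimic the pandas-style suffixes for the overlapping column "x".
            out.append({**{k: l[k] for k in keys}, "x_x": l["x"], "x_y": r["x"]})
    return out


def shuffled_join_sum(left, right, keys, npartitions=4):
    """Shuffle both sides independently, merge partition-wise, then sum x_x."""
    lparts = partition_by_key(left, keys, npartitions)   # first "shuffle"
    rparts = partition_by_key(right, keys, npartitions)  # second "shuffle"
    total = 0.0
    for idx in range(npartitions):
        joined = merge_partition(lparts.get(idx, []), rparts.get(idx, []), keys)
        total += sum(row["x_x"] for row in joined)
    return total


# Tiny made-up inputs standing in for the two timeseries frames.
left = [
    {"name": "Alice", "id": 1, "x": 0.5},
    {"name": "Bob", "id": 2, "x": 1.5},
    {"name": "Alice", "id": 1, "x": 2.0},
]
right = [
    {"name": "Alice", "id": 1, "x": 9.0},
    {"name": "Alice", "id": 1, "x": 8.0},
    {"name": "Carol", "id": 3, "x": 7.0},
]
result = shuffled_join_sum(left, right, ["name", "id"])
```

In this sketch the two partitioning passes are fully independent and only meet at the merge step, which mirrors the two separate P2PShuffle nodes feeding BlockwiseMerge in the plan; a single join operation that owns both shuffles is the alternative this issue asks for.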