Imbalanced join workflow #883
@@ -0,0 +1,37 @@
"""
This data represents a skewed but realistic dataset that dask has been struggling with in the past.
Workflow based on https://github.com/coiled/imbalanced-join/blob/main/test_big_join_synthetic.ipynb
"""
import dask.dataframe as dd
import pytest


@pytest.mark.client("imbalanced_join")
def test_merge(client):
    """Merge large df and small df"""
    large_df = dd.read_parquet("s3://test-imbalanced-join/df1/")
    small_df = dd.read_parquet("s3://test-imbalanced-join/df2/")
    # large dataframe has known divisions, use those
    # to ensure the data is partitioned as expected
    divisions = list(range(1, 40002, 10))
    large_df = large_df.set_index("bucket", drop=False, divisions=divisions)
Contributor: actually, when using an ordinary groupby instead of the map_partitions, this
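For readers unfamiliar with explicit divisions, here is a minimal self-contained sketch (invented toy data, not part of this PR) of what the set_index call above does: passing a divisions list pins the partition boundaries instead of letting dask sample the column to guess them.

import pandas as pd
import dask.dataframe as dd

# Invented toy data; "bucket" plays the same role as the index column above.
pdf = pd.DataFrame({"bucket": range(1, 41), "value": range(40)})
ddf = dd.from_pandas(pdf, npartitions=2)

# Explicit divisions tell dask exactly where each partition starts and ends,
# so the shuffle produces a known, reproducible partitioning.
ddf = ddf.set_index("bucket", drop=False, divisions=[1, 11, 21, 31, 40])
print(ddf.divisions)    # (1, 11, 21, 31, 40)
print(ddf.npartitions)  # 4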
    group_cols = ["df2_group", "bucket", "group1", "group2", "group3", "group4"]
    res = large_df.merge(
        right=small_df, how="inner", on=["key", "key"], suffixes=["_l", "_r"]
    )[group_cols + ["value"]]

    # group and aggregate, use split_out so that the final data
    # chunks don't end up aggregating on a single worker
    # TODO: workers are still getting killed, even with split_out
Contributor (author): I wanted to get rid of

Even when it worked in a notebook, the approach without

Contributor: What was the error?

Contributor (author): Yes, 4000 is the number of partitions. It was
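For reference, a small illustrative sketch (invented data and sizes, not from this benchmark) of what split_out changes in a groupby aggregation. The shuffle_method used in the commented-out block just below is not defined in this file and is presumably a leftover parameter from an earlier revision.

import pandas as pd
import dask.dataframe as dd

# Invented toy data, only to show the effect of split_out.
pdf = pd.DataFrame({"group": list("abcd") * 25, "value": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

# By default the aggregated result lands in a single output partition;
# split_out spreads it over several partitions so no single worker has to
# hold (or build) the whole aggregate.
agg = ddf.groupby("group", sort=False).agg({"value": "sum"}, split_out=2)
print(agg.npartitions)  # 2
print(agg.compute())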
    # (
    #     res.groupby(group_cols, sort=False)
    #     .agg({"value": "sum"}, split_out=4000, shuffle=shuffle_method)
    #     .value.sum()
    #     .compute()
    # )

    def aggregate_partition(part):
        return part.groupby(group_cols, sort=False).agg({"value": "sum"})
Contributor (author): This

Contributor: this groupby is a pandas groupby. For actual P2P errors I would like us to investigate.

Contributor (author): Yikes. Of course it is. I was getting errors in the commented
    res.map_partitions(aggregate_partition).value.sum().compute()
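To illustrate the point made in the thread above, map_partitions hands each in-memory partition to the function as a plain pandas DataFrame, so the groupby inside aggregate_partition is a pandas groupby, and the final .value.sum() is still needed as the cross-partition reduction. A minimal sketch on invented toy data (column names loosely mirror the benchmark, values are made up):

import pandas as pd
import dask.dataframe as dd

# Invented toy stand-in for `res`.
pdf = pd.DataFrame(
    {
        "df2_group": ["a", "a", "b", "b"],
        "bucket": [1, 1, 2, 2],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
ddf = dd.from_pandas(pdf, npartitions=2)


def aggregate_partition(part):
    # `part` is a plain pandas DataFrame here, so this is a pandas groupby,
    # not a dask (shuffle-backed) one.
    return part.groupby(["df2_group", "bucket"], sort=False).agg({"value": "sum"})


# Per-partition aggregation, then a global sum as the final reduction.
print(ddf.map_partitions(aggregate_partition).value.sum().compute())  # 10.0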
Just a thought (feel free to ignore), but these datasets seem opaque to me. I don't know how imbalanced they are, or really what this means. Probably the answer is to go to the repository and look at the notes.

An alternative would be to use dask.datasets.timeseries and take the random float columns and put them through some transformation to control skew / cardinality / etc. This would put more control of the situation into this benchmark.

Again, just a thought, not a request.
I think this would indeed be a nice utility function. Maybe even a core functionality of timeseries? I like the idea, but I suggest decoupling this from this PR.
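A rough sketch of the suggestion above (hypothetical, not part of this PR): start from dask.datasets.timeseries and transform one of its uniform float columns into a deliberately skewed key. The column choice, exponent, and scaling below are arbitrary illustrative picks.

import dask.datasets

# The default timeseries frame has uniform float columns "x" and "y" in [-1, 1).
ddf = dask.datasets.timeseries(start="2000-01-01", end="2000-01-31", freq="1s")

# Raising |x| to a power and scaling it yields an integer key whose group sizes
# are heavily skewed: a large share of the rows collapses into a few small key
# values, while large keys are rare. Tuning the exponent controls the imbalance.
ddf = ddf.assign(skewed_key=(1000 * ddf["x"].abs() ** 4).astype(int))

print(ddf.groupby("skewed_key").size().compute().sort_values(ascending=False).head())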