Imbalanced join workflow #883
Conversation
Force-pushed from 13eec01 to 7ab74f8.
Force-pushed from 7ab74f8 to a776239.
large_df = dd.read_parquet("s3://test-imbalanced-join/df1/")
small_df = dd.read_parquet("s3://test-imbalanced-join/df2/")
Just a thought (feel free to ignore), but these datasets seem opaque to me. I don't know how imbalanced they are, or really what that means here. Probably the answer is to go to the repository and look at the notes.
An alternative would be to use dask.datasets.timeseries, take the random float columns, and put them through some transformation to control skew / cardinality / etc. That would give this benchmark more control over the situation.
Again, just a thought, not a request.
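As a rough sketch of that idea (not part of this PR; the power-law transform and the join_key name are made up for illustration):

import dask.datasets

# timeseries() gives uniform float columns ("x", "y") plus "id" and "name"
df = dask.datasets.timeseries(start="2000-01-01", end="2000-01-31")

# hypothetical skew transform: a power law pushes most rows onto a few key
# values; the exponent controls skew, the multiplier controls cardinality
df = df.assign(join_key=((df["x"].abs() ** 4) * 1000).astype(int))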
I think this would indeed be a nice utility function. Maybe even a core functionality of timeseries?
I like the idea, but I suggest decoupling it from this PR.
When I think of an imbalanced join, it is a join where the individual result partitions are imbalanced (e.g., very few customers account for most of the orders in an online shop). Is this dataset imbalanced because one side is larger than the other, or is there actual imbalance on the partitions/join predicates?
For those curious about the dataset, here is the repo and the notebook describing what this data contains: https://github.com/coiled/imbalanced-join/blob/main/simulate_imbalanced_data.ipynb
Force-pushed from e8bee56 to 1f134a5.
# group and aggregate, use split_out so that the final data
# chunks don't end up aggregating on a single worker
# TODO: workers are still getting killed, even with split_out
I wanted to get rid of map_partitions in this workflow. Unfortunately, without map_partitions, even with split_out, I still have workers using too much memory and being killed. Somehow, I was able to get the same code to work on a Coiled cluster in a notebook with split_out=4000, but it might have been a lucky accident, because the workflow still errored.
Even when it worked in a notebook, the approach without map_partitions took 13 minutes to run, vs 4 minutes with map_partitions.
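For context, the split_out variant being described is roughly the following (a sketch; joined is an assumed name for the merged dataframe, while group_cols and the aggregation spec come from the diff hunks quoted in this thread):

# group and aggregate without map_partitions; split_out keeps the result
# spread over many output partitions instead of collapsing onto one worker
result = joined.groupby(group_cols, sort=False).agg(
    {"value": "sum"}, split_out=4000
)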
split_out=4000 kind of makes sense, or rather split_out=df.npartitions makes sense for an aggregation where we do not expect any significant data reduction.
What was the error?
Yes, 4000 is the number of partitions. It was RuntimeError('shuffle_transfer failed during shuffle 3ae09c441cf66f1f37a5178dc6a4e5dc'). Cluster: https://cloud.coiled.io/clusters/230155/information?account=dask-benchmarks&computation=7b39851b-61cd-43e2-ac2d-cda7c01cbed9&sinceMs=1686864101945&untilMs=1686864482058
# )

def aggregate_partition(part):
    return part.groupby(group_cols, sort=False).agg({"value": "sum"})
This groupby is not using shuffle. With shuffle="p2p", I kept getting errors:
RuntimeError('shuffle_barrier failed during shuffle 64ef1a2a338f4af61f6b9cb4ce441355')
This groupby is a pandas groupby. For actual P2P errors, I would like us to investigate.
Yikes. Of course it is. I was getting errors in the commented groupby.
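To spell out that distinction, a sketch (joined is an assumed name for the merged dataframe; aggregate_partition is from the hunk quoted above):

# dask-level groupby: shuffles rows between partitions (tasks or p2p)
dask_agg = joined.groupby(group_cols, sort=False).agg({"value": "sum"})

# pandas-level groupby inside map_partitions: no shuffle at all; it only
# sees one partition at a time, so rows of a group must already be
# co-located (which is what the set_index on "bucket" arranges)
pandas_agg = joined.map_partitions(aggregate_partition)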
Thanks for the clarification on the imbalance, @j-bennet!
fjetter left a comment
Before we merge this I would like to see:
- this running with p2p
- this running with an ordinary groupby / without the set_index
- an investigation of the P2P failure

In the end I would also like to see a comparison between tasks + map_partitions and P2P + groupby, since this is basically "what dask did last year" vs "what dask would do right now".
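A sketch of what that comparison could look like (the shuffle= keyword matches its use elsewhere in this thread; "key" is the join column from the PR description; aggregate_partition and group_cols come from the quoted diff; the rest of the names are assumptions):

# Variant A: "what dask did last year" -- task-based shuffle plus the
# per-partition pandas aggregation (relies on the set_index/divisions
# step quoted elsewhere in this PR to co-locate each group)
a = large_df.merge(small_df, on="key", shuffle="tasks")
a = a.map_partitions(aggregate_partition)

# Variant B: "what dask would do right now" -- P2P shuffle plus an
# ordinary dask groupby
b = large_df.merge(small_df, on="key", shuffle="p2p")
b = b.groupby(group_cols, sort=False).agg(
    {"value": "sum"}, split_out=b.npartitions, shuffle="p2p"
)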
# large dataframe has known divisions, use those
# to ensure the data is partitioned as expected
divisions = list(range(1, 40002, 10))
large_df = large_df.set_index("bucket", drop=False, divisions=divisions)
Actually, when using an ordinary groupby instead of map_partitions, this set_index is not required.
I actually poked around here for a while and this is not the entire truth. What is imbalanced here is the initial partition size. When we calculate the distribution of partition sizes, we can see a very strong imbalance, with partition sizes ranging from a few hundred KiB up to ~3 GiB (as measured by dask.sizeof). However, once the groupby is done, we get an extremely homogeneous partition size of about ~17MM rows each. I'm currently running another job to see how and if the group sizes actually vary. What I can confirm is that P2P doesn't seem to be able to cope with this. I do see OOM failures because some buffers are overflowing. This is something we should look into.
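For anyone who wants to reproduce that measurement, a rough sketch (the dataset path is the one quoted in the diff hunks above):

import dask.dataframe as dd
from dask.sizeof import sizeof

large_df = dd.read_parquet("s3://test-imbalanced-join/df1/")

# per-partition size in bytes as dask.sizeof sees it, plus row counts,
# to quantify how imbalanced the input partitions are
byte_sizes = large_df.map_partitions(sizeof).compute()
row_counts = large_df.map_partitions(len).compute()
print(byte_sizes.min(), byte_sizes.max(), row_counts.min(), row_counts.max())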
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
I think that's why we need that.
A couple of things are wrong here.
Apart from "things are not working / scheduled the way they are supposed to", there is also the question of why set_index + groupby is not successful.
TLDR
Side note: Many / most of the problems I am discussing above can be played with on much smaller data. I ended up reducing the cluster to five workers while also reducing the data sizes by a factor of ten, e.g.

import dask.dataframe as dd
from coiled import Cluster
from distributed import Client

# this effectively reduces the size to 10% while keeping most of the
# imbalance of the dataset still intact
large_df = dd.read_parquet("s3://test-imbalanced-join/df1/").partitions[::10]

cluster_kwargs = {
    "n_workers": 5,
    # Memory optimized VMs are a bit better suited for this workload.
    # We are bottlenecked by network, not CPU load.
    "worker_vm_types": "r6i.2xlarge",
    "compute_purchase_option": "spot_with_fallback",
}
cluster = Cluster(
    **cluster_kwargs,
)
client = Client(cluster)

I suggest not merging this PR since I do not consider having the workaround with

Closes #882.
Benchmark for merging two dataframes, one of which is a lot bigger than the other.
Based on https://github.com/coiled/imbalanced-join/.
To clarify "imbalanced":
- The two dataframes are joined on the key column. It's a many-to-many join, with a lot more records on the left side.
- The joined data is then grouped by group_cols and aggregated. The size of groups is very uneven, with some groups containing hundreds of records and some containing millions (see the sketch after this list).
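To make that unevenness concrete, the group-size distribution can be inspected with something like the following sketch (joined stands for the merged dataframe; group_cols is the benchmark's list of grouping columns, not spelled out here):

# rows per group after the join: a handful of groups hold millions of rows
# while most hold only hundreds
group_sizes = joined.groupby(group_cols, sort=False).size().compute()
print(group_sizes.describe())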