Add tpch benchmark queries #971
Conversation
```python
total.reset_index().compute().sort_values(["l_returnflag", "l_linestatus"])
```
nit: it feels a little like cheating to call compute before sorting. In a perfect world, dask would know the data is small enough to collect and sort locally
That's a good idea; I'll add this to my todo list for dask-expr. I'd like to avoid measuring the intermediate compute, though; that's why I added it in this order.
Sorry, to be more explicit: dask should not sort locally but should reduce to a single worker, sort there, and return the result
tests/benchmarks/test_tpch.py
Outdated
| "s_phone", | ||
| "s_comment", | ||
| ] | ||
| ].persist().sort_values("s_acctbal", ascending=False).head(100) |
Why the persist?
I wonder how common something like this is. It could easily be improved with an optimized top-k-like operation
Same as above: I wanted to avoid the intermediate compute and to have something with persist in here
> That could be easily improved with an optimized topk-like operation
sort+head -> TopK seems like a classic optimization we should probably have at some point. It should be easy to write, too (if we have nlargest around already). If we wanted to onboard someone onto the project, this might be an interesting good first issue (for an experienced dask dev)
No obligation to wait though.
Also, doing the nlargest optimization removes the need for the persist/compute stuff. It'll all turn into a nicely streamable operation.
Also, just to convey optimization thoughts: this is why we do optimizations top-down. We would not want to touch the sort/shuffle/p2p lowering phase at all until the expressions above it (the head) have had an opportunity to make all of that obsolete.
> sort+head -> TopK seems like a classic optimization we should probably have at some point.
Apart from the optimization, I think we currently don't even have an API for this. We have topk for arrays; that's good. But in this case we want to return entire DataFrame rows based on the top k. That's not harder to implement; it's just not there, I think.
My question was not targeted at the optimization but rather at whether we have (or should offer) this API
Maybe you're looking for https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html ?
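For reference, `DataFrame.nlargest` does return entire rows, which is the top-k-over-rows behavior asked about above (toy data):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "s_name": ["a", "b", "c", "d"],
        "s_acctbal": [5.0, 9.0, 1.0, 7.0],
    }
)

# nlargest returns the full rows belonging to the top-n values of a column,
# i.e. a top-k over whole DataFrame rows rather than over a single series.
top = df.nlargest(2, "s_acctbal")
```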
ahhh, thanks 🤦
```python
    .revenue.agg("sum")
    .reset_index()
)
result_df = result_df.compute()
```
Again, a premature compute that feels like cheating.
```python
lineitem_ds = read_data("lineitem")

lineitem_filtered = lineitem_ds[lineitem_ds.l_shipdate <= VAR1]
```
I think this is a little biased towards dask-expr. Ordinarily, we'd point users to using filters in the read_parquet call.
That's fine, I think; just wanted to point it out
> Ordinarily, we'd point users to using filters in the read_parquet call.
I'm pretty comfortable with what's here. I rarely see users use the read_parquet kwargs in practice, even though they should.
More generally, I think we should write things the way we think a naive user would write them, rather than the way they should write them.
Actually, I don't think they should. I've done some measurements with filters in read_parquet, and they are terribly slow if you don't have metadata available for the columns you are filtering on.
I've explicitly avoided pushing the filters into read_parquet in my benchmarks because that was at least three times as slow as doing it with dask/pandas
Well, that's sad and we should investigate. If this is indeed slower, read_parquet is trying to be smart where it shouldn't be, or something is just implemented poorly.
The slowdown is in arrow itself. We aren't doing anything fancy except passing the filters in.
tests/benchmarks/test_tpch.py
Outdated
| "s_phone", | ||
| "s_comment", | ||
| ] | ||
| ].persist().sort_values("s_acctbal", ascending=False).head(100) |
TPC-H query 2 defines the sorting on multiple columns (see https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf)
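A sketch of the multi-column ordering the spec defines for Q2 (`s_acctbal` descending, then `n_name`, `s_name`, `p_partkey` ascending), on made-up data:

```python
import pandas as pd

# Hypothetical slice of a Q2-style result; only the column names and the
# sort order come from the TPC-H spec.
df = pd.DataFrame(
    {
        "s_acctbal": [10.0, 20.0, 20.0],
        "n_name": ["FRANCE", "GERMANY", "FRANCE"],
        "s_name": ["s1", "s2", "s3"],
        "p_partkey": [3, 2, 1],
    }
)

top = df.sort_values(
    ["s_acctbal", "n_name", "s_name", "p_partkey"],
    ascending=[False, True, True, True],
).head(100)
```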
Yep, good call. I played with this and forgot to change it back
tests/benchmarks/test_tpch.py
Outdated
```python
total.reset_index().head(10)[
    ["l_orderkey", "revenue", "o_orderdate", "o_shippriority"]
]
```
I believe this is missing an order-by
Yep, thanks
fjetter left a comment
There is one query where I think we're missing an ordering, but otherwise this LGTM
ok to merge?
No description provided.