
Conversation

@phofl (Contributor) commented Sep 1, 2023:

No description provided.

}
)

total.reset_index().compute().sort_values(["l_returnflag", "l_linestatus"])
Contributor:

nit: it feels a little like cheating to compute before sorting. In a perfect world, dask would know the data is small enough to collect and sort locally.

Contributor Author:

That's a good idea, I'll add this to my todo list for dask-expr. I'd like to avoid measuring the intermediate compute though, that's why I added it in this order.

Contributor:

Sorry, to be more explicit: dask should not sort locally, but reduce to a single worker, sort there, and return the result.
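The reduction being suggested can be sketched in pure Python (function and variable names are illustrative, not dask internals): instead of sorting partitions independently, gather every partition's rows onto one collector and sort once there.

```python
def gather_and_sort(partitions, key_cols):
    # "reduce to a single worker": concatenate all per-partition rows
    rows = [row for part in partitions for row in part]
    # then sort once on the collector and return the result
    return sorted(rows, key=lambda row: tuple(row[c] for c in key_cols))

# toy partitions with the column names from the query above
parts = [
    [{"l_returnflag": "R", "l_linestatus": "F"}],
    [{"l_returnflag": "A", "l_linestatus": "F"}],
]
result = gather_and_sort(parts, ["l_returnflag", "l_linestatus"])
```

For small outputs this avoids a distributed sort entirely; the sort happens on whichever worker collects the pieces.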

"s_phone",
"s_comment",
]
].persist().sort_values("s_acctbal", ascending=False).head(100)
Contributor:

why the persist?

I wonder how common something like this is. That could be easily improved with an optimized topk-like operation

Contributor Author:

Same as above: I wanted to avoid the intermediate compute and have something with persist in here.

Member:

> That could be easily improved with an optimized topk-like operation

sort+head -> TopK seems like a classic optimization we should probably have at some point. Should be easy to write too (if we have nlargest around already). If we wanted to onboard someone onto the project, this might be an interesting good first issue (for an experienced dask dev).
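The sort+head -> TopK rewrite can be sketched with the stdlib `heapq` module (the function and names here are illustrative, not dask internals): each partition reduces to its local top k, and a final reduction picks the top k of those candidates, so no global sort or shuffle is needed.

```python
import heapq

def topk(partitions, k, key):
    # per-partition reduction: each partition contributes at most k candidates
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nlargest(k, part, key=key))
    # final reduction on one worker: top-k of the pooled candidates
    return heapq.nlargest(k, candidates, key=key)

parts = [[3, 9, 1], [7, 2, 8]]
top = topk(parts, 2, key=lambda x: x)
# top == [9, 8]
```

This is also why the operation streams nicely: a partition can be reduced to its k candidates as soon as it is ready, and only O(k * npartitions) rows ever move.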

Member:

No obligation to wait though.

Also, doing the nlargest optimization removes the need for the persist/compute stuff. It'll all turn into a nicely streamable operation.

Member:

Also, just to convey optimization thoughts, this is why we do optimizations top-down. We would not want to touch the sort/shuffle/p2p lowering phase at all until the expressions above it (the head) have an opportunity to make all of that obsolete.

Contributor:

> sort+head -> TopK seems like a classic optimization we should probably have at some point.

apart from the optimization, I think we currently don't even have an API for this. We have topk for arrays. That's good. But in this case we want to return entire DF rows based on the TopK. That's not harder to implement, it's just not there, I think.

My question was not targeted at the optimization, but rather at whether we have (or should offer) this API.

Member:

Contributor:

ahhh, thanks 🤦

.revenue.agg("sum")
.reset_index()
)
result_df = result_df.compute()
Contributor:

Again, a premature compute that feels like cheating.

Comment on lines +27 to +29
lineitem_ds = read_data("lineitem")

lineitem_filtered = lineitem_ds[lineitem_ds.l_shipdate <= VAR1]
Contributor:

I think this is a little bit biased towards dask-expr. Ordinarily, we'd point users to using filters in the read_parquet call.
That's fine, I think; just wanted to point it out.

Member:

> Ordinarily, we'd point users to using filters in the read_parquet call.

I'm pretty comfortable with what's here. I rarely see users use the read_parquet kwargs in practice, even though they should.

More generally, I think that we should write things the way we think naive users would write them, rather than the way they should write them.

Contributor Author:

Actually, I don't think they should. I've done some measurements with filters in read_parquet and they are terribly slow if you don't have metadata available for the columns you are filtering.

I've explicitly avoided pushing the filters into read_parquet in my benchmarks because that was at least 3 times as slow as doing this with dask/pandas
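For reference, a sketch of the two styles being compared (left as comments rather than run here; the path and VAR1 are placeholders standing in for the benchmark's values):

```python
# Sketch only, not executed.
#
# import dask.dataframe as dd
#
# 1) predicate pushed into the reader; pruning relies on parquet row-group
#    statistics being available for the filter column:
# lineitem = dd.read_parquet("lineitem", filters=[("l_shipdate", "<=", VAR1)])
#
# 2) filter applied after reading, as written in this PR; the predicate is
#    evaluated in pandas after the data is loaded:
# lineitem = dd.read_parquet("lineitem")
# lineitem_filtered = lineitem[lineitem.l_shipdate <= VAR1]
```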

Contributor:

Well, that's sad and we should investigate. If this is indeed slower, read_parquet is trying to be smart where it shouldn't be, or something is just implemented poorly.

Contributor Author:

The slowdown is in arrow itself. We aren't doing anything fancy except passing the filters in.

"s_phone",
"s_comment",
]
].persist().sort_values("s_acctbal", ascending=False).head(100)
Contributor:

TPC-H query 2 defines the sort over multiple columns (see https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf)

[screenshot of the ORDER BY clause for query 2 from the TPC-H specification]

Contributor Author:

Yep, good call. I played with this and forgot to change it back.

Comment on lines 137 to 139
total.reset_index().head(10)[
["l_orderkey", "revenue", "o_orderdate", "o_shippriority"]
]
Contributor:

I believe this is missing an order by.

Contributor Author:

Yep, thanks.

@fjetter (Contributor) left a comment:

There is one query where I think we're missing an ordering, but otherwise this LGTM.

@phofl (Contributor Author) commented Oct 4, 2023:

ok to merge?

@fjetter fjetter merged commit 99a8270 into main Oct 4, 2023
@fjetter fjetter deleted the phofl/tpch branch October 4, 2023 10:05

4 participants