Description
I’m getting a hang on TPC-H query 4 pretty reliably (though not every time). The query is:
l <- input_table("lineitem") %>%
  select(l_orderkey, l_commitdate, l_receiptdate) %>%
  filter(l_commitdate < l_receiptdate) %>%
  select(l_orderkey)
o <- input_table("orders") %>%
  select(o_orderkey, o_orderdate, o_orderpriority) %>%
  # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
  filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
  select(o_orderkey, o_orderpriority)
# distinct after join, tested and indeed faster
lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
  distinct() %>%
  select(o_orderpriority)
aggr <- lo %>%
  group_by(o_orderpriority) %>%
  summarise(order_count = n()) %>%
  arrange(o_orderpriority) %>%
  collect()
Basically, filtered lineitems, filtered orders, join those together, group_by, summarise, arrange.
This happens pretty reliably when the input_table is a dataset backed by Parquet or Feather files (e.g. input_table returns something like arrow::open_dataset("path/to/{filename}.feather", format = "feather")).
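For illustration, a dataset-backed input_table could look roughly like the sketch below. The helper shown here and its data_dir argument are assumptions made for this example, not the actual arrowbench implementation, and the tiny orders file is stand-in data so the snippet runs without the TPC-H files.

```r
library(arrow)

# Hypothetical helper: the real input_table in arrowbench may differ.
# It opens a single Feather file as an Arrow Dataset.
input_table <- function(name, data_dir) {
  path <- file.path(data_dir, paste0(name, ".feather"))
  open_dataset(path, format = "feather")
}

# Stand-in data (not TPC-H) written to a temporary directory.
data_dir <- tempfile("tpch_")
dir.create(data_dir)
write_feather(
  data.frame(
    o_orderkey = 1:3,
    o_orderpriority = c("1-URGENT", "2-HIGH", "1-URGENT")
  ),
  file.path(data_dir, "orders.feather")
)

# A Dataset object, evaluated lazily until collect() is called.
orders <- input_table("orders", data_dir)
```

The point of the sketch is only that the query runs against a lazily evaluated Dataset rather than an in-memory table, which is the configuration where the hang shows up.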
One can replicate this by installing an arrowbench branch (voltrondata-labs/arrowbench#37) with, in R: remotes::install_github("ursacomputing/arrowbench@moar-tpch") and then running the following:
library(arrowbench)
results <- run_benchmark(
  tpc_h,
  scale_factor = 1,
  cpu_count = 8,
  query_id = 4,
  lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow r package that supports hash joins and want to avoid building a separate copy.
  format = "feather",
  n_iter = 20
)
Note that this sometimes finishes, but frequently it does not and gets stuck.
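Since the hang is intermittent, it may help to run each attempt in a child R process with a time limit, so a stuck run is killed and reported instead of blocking the session. The following is a minimal sketch using the callr package; run_with_timeout is a hypothetical helper written for this example, and the Sys.sleep call only stands in for a run that never returns.

```r
library(callr)

# Hypothetical helper: run `fun` in a fresh R subprocess and give up after
# `timeout_secs`. callr kills the subprocess on timeout and signals a
# condition of class "callr_timeout_error", which is caught here.
run_with_timeout <- function(fun, timeout_secs) {
  tryCatch(
    list(status = "finished", value = r(fun, timeout = timeout_secs)),
    callr_timeout_error = function(e) list(status = "timed_out", value = NULL)
  )
}

# A call that returns promptly.
quick <- run_with_timeout(function() 1 + 1, timeout_secs = 30)

# A stand-in for a hung run: sleeps far longer than the limit.
stuck <- run_with_timeout(function() Sys.sleep(60), timeout_secs = 2)
```

To apply this to the benchmark, the function passed in would load arrowbench and call run_benchmark with the arguments above. A subprocess is used because a hang inside native code may not respond to an in-process interrupt.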
Reporter: Jonathan Keane / @jonkeane
Assignee: Michal Nowakiewicz / @michalursa
Note: This issue was originally created as ARROW-14197. Please see the migration documentation for further details.