[C++] Hashjoin + datasets hanging

I’m getting a hang on TPC-H query 4 pretty reliably (though not _every_ time). The query is:

```r
l <- input_table("lineitem") %>%
  select(l_orderkey, l_commitdate, l_receiptdate) %>%
  filter(l_commitdate < l_receiptdate) %>%
  select(l_orderkey)

o <- input_table("orders") %>%
  select(o_orderkey, o_orderdate, o_orderpriority) %>%
  # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
  filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
  select(o_orderkey, o_orderpriority)

# distinct after join, tested and indeed faster
lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
  distinct() %>%
  select(o_orderpriority)

aggr <- lo %>%
  group_by(o_orderpriority) %>%
  summarise(order_count = n()) %>%
  arrange(o_orderpriority) %>%
  collect()
```

Basically: filter lineitem, filter orders, join the two, deduplicate, then `group_by`, `summarise`, and `arrange`.

This happens pretty reliably when the `input_table` is a dataset backed by parquet or feather files (e.g. `input_table` returns something like `arrow::open_dataset("path/to/{filename}.feather", format = "feather")`).
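For readers without arrowbench at hand, here is a minimal sketch of what a dataset-backed `input_table` could look like. It is a hypothetical stand-in, not arrowbench's actual implementation: the `data_dir` argument and the `<name>.feather` file layout are assumptions made for illustration.

```r
# Hypothetical stand-in for arrowbench's input_table(); the real helper may differ.
# Assumes one feather file per TPC-H table, named "<name>.feather", under data_dir.
input_table <- function(name, data_dir = "path/to") {
  path <- file.path(data_dir, paste0(name, ".feather"))
  # Returns a lazy Dataset rather than an in-memory table, so the dplyr
  # pipeline above is executed by Arrow's query engine at collect().
  arrow::open_dataset(path, format = "feather")
}
```

With this in place, `input_table("lineitem")` and `input_table("orders")` would feed the query above, provided the files exist at the assumed location.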

One can replicate this by installing an arrowbench branch (https://github.com/ursacomputing/arrowbench/pull/37) with, in R: `remotes::install_github("ursacomputing/arrowbench@moar-tpch")` and then running the following:

```r
library(arrowbench)

results <- run_benchmark(
  tpc_h,
  scale_factor = 1,
  cpu_count = 8,
  query_id = 4,
  lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow r package that supports hash joins and want to avoid building a separate copy.
  format = "feather",
  n_iter = 20
)
```

Note that this will _sometimes_ finish, but frequently it gets stuck and never completes.

**Reporter**: [Jonathan Keane](https://issues.apache.org/jira/browse/ARROW-14197) / @jonkeane
**Assignee**: [Michal Nowakiewicz](https://issues.apache.org/jira/browse/ARROW-14197) / @michalursa
#### Original Issue Attachments:
- [gdb.2.log](https://issues.apache.org/jira/secure/attachment/13034568/gdb.2.log)
- [gdb.log](https://issues.apache.org/jira/secure/attachment/13034566/gdb.log)
- [sample-while-hung.out.txt](https://issues.apache.org/jira/secure/attachment/13034457/sample-while-hung.out.txt)
- [tpch_repro.cc](https://issues.apache.org/jira/secure/attachment/13034571/tpch_repro.cc)
#### PRs and other links:
- [GitHub Pull Request #11335](https://github.com/apache/arrow/pull/11335)
- [GitHub Pull Request #11350](https://github.com/apache/arrow/pull/11350)

<sub>**Note**: *This issue was originally created as [ARROW-14197](https://issues.apache.org/jira/browse/ARROW-14197). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
