[C++] Hashjoin + datasets hanging #29782

@asfimport

Description

I’m getting a hang on TPC-H query 4 fairly reliably, though not on every run. The query is:

  l <- input_table("lineitem") %>%
    select(l_orderkey, l_commitdate, l_receiptdate) %>%
    filter(l_commitdate < l_receiptdate) %>%
    select(l_orderkey)

  o <- input_table("orders") %>%
    select(o_orderkey, o_orderdate, o_orderpriority) %>%
    # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
    filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
    select(o_orderkey, o_orderpriority)

  # distinct after join, tested and indeed faster
  lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
    distinct() %>%
    select(o_orderpriority)

  aggr <- lo %>%
    group_by(o_orderpriority) %>%
    summarise(order_count = n()) %>%
    arrange(o_orderpriority) %>% 
    collect()

In short: filter lineitem, filter orders, join the two, apply distinct, then group_by, summarise, and arrange.

This happens pretty reliably when input_table is a dataset backed by Parquet or Feather files (e.g. input_table returns something like arrow::open_dataset("path/to/{filename}.feather", format = "feather")).
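For readers without arrowbench installed, here is a minimal sketch of what such an input_table helper might look like. The function body, the `dir` argument, and the tiny placeholder data frame are assumptions for illustration only; the real arrowbench helper may differ, and this sketch is not expected to reproduce the hang by itself.

```r
library(arrow)

# Hypothetical stand-in for arrowbench's input_table(); the real helper may differ.
# It opens a single Feather file as an Arrow Dataset.
input_table <- function(name, dir = tempdir()) {
  path <- file.path(dir, paste0(name, ".feather"))
  open_dataset(path, format = "feather")
}

# Placeholder data (not TPC-H) so the sketch runs without the benchmark files.
write_feather(
  data.frame(
    o_orderkey = 1:3,
    o_orderpriority = c("1-URGENT", "2-HIGH", "1-URGENT")
  ),
  file.path(tempdir(), "orders.feather")
)

orders <- input_table("orders")
```

The point is only that the query operates on a Dataset object (scanned lazily from files) rather than on an in-memory table.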

One can replicate this by installing an arrowbench branch (voltrondata-labs/arrowbench#37) in R with remotes::install_github("ursacomputing/arrowbench@moar-tpch") and then running the following:

library(arrowbench)

results <- run_benchmark(
  tpc_h,
  scale_factor = 1,
  cpu_count = 8,
  query_id = 4,
  lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow R package that supports hash joins and want to avoid building a separate copy
  format = "feather",
  n_iter = 20
)

Note that this sometimes finishes, but frequently it does not and gets stuck.

Reporter: Jonathan Keane / @jonkeane
Assignee: Michal Nowakiewicz / @michalursa

Note: This issue was originally created as ARROW-14197. Please see the migration documentation for further details.
