Skip to content

Range/inequality joins are slow #8393

@simonvandel

Description

@simonvandel

Describe the bug

Joins where the ON filter are not equality, but rather inequalities like <, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct "competitor".

The main difference between the DuckDB and Datafusion plans seem to be that Datafusion uses a NestedLoopJoinExec, while DuckDB uses a IEJoin.

Note that the query could be written better with a ASOF-join, but Datafusion does not support that (see issue #318).

To Reproduce

Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:

CREATE
OR REPLACE TABLE pricing AS
SELECT
    t,
    RANDOM() as v
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 30 DAY
    ) ts(t);

COPY pricing to 'pricing.parquet' (format 'parquet');

CREATE
OR REPLACE TABLE timestamps AS
SELECT
    t
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 10 SECOND
    ) ts(t);

COPY timestamps to 'timestamps.parquet' (format 'parquet');
$ duckdb < repro-dataset.sql

We will compare the performance of the following query in DuckDB and Datafusion. The query is saved as repro-range-query.sql.

WITH pricing_state AS (
    SELECT
        t as valid_from,
        COALESCE(
            LEAD(t, 1) OVER (
                ORDER BY
                    t
            ),
            '9999-12-31'
        ) as valid_to,
        v
    FROM
        'pricing.parquet'
)
SELECT
    t.t,
    p.v
FROM
    pricing_state p
    LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from
    AND p.valid_to;

DuckDB performance:

$ time duckdb < repro-range-query.sql
...
real    0m0.999s
user    0m6.070s
sys     0m3.600s

Datafusion performance:

$ time datafusion-cli -f repro-range-query.sql
...
real    0m8.269s
user    0m6.358s
sys     0m1.907s

Expected behavior

It would be nice if the above query (or something equivalent) would be faster in Datafusion.

If someone knows of a better way to express the query, then that could also be a workaround for me.

Additional context

Machine tested on:
CPU:Ryzen 3900x
OS: Ubuntu 22.04

Versions used:

$ duckdb --version
v0.9.2 3c695d7ba9
$ datafusion-cli --version
datafusion-cli 33.0.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions