-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Describe the bug
I am trying to perform a range join on two parquet files with datafusion-cli, but it makes an unrealistic query plan.
select * from points, intervals where pt between first and last order by pt;I try to express the sort order in both parquet files when registering the parquet files with datafusion-cli.
However, I am somehow stuck with expressing this sort order as the with order (columnname ordering) clause seems to be unusable together with parquet files.
To Reproduce
The points table has 1 numeric column and the intervals table has two columns. With duckdb I generated parquet files for both:
copy (select generate_series as pt from (select * from generate_series(1, 100000000)) t) TO 'pts.parquet' (FORMAT PARQUET);
copy (select generate_series as first, first +2 as last from (select * from generate_series (2, 100, 5))) TO 'intervals.parquet' (FORMAT PARQUET);(which I both ran through parquet-rewrite to add page level statistics).
parquet-rewrite --input pts.parquet --output pts_stats.parquet --writer-version 2.0 --statistics-enabled page
parquet-rewrite --input intervals.parquet --output intervals_stats.parquet --writer-version 2.0 --statistics-enabled pageI would expect that the sort order present in both files could help very much to query efficiently (I would pick a sort-merge-join, but without the sort of the files being necessary, as the data is already sorted, so just the merge part with scanning for the overlapping parts of the table).
Hence, I tried to register the parquet files with datafusion-cli and express that the files are already sorted.
This works:
CREATE EXTERNAL TABLE points
STORED AS PARQUET
LOCATION 'pts_stats.parquet';But this does not:
CREATE EXTERNAL TABLE points
STORED AS PARQUET
LOCATION 'pts_stats.parquet'
WITH ORDER (pt ASC)
;It errors with:
Error during planning: Provide a schema before specifying the order while creating a table.
If I try to specify the schema manually:
CREATE EXTERNAL TABLE points
(pt bigint)
STORED AS PARQUET
LOCATION 'pts_stats.parquet'
WITH ORDER (pt ASC)
;It errors with:
Error during planning: Column definitions can not be specified for PARQUET files.
Which is not surprising.
Expected behavior
I would expect that the sort order can be expressed and that the query planner subsequently can execute a sort-merge-join, where the sort step would be unnecessary (as the data of both sides of the join is already sorted).
Additional context
No response