-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do
Assume we have a data lake stores as
table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet
Currently, CreateExternalTable supports defining columns and location (e.g. table/)
a sql query of select * from table where year = '2022' and month = '03' and day = '20' seems to scan all files under table/.
Describe the solution you'd like
CREATE EXTERNAL TABLE test (
c1 VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW
PARTITIONED BY (p1, p2)
LOCATION '/path/to/';
same as existing ListingOption, PARTITIONED BY only supports String
https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/datasource/listing/table.rs#L178
Describe alternatives you've considered
Additional context
partitioned by is also used in Trino and AWS Athena
https://trino.io/episodes/5.html
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
I notice that ListingOptions supports table_partition_cols and also partition pruning, but just CreateExternalTable does not accept such input and pass through
https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/datasource/listing/table.rs#L165-L186
https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/datasource/listing/table.rs#L358-L365