Closed
Labels: enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
- Part of [Epic] Enable parquet metadata cache by default #17000
@nuno-faria implemented the core Parquet metadata caching logic in the following PR:
- feat: Cache Parquet metadata in built in parquet reader #16971
However, it doesn't seem to help certain queries that use statistics. Specifically, I expect that the second time the query is run it should perform no network requests at all, because the ParquetMetadata is already cached:
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.
Describe the solution you'd like
I would like the queries above to go faster by using the ParquetMetaData cache.
Describe alternatives you've considered
I think this is related to the fact that there is a separate path to retrieve statistics for ListingTable, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974
So to fix this issue, I think the statistics path needs to check the FileMetadataCache first, and only fetch the ParquetMetadata from object storage when there is no cached entry.
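To make the proposed "check the cache first" behavior concrete, here is a minimal, self-contained sketch of the lookup pattern. It is not DataFusion's actual API: `FileMetadata`, `MetadataCache`, `get_or_fetch`, and `demo` are hypothetical names invented for illustration, and the closure stands in for the network read of a Parquet footer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache keyed by object path (not the real FileMetadataCache).
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Return the cached metadata for `path`, calling `fetch` only on a miss.
    /// The lock is held across the fetch for simplicity, which serializes
    /// concurrent fetches; a real implementation would likely avoid that.
    pub fn get_or_fetch<F>(&self, path: &str, fetch: F) -> Arc<FileMetadata>
    where
        F: FnOnce(&str) -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            // Cache hit: no fetch is performed.
            return Arc::clone(hit);
        }
        // Cache miss: fetch once, then remember the result.
        let fetched = Arc::new(fetch(path));
        entries.insert(path.to_string(), Arc::clone(&fetched));
        fetched
    }
}

/// Look up the same file twice and report (row count, number of fetches).
pub fn demo() -> (u64, usize) {
    let cache = MetadataCache::default();
    let mut fetches = 0usize;
    let mut rows = 0u64;
    for _ in 0..2 {
        let meta = cache.get_or_fetch("part-0.parquet", |_path| {
            fetches += 1; // counts simulated network reads
            FileMetadata { num_rows: 100 }
        });
        rows = meta.num_rows;
    }
    (rows, fetches)
}
```

In this sketch the second lookup is served from the cache, so the simulated fetch runs only once. The statistics path for ListingTable would need an equivalent check for a repeated `count(*)` to avoid re-reading the footers.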
Additional context
No response