Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
Part of #10453
@Lordworms added a benchmark for extracting statistics from parquet files in #10610
Since this code is used to extract statistics from parquet files, we would like to make sure it is efficient, especially if we are going to extract statistics for many files at once.
The idea here is to improve the speed of the statistics extraction.
Describe the solution you'd like
Make this go faster
cargo bench --bench parquet_statistic
Describe alternatives you've considered
I did some brief profiling:
I think the key would be to change these loops so they build the required Arrow arrays directly from primitive values rather than going through ScalarValue:
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 183 to 189 in 1bf7112
    pub(crate) fn min_statistics<'a, I: Iterator<Item = Option<&'a ParquetStatistics>>>(
        data_type: &DataType,
        iterator: I,
    ) -> Result<ArrayRef> {
        let scalars = iterator
            .map(|x| x.and_then(|s| get_statistic!(s, min, min_bytes, Some(data_type))));
        collect_scalars(data_type, scalars)
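To make the suggestion concrete, here is a rough, std-only sketch of the two approaches. It does not use the real parquet or Arrow types: `ColumnStats`, `Scalar`, `min_via_scalars`, and `min_direct` are all hypothetical stand-ins invented for illustration, and a plain `Vec<Option<i64>>` stands in for an Arrow array. In the real code the direct path would presumably collect into a typed Arrow array instead.

```rust
// Hypothetical stand-in for per-row-group parquet statistics (NOT the real parquet type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnStats {
    Int64 { min: i64, max: i64 },
    Boolean { min: bool, max: bool },
}

// Hypothetical stand-in for a dynamically typed scalar such as ScalarValue.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
    Boolean(Option<bool>),
}

// Indirect path: wrap every minimum in a Scalar, materialize the scalars,
// then unwrap each one again to build the output column.
fn min_via_scalars<'a, I>(stats: I) -> Vec<Option<i64>>
where
    I: Iterator<Item = Option<&'a ColumnStats>>,
{
    let scalars: Vec<Scalar> = stats
        .map(|s| match s {
            Some(ColumnStats::Int64 { min, .. }) => Scalar::Int64(Some(*min)),
            // Missing statistics or a type mismatch become a null entry.
            _ => Scalar::Int64(None),
        })
        .collect();

    scalars
        .into_iter()
        .map(|s| match s {
            Scalar::Int64(v) => v,
            _ => None,
        })
        .collect()
}

// Direct path: dispatch on the statistics type and collect the primitive
// values in one pass, with no intermediate scalar per row group.
fn min_direct<'a, I>(stats: I) -> Vec<Option<i64>>
where
    I: Iterator<Item = Option<&'a ColumnStats>>,
{
    stats
        .map(|s| match s {
            Some(ColumnStats::Int64 { min, .. }) => Some(*min),
            // Missing statistics or a type mismatch become a null entry.
            _ => None,
        })
        .collect()
}
```

Both functions return the same column; the direct version just skips the per-value wrapping and the intermediate `Vec<Scalar>`. Whether that accounts for most of the time in the real benchmark is something the profile would need to confirm.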
Additional context
No response
