Improve performance of extracting statistics from parquet files #10626

@alamb

Is your feature request related to a problem or challenge?

Part of #10453

@Lordworms added a benchmark for extracting statistics from parquet files in #10610

As this code can be used to extract statistics from parquet files, we would like to make sure it is efficient (especially if we are going to extract statistics for many files at once)

The idea here is to improve the speed of the statistics extraction

Describe the solution you'd like

Make the statistics extraction benchmark added in #10610 run faster:

cargo bench --bench parquet_statistic

Describe alternatives you've considered

I did some brief profiling:

(Screenshot of the profiling output, taken 2024-05-22 at 3:37:30 PM)

I think the key would be to change these loops so that they build the required Arrow arrays directly from primitive values rather than from ScalarValue:

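pub(crate) fn min_statistics<'a, I: Iterator<Item = Option<&'a ParquetStatistics>>>(
    data_type: &DataType,
    iterator: I,
) -> Result<ArrayRef> {
    // Each row group's min statistic is first converted to a ScalarValue ...
    let scalars = iterator
        .map(|x| x.and_then(|s| get_statistic!(s, min, min_bytes, Some(data_type))));
    // ... and the ScalarValues are then collected into a single Arrow array.
    collect_scalars(data_type, scalars)
}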
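
To make the idea concrete, here is a rough, standard-library-only sketch of the "build the column directly from primitive values" shape. It is not the DataFusion or arrow-rs API: `Stats`, `Int32Column`, and `min_int32` are hypothetical stand-ins for `ParquetStatistics`, a primitive Arrow array, and a type-specialized version of `min_statistics`. The real change would presumably dispatch on `data_type` and use the matching Arrow array type.

```rust
// Hypothetical stand-in for parquet's per-row-group statistics (NOT the real type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Stats {
    Int32 { min: i32 },
    Int64 { min: i64 },
}

// Hypothetical stand-in for a primitive Arrow array:
// a dense value buffer plus a validity mask (false means null).
#[derive(Debug, PartialEq)]
struct Int32Column {
    values: Vec<i32>,
    valid: Vec<bool>,
}

// Builds the "min" column for an Int32 field straight from the typed values,
// without wrapping each value in a dynamically typed scalar first.
#[allow(dead_code)]
fn min_int32<'a, I>(iter: I) -> Int32Column
where
    I: Iterator<Item = Option<&'a Stats>>,
{
    // Reserve space up front using the iterator's lower size bound.
    let (lower, _) = iter.size_hint();
    let mut values = Vec::with_capacity(lower);
    let mut valid = Vec::with_capacity(lower);

    for stat in iter {
        match stat {
            // Matching physical type: copy the primitive value as-is.
            Some(Stats::Int32 { min }) => {
                values.push(*min);
                valid.push(true);
            }
            // Missing statistics or a mismatched physical type become a null slot.
            _ => {
                values.push(0);
                valid.push(false);
            }
        }
    }

    Int32Column { values, valid }
}
```

Calling `min_int32(stats.iter().map(|s| s.as_ref()))` on a `Vec<Option<Stats>>` yields one slot per row group, with nulls where the statistic is absent. The point of the sketch is that the type check happens once per column type rather than once per value, and no intermediate per-value enum is allocated or matched on during collection.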

Additional context

No response
