[C++][Parquet] min-max Statistics doesn't work well when one of min-max being truncated #43382

@mapleFU

Describe the bug, including details regarding any error messages, version, and platform.

The problem

The min-max statistics may be truncated (in practice, dropped) during write, as in the code below:

    EncodedStatistics chunk_statistics = GetChunkStatistics();
    chunk_statistics.ApplyStatSizeLimits(
        properties_->max_statistics_size(descr_->path()));
    chunk_statistics.set_is_signed(SortOrder::SIGNED == descr_->sort_order());

ApplyStatSizeLimits drops min or max when its length is greater than properties_->max_statistics_size(descr_->path()), which defaults to 4096 bytes.

  // From parquet-mr
  // Don't write stats larger than the max size rather than truncating. The
  // rationale is that some engines may use the minimum value in the page as
  // the true minimum for aggregations and there is no way to mark that a
  // value has been truncated and is a lower bound and not in the page.
  void ApplyStatSizeLimits(size_t length) {
    if (max_.length() > length) {
      has_max = false;
      max_.clear();
    }
    if (min_.length() > length) {
      has_min = false;
      min_.clear();
    }
  }

This code is correct.
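
To make the write-side effect concrete, here is a minimal, self-contained sketch that mirrors the ApplyStatSizeLimits logic quoted above. EncodedStatsSketch and MakeLimited are hypothetical names used only for illustration; they are not part of the Parquet C++ API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the writer-side statistics holder. Only the
// fields needed for this illustration are modeled.
struct EncodedStatsSketch {
  std::string min_;
  std::string max_;
  bool has_min = false;
  bool has_max = false;

  // Same logic as the ApplyStatSizeLimits snippet quoted above: each bound
  // is dropped independently when it exceeds the limit.
  void ApplyStatSizeLimits(std::size_t length) {
    if (max_.length() > length) {
      has_max = false;
      max_.clear();
    }
    if (min_.length() > length) {
      has_min = false;
      min_.clear();
    }
  }
};

// Hypothetical helper: builds stats with an empty min and a max of
// `max_len` bytes, then applies the size limit.
EncodedStatsSketch MakeLimited(std::size_t max_len, std::size_t limit) {
  EncodedStatsSketch s;
  s.min_ = "";
  s.has_min = true;
  s.max_ = std::string(max_len, 'x');
  s.has_max = true;
  s.ApplyStatSizeLimits(limit);
  return s;
}
```

Calling MakeLimited(10000, 4096) leaves has_min set with an empty min, while has_max is cleared, so only one of the two bounds survives the write.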

But the code that consumes these statistics on read looks like this:

template <typename DType>
static std::shared_ptr<Statistics> MakeTypedColumnStats(
    const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
  // If ColumnOrder is defined, return max_value and min_value
  if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
    return MakeStatistics<DType>(
        descr, metadata.statistics.min_value, metadata.statistics.max_value,
        metadata.num_values - metadata.statistics.null_count,
        metadata.statistics.null_count, metadata.statistics.distinct_count,
        metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
        metadata.statistics.__isset.null_count,
        metadata.statistics.__isset.distinct_count);
  }
  // Default behavior
  return MakeStatistics<DType>(
      descr, metadata.statistics.min, metadata.statistics.max,
      metadata.num_values - metadata.statistics.null_count,
      metadata.statistics.null_count, metadata.statistics.distinct_count,
      metadata.statistics.__isset.max || metadata.statistics.__isset.min,
      metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
}

The problem is that || is used to decide whether min-max statistics exist, and the final result carries only a single has_min_max state.

As a result, consider, for example, statistics with:

min: ""
max: "..." <-- a 10000-byte string

What gets stored is has_min: true, min: "", has_max: false. What gets loaded is has_min_max: true, min = "", max = "", which is a bug: the dropped max is read back as an empty string and treated as a valid bound.
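
The collapse from two flags into one can be reproduced with a small sketch. All types and functions here (StoredStats, LoadedStats, LoadWithOr, MayContain) are hypothetical models of the behavior described above, not the real Parquet C++ classes, and MayContain is only an assumed example of how a reader might prune with these statistics.

```cpp
#include <string>

// Hypothetical model of the stored metadata: min and max have independent
// "is set" flags, as with __isset.min_value / __isset.max_value.
struct StoredStats {
  std::string min;
  std::string max;
  bool isset_min = false;
  bool isset_max = false;
};

// Hypothetical model of the loaded statistics: one combined flag.
struct LoadedStats {
  std::string min;
  std::string max;
  bool has_min_max = false;
};

// Mirrors the `||` in MakeTypedColumnStats: the combined flag becomes true
// even when only one of the two bounds was actually written.
LoadedStats LoadWithOr(const StoredStats& s) {
  LoadedStats out;
  out.min = s.min;
  out.max = s.max;
  out.has_min_max = s.isset_max || s.isset_min;
  return out;
}

// Assumed pruning check: returns false when the statistics claim that
// `value` cannot be in the column chunk.
bool MayContain(const LoadedStats& s, const std::string& value) {
  if (!s.has_min_max) return true;  // no usable bounds, cannot prune
  return !(value < s.min || value > s.max);
}
```

With isset_min = true, min = "", and max unset, LoadWithOr reports has_min_max = true with max = "". MayContain then rejects any non-empty value such as "abc", even though the chunk contained values up to the 10000-byte max.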

Solving

This happens because HasMinMax currently means "has min or max". There are two possible solutions:

  1. Change MakeTypedColumnStats to use && rather than ||.
  2. Propose a new API, HasMinAndMax, and use it for pruning.
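
Both options can be sketched in a few lines. This is only an illustration under assumptions: StoredStats, SplitStats, HasMinMaxStrict, LoadSplit, and MayContain are hypothetical names, and HasMinAndMax is the API proposed above rather than an existing one.

```cpp
#include <string>

// Hypothetical model of the stored metadata with independent flags.
struct StoredStats {
  std::string min;
  std::string max;
  bool isset_min = false;
  bool isset_max = false;
};

// Option 1 (sketch): keep a single flag, but set it only when both
// bounds were written.
bool HasMinMaxStrict(const StoredStats& s) {
  return s.isset_min && s.isset_max;
}

// Option 2 (sketch): keep the flags separate and expose the proposed
// HasMinAndMax for callers that need both bounds.
struct SplitStats {
  std::string min;
  std::string max;
  bool has_min = false;
  bool has_max = false;

  bool HasMinAndMax() const { return has_min && has_max; }
};

SplitStats LoadSplit(const StoredStats& s) {
  SplitStats out;
  out.min = s.min;
  out.max = s.max;
  out.has_min = s.isset_min;
  out.has_max = s.isset_max;
  return out;
}

// Assumed pruning check that only trusts the bounds when both exist.
bool MayContain(const SplitStats& s, const std::string& value) {
  if (!s.HasMinAndMax()) return true;  // incomplete bounds, do not prune
  return !(value < s.min || value > s.max);
}
```

For the example above (min set to "", max dropped), HasMinMaxStrict returns false and MayContain no longer prunes "abc". Option 1 is the smaller change; option 2 keeps the information that min alone is still available.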

Component(s)

C++, Parquet
