Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
The problem
The min-max statistics can be dropped during write when they exceed the size limit, as in the code below:
EncodedStatistics chunk_statistics = GetChunkStatistics();
chunk_statistics.ApplyStatSizeLimits(
properties_->max_statistics_size(descr_->path()));
chunk_statistics.set_is_signed(SortOrder::SIGNED == descr_->sort_order());
ApplyStatSizeLimits drops the min or max value (rather than truncating it) if it is larger than properties_->max_statistics_size(descr_->path()), which defaults to 4096 bytes:
// From parquet-mr
// Don't write stats larger than the max size rather than truncating. The
// rationale is that some engines may use the minimum value in the page as
// the true minimum for aggregations and there is no way to mark that a
// value has been truncated and is a lower bound and not in the page.
void ApplyStatSizeLimits(size_t length) {
if (max_.length() > length) {
has_max = false;
max_.clear();
}
if (min_.length() > length) {
has_min = false;
min_.clear();
}
}
This code is correct.
But on the read side, where these statistics are consumed, the code is:
template <typename DType>
static std::shared_ptr<Statistics> MakeTypedColumnStats(
const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
// If ColumnOrder is defined, return max_value and min_value
if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
return MakeStatistics<DType>(
descr, metadata.statistics.min_value, metadata.statistics.max_value,
metadata.num_values - metadata.statistics.null_count,
metadata.statistics.null_count, metadata.statistics.distinct_count,
metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
metadata.statistics.__isset.null_count,
metadata.statistics.__isset.distinct_count);
}
// Default behavior
return MakeStatistics<DType>(
descr, metadata.statistics.min, metadata.statistics.max,
metadata.num_values - metadata.statistics.null_count,
metadata.statistics.null_count, metadata.statistics.distinct_count,
metadata.statistics.__isset.max || metadata.statistics.__isset.min,
metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
}
The problem is that || is used to decide whether the min-max statistics exist, and the final result has only a single has_min_max state.
As a result, consider, for example, statistics with:
min: ""
max: "..." <-- a 10000-byte string
What gets stored is has_min: true, min: "", has_max: false. But the loaded statistics are has_min_max: true, min: "", max: "", so a reader would treat the empty string as the real maximum, which is a bug.
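The round trip described above can be reproduced with a minimal, self-contained sketch. The types `EncodedStats` and `LoadedStats` and the function `LoadWithOr` are hypothetical stand-ins, not the real Parquet C++ classes; they only mirror the drop-if-too-large rule on the write side and the `||` check on the read side.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the writer-side encoded statistics.
struct EncodedStats {
  std::string min, max;
  bool has_min = false;
  bool has_max = false;

  void SetMin(std::string v) { min = std::move(v); has_min = true; }
  void SetMax(std::string v) { max = std::move(v); has_max = true; }

  // Same rule as ApplyStatSizeLimits in the issue: drop an oversized
  // value entirely instead of truncating it.
  void ApplyStatSizeLimits(std::size_t length) {
    if (max.length() > length) { has_max = false; max.clear(); }
    if (min.length() > length) { has_min = false; min.clear(); }
  }
};

// Hypothetical stand-in for the reader-side statistics, which keep
// only one combined flag for min and max.
struct LoadedStats {
  std::string min, max;
  bool has_min_max;
};

// Mirrors the `||` used in MakeTypedColumnStats: the combined flag is
// set when either value was written.
LoadedStats LoadWithOr(const EncodedStats& s) {
  LoadedStats out;
  out.min = s.min;
  out.max = s.max;
  out.has_min_max = s.has_min || s.has_max;
  return out;
}
```

With min set to "" and max set to a 10000-byte string, calling `ApplyStatSizeLimits(4096)` clears max, yet `LoadWithOr` still reports `has_min_max == true` with an empty max.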
Solutions
This happens because HasMinMax currently means "has min or max". Possible solutions:
- Change MakeTypedColumnStats to use && rather than ||.
- Propose a new API, HasMinAndMax, and use it for pruning.
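A rough sketch of how the two options could look is given below. `StoredStats`, `HasMinOrMax`, and `CanSkipChunk` are hypothetical names for illustration only, and the plain `std::string` comparison is an assumption standing in for the column's real sort order. `HasMinAndMax` is the name proposed above; its signature here is a guess.

```cpp
#include <string>

// Hypothetical reader-side view of the stored statistics and their
// per-field "is set" flags.
struct StoredStats {
  std::string min, max;
  bool is_set_min = false;
  bool is_set_max = false;
};

// Current behavior, for comparison: "has min or max".
bool HasMinOrMax(const StoredStats& s) { return s.is_set_min || s.is_set_max; }

// Proposed behavior: min-max counts as present only when both values
// were actually written.
bool HasMinAndMax(const StoredStats& s) { return s.is_set_min && s.is_set_max; }

// Pruning for a predicate `column == value`: returns true when the
// chunk can be skipped. The range is trusted only if it is complete.
bool CanSkipChunk(const StoredStats& s, const std::string& value) {
  if (!HasMinAndMax(s)) return false;  // incomplete range: must read the chunk
  return value < s.min || value > s.max;
}
```

With only min stored, `HasMinOrMax` returns true while `HasMinAndMax` returns false, so `CanSkipChunk` never prunes on a max that was dropped at write time.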
Component(s)
C++, Parquet