Closed
Description
The Parquet spec (in parquet.thrift) says the following about handling of floating-point statistics:
* (*) Because the sorting order is not specified properly for floating
* point values (relations vs. total ordering) the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.

It appears that the dataset code uses the following filter expression when doing Parquet predicate push-down (in file_parquet.cc):

    return and_(greater_equal(field_expr, literal(min)),
                less_equal(field_expr, literal(max)));

A NaN value will fail that filter and yet may be found in the given Parquet column chunk.
We may instead need a "greater_equal_or_nan" comparison that returns true if either value is NaN.
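
To make the proposed semantics concrete, here is a minimal standalone sketch on plain doubles. It is not Arrow code: `GreaterEqualOrNan`, `LessEqualOrNan`, and `MayContain` are hypothetical names chosen for illustration, and an actual fix would have to be expressed as a dataset expression or compute kernel rather than as scalar helpers.

```cpp
#include <cmath>

// Hypothetical helpers, not part of the Arrow API.

// True if value >= bound, or if either operand is NaN.
inline bool GreaterEqualOrNan(double value, double bound) {
  return std::isnan(value) || std::isnan(bound) || value >= bound;
}

// True if value <= bound, or if either operand is NaN.
inline bool LessEqualOrNan(double value, double bound) {
  return std::isnan(value) || std::isnan(bound) || value <= bound;
}

// Conservative check: could a column chunk whose statistics are
// [min, max] contain `value`? A NaN `value` always passes, and a NaN
// min or max is treated as "no bound", following the spec's
// compatibility rules for reading statistics.
inline bool MayContain(double value, double min, double max) {
  return GreaterEqualOrNan(value, min) && LessEqualOrNan(value, max);
}
```

With these semantics, `MayContain(NAN, 1.0, 2.0)` is true, so a chunk with those statistics would not be skipped when looking for NaN. The signed-zero rules need no extra handling here, because IEEE 754 comparison treats -0 and +0 as equal: `MayContain(-0.0, 0.0, 1.0)` is already true.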
Reporter: Antoine Pitrou / @pitrou
Assignee: Sanjiban Sengupta / @sanjibansg
Note: This issue was originally created as ARROW-12264. Please see the migration documentation for further details.