[C++][Dataset] Handle NaNs correctly in Parquet predicate push-down #28074

@asfimport

Description

The Parquet spec (in parquet.thrift) says the following about handling of floating-point statistics:

   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
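As a rough illustration of the rules quoted above, the following standalone sketch applies them to the statistics of a single double-typed column chunk. `DoubleStats`, `SanitizeStats` and `MayContainNaN` are hypothetical names invented for this example and are not part of the Arrow or Parquet C++ API. Note that the zero-widening steps only matter to a consumer that distinguishes -0 from +0 (for instance under a total ordering), since the two compare equal under IEEE 754.

```cpp
#include <cmath>
#include <optional>

// Hypothetical container for the min/max statistics of one double column
// chunk. An empty optional means "bound unknown, do not use it".
struct DoubleStats {
  std::optional<double> min;
  std::optional<double> max;
};

// Apply the parquet.thrift compatibility rules to raw statistics.
DoubleStats SanitizeStats(DoubleStats raw) {
  // A NaN min or max carries no usable information, so it is ignored.
  if (raw.min && std::isnan(*raw.min)) raw.min.reset();
  if (raw.max && std::isnan(*raw.max)) raw.max.reset();
  // A min of +0 may hide -0 values: widen the lower bound to -0.
  if (raw.min && *raw.min == 0.0 && !std::signbit(*raw.min)) raw.min = -0.0;
  // A max of -0 may hide +0 values: widen the upper bound to +0.
  if (raw.max && *raw.max == 0.0 && std::signbit(*raw.max)) raw.max = 0.0;
  return raw;
}

// When looking for NaN values, min and max are ignored: the statistics can
// never rule out the presence of a NaN, so the answer is always "maybe".
bool MayContainNaN(const DoubleStats& /*stats*/) { return true; }
```

For example, `SanitizeStats({std::nan(""), 5.0})` would leave `min` empty while keeping `max` at 5.0.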

It appears that the dataset code uses the following filter expression when doing Parquet predicate push-down (in file_parquet.cc):

    return and_(greater_equal(field_expr, literal(min)),
                less_equal(field_expr, literal(max)));

Because any ordered comparison involving NaN evaluates to false, a NaN value will fail that filter and yet may be found in the given Parquet column chunk.

We may instead need a "greater_equal_or_nan" comparison that returns true if either value is NaN.
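To make the idea concrete, here is a minimal scalar sketch in plain C++ rather than in terms of Arrow expressions. `InRange`, `GreaterEqualOrNaN`, `LessEqualOrNaN` and `MayMatch` are hypothetical helpers written for this illustration; a real fix would presumably need equivalent compute functions or expression logic in the dataset code.

```cpp
#include <cmath>

// Plain range check, mirroring the and_(greater_equal(...), less_equal(...))
// expression: any comparison against NaN is false, so NaN never passes.
bool InRange(double x, double min, double max) {
  return x >= min && x <= max;
}

// Hypothetical NaN-aware comparisons: true if either operand is NaN,
// otherwise the ordinary comparison result.
bool GreaterEqualOrNaN(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a >= b;
}

bool LessEqualOrNaN(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a <= b;
}

// Range check that stays conservative: a NaN value, or a NaN bound,
// can never be used to exclude a value.
bool MayMatch(double x, double min, double max) {
  return GreaterEqualOrNaN(x, min) && LessEqualOrNaN(x, max);
}
```

With statistics of min = 1 and max = 5, `InRange` rejects a NaN value while `MayMatch` accepts it; ordinary values such as 3 and 7 are treated identically by both.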

Reporter: Antoine Pitrou / @pitrou
Assignee: Sanjiban Sengupta / @sanjibansg

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-12264. Please see the migration documentation for further details.
