[C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

The Parquet spec (in parquet.thrift) says the following about handling of floating-point statistics:
```thrift
   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
```
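
As a rough illustration of the first four rules, the raw min/max could be sanitized before any range filter is built from them. This is only a sketch: `FloatBounds` and `SanitizeFloatStats` are hypothetical names, not Parquet or Arrow APIs, and the last rule (ignoring min/max when looking for NaN) still has to be handled by whoever evaluates the filter.

```c++
#include <cmath>
#include <optional>

// Hypothetical holder for usable bounds; not an Arrow or Parquet type.
struct FloatBounds {
  std::optional<double> min;  // empty: no usable lower bound
  std::optional<double> max;  // empty: no usable upper bound
};

// Apply the parquet.thrift compatibility rules to raw min/max statistics.
inline FloatBounds SanitizeFloatStats(double raw_min, double raw_max) {
  FloatBounds out;
  if (!std::isnan(raw_min)) {
    // A min of +0 may hide -0 values, so any zero min is widened to -0.
    out.min = (raw_min == 0.0) ? -0.0 : raw_min;
  }
  if (!std::isnan(raw_max)) {
    // A max of -0 may hide +0 values, so any zero max is widened to +0.
    out.max = (raw_max == 0.0) ? 0.0 : raw_max;
  }
  return out;
}
```

A caller would then emit a lower-bound or upper-bound comparison only for the side that still has a value.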

It appears that the dataset code uses the following filter expression when doing Parquet predicate push-down (in `file_parquet.cc`):
```c++

    return and_(greater_equal(field_expr, literal(min)),
                less_equal(field_expr, literal(max)));
```

Any ordered comparison against NaN evaluates to false, so a NaN value will fail that filter and yet may be found in the given Parquet column chunk.
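
The problem can be reproduced with plain doubles. The helper below is a scalar stand-in for the expression above, not the dataset code itself, and `InStatsRange` is a hypothetical name used only for illustration.

```c++
#include <cmath>

// Scalar analogue of and_(greater_equal(x, min), less_equal(x, max)).
inline bool InStatsRange(double x, double min, double max) {
  // Both comparisons are false when x is NaN, so NaN is never "in range".
  return x >= min && x <= max;
}
```

With statistics of min = 1 and max = 10, `InStatsRange(std::nan(""), 1.0, 10.0)` is false, so a filter of this shape would rule out NaN even if the column chunk contains NaN values.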

We may instead need a "greater_equal_or_nan" comparison that returns true if either value is NaN.
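
A minimal scalar sketch of that idea follows. It assumes plain doubles rather than Arrow expressions or compute kernels; `LessEqualOrNan` and `MayBeInChunk` are hypothetical names introduced here for the symmetric upper-bound check and the combined test.

```c++
#include <cmath>

// True if lhs >= rhs, or if either operand is NaN.
inline bool GreaterEqualOrNan(double lhs, double rhs) {
  return std::isnan(lhs) || std::isnan(rhs) || lhs >= rhs;
}

// Hypothetical counterpart for the upper bound.
inline bool LessEqualOrNan(double lhs, double rhs) {
  return std::isnan(lhs) || std::isnan(rhs) || lhs <= rhs;
}

// Conservative check: may the value x occur in a chunk with these statistics?
inline bool MayBeInChunk(double x, double min, double max) {
  return GreaterEqualOrNan(x, min) && LessEqualOrNan(x, max);
}
```

Under these semantics a NaN value always passes, and a NaN bound is effectively ignored, which matches the spirit of the compatibility rules quoted above while still pruning chunks for ordinary out-of-range values.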

**Reporter**: [Antoine Pitrou](https://issues.apache.org/jira/browse/ARROW-12264) / @pitrou
**Assignee**: [Sanjiban Sengupta](https://issues.apache.org/jira/browse/ARROW-12264) / @sanjibansg
#### Related issues:
- [Specify a well-defined sorting order for float and double types](https://issues.apache.org/jira/browse/PARQUET-1222) (relates to)
#### PRs and other links:
- [GitHub Pull Request #15125](https://github.com/apache/arrow/pull/15125)

<sub>**Note**: *This issue was originally created as [ARROW-12264](https://issues.apache.org/jira/browse/ARROW-12264). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
