GH-28074: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down #15125

sanjibansg · 2022-12-30T09:50:38Z

This PR fixes the issue of handling NaNs in the Parquet predicate push-down.
While computing the valid bounds for a column, if the max or min of the column is null, the range should ignore that.

Closes: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down #28074

github-actions · 2022-12-30T09:51:01Z

https://issues.apache.org/jira/browse/ARROW-12264

westonpace

This looks like the right logic but I think we should be checking for NaN and not using is_valid (which checks for null).

So something like...

bool isNan(const Scalar& scalar) {
  if (IsFloat(scalar)) {
    const FloatScalar& float_scalar = checked_cast<const FloatScalar&>(scalar);
    // Test the wrapped float, not the scalar object itself.
    return std::isnan(float_scalar.value);
  } else if (IsDouble(scalar)) {
    // ...
  } else {
    return false;
  }
}
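
To make the null-versus-NaN distinction concrete, here is a minimal standalone sketch that does not use Arrow. `StatValue`, `MakeNanStat`, and `MakeNullStat` are hypothetical names introduced only for illustration; the point is that a statistic can be valid (non-null) and still be NaN, so a validity check alone would let it through as a usable bound.

```cpp
#include <cmath>
#include <limits>

// Hypothetical stand-in for a floating-point scalar with a validity flag.
// This is NOT Arrow's Scalar class; it only mirrors the null-vs-NaN distinction.
struct StatValue {
  bool is_valid;  // false means the value is null (absent)
  double value;   // meaningful only when is_valid is true
};

// True only for a value that is present and NaN.
inline bool IsNan(const StatValue& v) {
  return v.is_valid && std::isnan(v.value);
}

// A NaN statistic: valid (not null), yet useless as a bound.
inline StatValue MakeNanStat() {
  return {true, std::numeric_limits<double>::quiet_NaN()};
}

// A null statistic: no value present at all.
inline StatValue MakeNullStat() { return {false, 0.0}; }
```

In this sketch `MakeNanStat().is_valid` is true, which is why checking validity alone would not catch the case this PR is about.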

cpp/src/arrow/dataset/file_parquet.cc

westonpace

This looks correct to me. @pitrou did you want to do a quick sanity check?

How hard would it be to mock up some kind of test to check this? Unfortunately, ColumnChunkStatisticsAsExpression isn't really a public method and so it might be tricky coming up with a test case without reproducing an offending parquet file which may not be easy to do.

cpp/src/arrow/dataset/file_parquet.cc

pitrou · 2023-01-03T15:59:00Z

> Unfortunately, ColumnChunkStatisticsAsExpression isn't really a public method and so it might be tricky coming up with a test case without reproducing an offending parquet file which may not be easy to do.

I agree we should ideally unit test ColumnChunkStatisticsAsExpression. Perhaps it can be exposed as an internal API?
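
As an illustration of why a directly callable function is easier to test, the sketch below factors the bounds logic into a small pure function that needs no Parquet file. It is a standalone C++17 example under stated assumptions: `Bounds`, `BoundsFromStatistics`, and `CanSkipGreaterThan` are hypothetical names, not Arrow APIs, and the real function builds an expression rather than a struct.

```cpp
#include <cmath>
#include <optional>

// Hypothetical result type: a bound is present only when it can be trusted.
struct Bounds {
  std::optional<double> lower;
  std::optional<double> upper;
};

// Turn column-chunk min/max statistics into bounds, dropping any NaN bound.
// NaN compares false against every value, so a NaN min or max carries no
// information about where the column's values lie.
inline Bounds BoundsFromStatistics(double min, double max) {
  Bounds b;
  if (!std::isnan(min)) b.lower = min;
  if (!std::isnan(max)) b.upper = max;
  return b;
}

// Returns true when the bounds prove that no value satisfies `x > threshold`,
// meaning the chunk could be skipped. Without a trusted upper bound nothing
// can be proven, so the chunk must be read.
inline bool CanSkipGreaterThan(const Bounds& b, double threshold) {
  return b.upper.has_value() && *b.upper <= threshold;
}
```

With `min = 1.0` and `max = NaN`, only the lower bound survives, so a filter such as `x > 100` cannot skip the chunk; with `max = 5.0`, the same helper reports that `x > 5` can be skipped.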

github-actions · 2023-01-17T11:37:34Z

Closes: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down #28074

github-actions · 2023-01-17T11:37:36Z

⚠️ GitHub issue #28074 has been automatically assigned in GitHub to PR creator.

wjones127 · 2023-01-23T20:31:53Z

I believe I have generated a file that could be used for a test case here: apache/parquet-testing#35

Does that seem sufficient?

westonpace · 2023-02-03T20:59:13Z

Looks like the parquet tests are failing because they can't find the parquet file. Do you maybe need to update the submodule version? I haven't done this recently and am not sure of the right commands.

sanjibansg · 2023-02-04T06:26:52Z

> Looks like the parquet tests are failing because they can't find the parquet file. Do you maybe need to update the submodule version? I haven't done this recently and am not sure of the right commands.

@westonpace
I think I also updated the arrow-testing submodule by mistake while updating parquet-testing. Will that be okay?

westonpace

This looks good. I don't see a problem with updating the arrow testing repo but it might be nice to keep changes isolated if you can.

cpp/src/arrow/dataset/file_parquet.cc

sanjibansg · 2023-02-07T20:23:59Z

> This looks good. I don't see a problem with updating the arrow testing repo but it might be nice to keep changes isolated if you can.

Yes, of course, and sorry, I will be more careful next time.

westonpace

One minor nit

cpp/src/arrow/dataset/file_parquet.cc

westonpace · 2023-02-09T18:36:23Z

Thanks!

ursabot · 2023-02-09T22:28:42Z

Benchmark runs are scheduled for baseline = 0a7e7fb and contender = 518fc51. 518fc51 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Finished ⬇️0.1% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 518fc51e ec2-t3-xlarge-us-east-2
[Failed] 518fc51e test-mac-arm
[Finished] 518fc51e ursa-i9-9960x
[Finished] 518fc51e ursa-thinkcentre-m75q
[Failed] 0a7e7fb1 ec2-t3-xlarge-us-east-2
[Failed] 0a7e7fb1 test-mac-arm
[Failed] 0a7e7fb1 ursa-i9-9960x
[Finished] 0a7e7fb1 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the Component: C++ label Dec 30, 2022

westonpace requested changes Jan 2, 2023

cpp/src/arrow/dataset/file_parquet.cc (outdated, resolved)

sanjibansg force-pushed the ARROW-12264 branch from 1089988 to b5d7d58 Compare January 2, 2023 19:37

westonpace reviewed Jan 3, 2023

pitrou requested changes Jan 3, 2023

sanjibansg requested a review from pitrou January 4, 2023 23:22

sanjibansg force-pushed the ARROW-12264 branch from 156f664 to da8a24f Compare January 5, 2023 23:36

asfimport mentioned this pull request Jan 4, 2023

[C++][Dataset] Handle NaNs correctly in Parquet predicate push-down #28074

Closed

jorisvandenbossche changed the title from "ARROW-12264: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down" to "GH-28074: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down" Jan 17, 2023

sanjibansg force-pushed the ARROW-12264 branch from da8a24f to 3c70550 Compare February 3, 2023 19:06

sanjibansg marked this pull request as ready for review February 3, 2023 19:06

sanjibansg added 8 commits February 4, 2023 11:47

fix: parquet predicate push-down handling with NaNs

7be757d

fix: checking for NaN instead of null values

cf1b41d

fix: using std namespace for nan() function

caeba9b

review: coding convention, type id check, checking validity, comment on usage

e3f2f5e

fix: move if condition for NaN check on min and max

3b19ece

feat: restructure function and add test

3fa5bb7

fix: using parquet file for testing

972bfd7

feat: update submodule for parquet-testing file

8a0b47e

sanjibansg force-pushed the ARROW-12264 branch from d9b24e2 to 8a0b47e Compare February 4, 2023 06:17

westonpace reviewed Feb 6, 2023

cpp/src/arrow/dataset/file_parquet.cc (outdated, resolved)

feat: using const ref arguments with EvaluateStatisticsAsExpression

c274807

sanjibansg force-pushed the ARROW-12264 branch from 6e37e22 to c274807 Compare February 7, 2023 20:32

fix: wrong indirection for statistics object

f07e4ac

sanjibansg requested review from westonpace and removed request for pitrou February 7, 2023 22:05

westonpace requested changes Feb 8, 2023

cpp/src/arrow/dataset/file_parquet.cc (outdated, resolved)

feat: organise code to group together NaN handling

23bc0c0

sanjibansg requested a review from westonpace February 9, 2023 17:14

westonpace approved these changes Feb 9, 2023

westonpace merged commit 518fc51 into apache:master Feb 9, 2023

sanjibansg deleted the ARROW-12264 branch February 9, 2023 18:42

westonpace mentioned this pull request Feb 16, 2023

GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value #34112

Merged
