@sanjibansg sanjibansg commented Dec 30, 2022

This PR fixes the handling of NaNs in Parquet predicate push-down.
While computing the valid bounds for a column, if the column's max or min statistic is null, that bound should be left out of the range.
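
As a rough illustration of the idea in this description, here is a minimal, self-contained sketch of statistics-based pruning that skips any bound it cannot trust. `Bound`, `ColumnStats`, `IsUsable`, and `MayContain` are hypothetical names invented for this example and are not Arrow's API; treating a NaN bound as unusable alongside a null one is an assumption taken from the PR title.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for one bound of a column chunk's statistics.
struct Bound {
  bool present;  // false models a null statistic
  double value;
};

// Hypothetical stand-in for the min/max statistics of a column chunk.
struct ColumnStats {
  Bound min;
  Bound max;
};

// A bound can constrain the range only if it exists and is not NaN:
// every ordered comparison against NaN evaluates to false.
bool IsUsable(const Bound& b) { return b.present && !std::isnan(b.value); }

// Conservative check for the predicate `column == value`: returns false only
// when a usable bound proves the value lies outside the chunk's range.
bool MayContain(const ColumnStats& stats, double value) {
  if (IsUsable(stats.min) && value < stats.min.value) return false;
  if (IsUsable(stats.max) && value > stats.max.value) return false;
  return true;
}

// Usage: unusable bounds never cause a chunk to be skipped.
inline void CheckMayContain() {
  const ColumnStats both{{true, 1.0}, {true, 5.0}};
  const ColumnStats nan_bounds{{true, std::nan("")}, {true, std::nan("")}};
  const ColumnStats null_min{{false, 0.0}, {true, 5.0}};
  assert(MayContain(both, 3.0));
  assert(!MayContain(both, 7.0));
  assert(MayContain(nan_bounds, 7.0));   // NaN bounds are ignored
  assert(MayContain(null_min, -100.0));  // missing min leaves the low side open
  assert(!MayContain(null_min, 6.0));    // the valid max still prunes
}
```

The sketch errs on the side of reading a chunk: dropping a bound can only make the filter less selective, never incorrect.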

@westonpace westonpace left a comment

This looks like the right logic but I think we should be checking for NaN and not using is_valid (which checks for null).

So something like...

bool isNan(const Scalar& scalar) {
  if (IsFloat(scalar)) {
    const FloatScalar& float_scalar = checked_cast<const FloatScalar&>(scalar);
    // Test the wrapped value, not the scalar object itself.
    return std::isnan(float_scalar.value);
  } else if (IsDouble(scalar)) {
    // ...
  } else {
    return false;
  }
}
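
To make the null-versus-NaN distinction in this suggestion concrete, the following is a self-contained sketch that compiles without Arrow. The `TypeId`, `Scalar`, `Int64Scalar`, `FloatScalar`, and `DoubleScalar` types below are simplified, hypothetical stand-ins written for this example (the real Arrow classes differ), and the double branch is one plausible way to complete the elided case, not necessarily what the PR ended up doing.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical, simplified stand-ins for Arrow's scalar classes.
enum class TypeId { kInt64, kFloat, kDouble };

struct Scalar {
  TypeId type;
  bool is_valid;  // false models a null scalar
  Scalar(TypeId t, bool valid) : type(t), is_valid(valid) {}
};

struct Int64Scalar : Scalar {
  long long value;
  explicit Int64Scalar(long long v, bool valid = true)
      : Scalar(TypeId::kInt64, valid), value(v) {}
};

struct FloatScalar : Scalar {
  float value;
  explicit FloatScalar(float v, bool valid = true)
      : Scalar(TypeId::kFloat, valid), value(v) {}
};

struct DoubleScalar : Scalar {
  double value;
  explicit DoubleScalar(double v, bool valid = true)
      : Scalar(TypeId::kDouble, valid), value(v) {}
};

// Null and NaN are independent: a null scalar is never reported as NaN, and
// only floating-point scalars can hold NaN at all.
bool IsNan(const Scalar& scalar) {
  if (!scalar.is_valid) return false;
  switch (scalar.type) {
    case TypeId::kFloat:
      return std::isnan(static_cast<const FloatScalar&>(scalar).value);
    case TypeId::kDouble:
      return std::isnan(static_cast<const DoubleScalar&>(scalar).value);
    default:
      return false;
  }
}

// Usage: a valid NaN passes an is_valid check but is caught by IsNan.
inline void CheckIsNan() {
  const DoubleScalar nan_max(std::nan(""));
  assert(nan_max.is_valid);
  assert(IsNan(nan_max));
  assert(IsNan(FloatScalar(std::nanf(""))));
  assert(!IsNan(DoubleScalar(1.5)));
  assert(!IsNan(Int64Scalar(7)));
}
```

The point of the sketch is that a validity check alone lets a NaN statistic through, which is why a separate NaN test is needed before a bound is used.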

@westonpace westonpace left a comment

This looks correct to me. @pitrou did you want to do a quick sanity check?

How hard would it be to mock up some kind of test to check this? Unfortunately, ColumnChunkStatisticsAsExpression isn't really a public method and so it might be tricky coming up with a test case without reproducing an offending parquet file which may not be easy to do.

pitrou commented Jan 3, 2023

> Unfortunately, ColumnChunkStatisticsAsExpression isn't really a public method and so it might be tricky coming up with a test case without reproducing an offending parquet file which may not be easy to do.

I agree we should ideally unit test ColumnChunkStatisticsAsExpression. Perhaps it can be exposed as an internal API?

@sanjibansg sanjibansg requested a review from pitrou January 4, 2023 23:22
@jorisvandenbossche jorisvandenbossche changed the title from "ARROW-12264: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down" to "GH-28074: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down" Jan 17, 2023
@github-actions

⚠️ GitHub issue #28074 has been automatically assigned in GitHub to PR creator.

@wjones127

I believe I have generated a file that could be used for a test case here: apache/parquet-testing#35

Does that seem sufficient?

@sanjibansg sanjibansg marked this pull request as ready for review February 3, 2023 19:06
@westonpace

Looks like the parquet tests are failing because they can't find the parquet file. Do you maybe need to update the submodule version? I haven't done this recently and am not sure of the right commands.

sanjibansg commented Feb 4, 2023

> Looks like the parquet tests are failing because they can't find the parquet file. Do you maybe need to update the submodule version? I haven't done this recently and am not sure of the right commands.

@westonpace
I think I mistakenly updated the arrow-testing submodule as well while updating parquet-testing. Is that okay?

@westonpace westonpace left a comment

This looks good. I don't see a problem with updating the arrow testing repo but it might be nice to keep changes isolated if you can.

@sanjibansg

> This looks good. I don't see a problem with updating the arrow testing repo but it might be nice to keep changes isolated if you can.

Yes, of course, and sorry, I will be more careful next time.

@sanjibansg sanjibansg requested review from westonpace and removed request for pitrou February 7, 2023 22:05

@westonpace westonpace left a comment

One minor nit

@sanjibansg sanjibansg requested a review from westonpace February 9, 2023 17:14
@westonpace westonpace merged commit 518fc51 into apache:master Feb 9, 2023
@westonpace

Thanks!

@sanjibansg sanjibansg deleted the ARROW-12264 branch February 9, 2023 18:42

ursabot commented Feb 9, 2023

Benchmark runs are scheduled for baseline = 0a7e7fb and contender = 518fc51. 518fc51 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Finished ⬇️0.1% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 518fc51e ec2-t3-xlarge-us-east-2
[Failed] 518fc51e test-mac-arm
[Finished] 518fc51e ursa-i9-9960x
[Finished] 518fc51e ursa-thinkcentre-m75q
[Failed] 0a7e7fb1 ec2-t3-xlarge-us-east-2
[Failed] 0a7e7fb1 test-mac-arm
[Failed] 0a7e7fb1 ursa-i9-9960x
[Finished] 0a7e7fb1 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
