-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-28074: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down #15125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the right logic but I think we should be checking for NaN and not using is_valid (which checks for null).
So something like...
bool isNan(const Scalar& scalar) {
if (IsFloat(scalar)) {
const FloatScalar& float_scalar = checked_cast<const FloatScalar&>(scalar);
return isnan(float_scalar);
} else if (IsDouble(scalar)) {
// ...
} else {
return false;
}
}
1089988 to
b5d7d58
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct to me. @pitrou did you want to do a quick sanity check?
How hard would it be to mock up some kind of test to check this? Unfortunately, ColumnChunkStatisticsAsExpression isn't really a public method and so it might be tricky coming up with a test case without reproducing an offending parquet file which may not be easy to do.
I agree we should ideally unit test |
156f664 to
da8a24f
Compare
|
|
|
I believe I have generated a file that could be used for a test case here: apache/parquet-testing#35 Does that seem sufficient? |
da8a24f to
3c70550
Compare
|
Looks like the parquet tests are failing because they can't find the parquet file. Do you maybe need to update the submodule version? I haven't done this recently and am not sure of the right commands. |
d9b24e2 to
8a0b47e
Compare
@westonpace |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I don't see a problem with updating the arrow testing repo but it might be nice to keep changes isolated if you can.
Yes, of course, and sorry, I will be more careful next time. |
6e37e22 to
c274807
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor nit
|
Thanks! |
|
Benchmark runs are scheduled for baseline = 0a7e7fb and contender = 518fc51. 518fc51 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR fixes the issue of handling NaNs in the Parquet predicate push-down.
While computing the valid bounds for a column, if the max or min of the column is null, the range should ignore that.