-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9827: [C++][Dataset] Skip parsing RowGroup metadata statistics when there is no filter #8037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz ideally we would also skip this if the flilter only involves the partition_expression and not any actual columns of the file. What is the best way to check this? (simplify the filter first with partition_expression ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplifying the filter by the partition expression would work, but it's not safe without loading the physical schema since we need to validate the filter against that. Otherwise a filter without implicit casts (for example) would result in an assertion failure. Let's leave that optimization for a follow up; the right way to do this is probably to rewrite Expression::Assume to return a Result (allowing it to fail non-catastrophically even without a schema against which to validate), but that's a much larger project
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, please add a comment explaining the purpose of the new code inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplifying the filter by the partition expression would work, but it's not safe without loading the physical schema since we need to validate the filter against that. Otherwise a filter without implicit casts (for example) would result in an assertion failure. Let's leave that optimization for a follow up; the right way to do this is probably to rewrite Expression::Assume to return a Result (allowing it to fail non-catastrophically even without a schema against which to validate), but that's a much larger project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } else { | |
| } else { | |
| // since we are not scanning this fragment with a filter, don't bother loading statistics |
0fc54ca to
99d3aec
Compare
|
@bkietz with some delay, added the suggested comment |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
https://issues.apache.org/jira/browse/ARROW-9945 for follow up to add the more general case
|
Thanks for opening the follow-up issue! |
No description provided.