Skip to content

Conversation

@jorisvandenbossche
Copy link
Member

No description provided.

@github-actions
Copy link

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz ideally we would also skip this if the flilter only involves the partition_expression and not any actual columns of the file. What is the best way to check this? (simplify the filter first with partition_expression ?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplifying the filter by the partition expression would work, but it's not safe without loading the physical schema since we need to validate the filter against that. Otherwise a filter without implicit casts (for example) would result in an assertion failure. Let's leave that optimization for a follow up; the right way to do this is probably to rewrite Expression::Assume to return a Result (allowing it to fail non-catastrophically even without a schema against which to validate), but that's a much larger project

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, please add a comment explaining the purpose of the new code inline

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplifying the filter by the partition expression would work, but it's not safe without loading the physical schema since we need to validate the filter against that. Otherwise a filter without implicit casts (for example) would result in an assertion failure. Let's leave that optimization for a follow up; the right way to do this is probably to rewrite Expression::Assume to return a Result (allowing it to fail non-catastrophically even without a schema against which to validate), but that's a much larger project

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
} else {
// since we are not scanning this fragment with a filter, don't bother loading statistics

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-9827-dataset-parquet-skip-statistics branch from 0fc54ca to 99d3aec Compare September 8, 2020 12:19
@jorisvandenbossche
Copy link
Member Author

@bkietz with some delay, added the suggested comment

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

https://issues.apache.org/jira/browse/ARROW-9945 for follow up to add the more general case

@bkietz bkietz closed this in 8c4fa35 Sep 8, 2020
@jorisvandenbossche jorisvandenbossche deleted the ARROW-9827-dataset-parquet-skip-statistics branch September 8, 2020 16:29
@jorisvandenbossche
Copy link
Member Author

Thanks for opening the follow-up issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants