WIP: ARROW-1796: [Python/Parquet] RowGroup filtering on file level #2623
Just as a heads-up: this is missing some more granular tests, and the exact filtering converts to Pandas and back again. Changing that to work on pure Arrow tables will be a lot more work. I will separate out some things into smaller pull requests.
cc @rgruener I think you were one of the interested people in this topic.
rgruener
left a comment
Changes look good, thanks for this!
    if isinstance(filters[0][0], six.string_types):
        # We have encountered the situation where we have one nesting level too few:
        # We have [(,,), ..] instead of [[(,,), ..]]
        filters = [filters]
This automatic fix isn't my favorite, though I see that it is necessary to avoid a breaking change in the API for ParquetDataset (with the filters argument). Perhaps it would be better to throw an error here and apply this fix only in that specific case, instead of allowing a wrong nesting level in all cases.
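To make the suggestion concrete, here is a minimal sketch of the stricter behaviour: a flat list raises by default and is only wrapped when the caller opts in. The function name `normalize_filters` and the `allow_flat` switch are hypothetical and not part of this PR; `str` stands in for `six.string_types`.

```python
def normalize_filters(filters, allow_flat=False):
    """Return filters in the nested form [[(col, op, val), ...], ...].

    `allow_flat` is a hypothetical switch: only a caller that has to stay
    backwards compatible (e.g. ParquetDataset) would pass True.
    """
    if not filters:
        return filters
    first = filters[0]
    # A flat list looks like [(col, op, val), ...]: its first element is a
    # tuple whose first item is the column name, i.e. a string.
    is_flat = len(first) > 0 and isinstance(first[0], str)
    if is_flat:
        if not allow_flat:
            raise ValueError(
                "filters must be a list of lists of tuples, "
                "e.g. [[('x', '=', 1)]]"
            )
        # One nesting level too few: wrap it, as the PR currently does.
        return [filters]
    return filters
```

With this shape, only the ParquetDataset code path would call `normalize_filters(filters, allow_flat=True)`, and every other entry point would reject the wrong nesting level.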
        inner_indexer = (ser < value) & inner_indexer
    elif op == ">":
        inner_indexer = (ser > value) & inner_indexer
    elif op == "in":
What about the operator "not in"?
        return min_value < val
    elif op == ">":
        return max_value > val
    elif op == "in":
What about "not in"?
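For the statistics-based check, "not in" needs some care: min/max values can only rule a row group out when every row provably holds an excluded value. A rough sketch, assuming exact statistics and no nulls in the column; the function name `row_group_may_match` is hypothetical, and the "in" branch is my guess at the intent rather than the PR's code:

```python
def row_group_may_match(min_value, max_value, op, val):
    """Return False only when the statistics prove that no row can match."""
    if op == "<":
        return min_value < val
    elif op == ">":
        return max_value > val
    elif op == "in":
        # At least one candidate must fall inside the [min, max] range.
        return any(min_value <= v <= max_value for v in val)
    elif op == "not in":
        # Safe to skip only if all rows hold one and the same value
        # (min == max) and that value is excluded.
        return not (min_value == max_value and min_value in val)
    raise ValueError("unsupported operator: {!r}".format(op))
```

In practice "not in" will rarely prune anything at the row-group level, so most of its effect would come from the exact row-level filter.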
        self.validate_schemas()
-       if filters:
+       if filters is not None:
filters can now be either List[Tuple] or List[List[Tuple]] and can filter on either partitions or row groups, depending on whether the dataset is partitioned. Either way, the current state should be mentioned in the docstring (unless that will be worked out before this PR is merged).
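As a starting point for that docstring, the nested form could be described with a small reference evaluator. This is only a sketch: it assumes the outer list is read as OR and each inner list as AND (the thread itself only shows the nesting shape), and `matches` and `_OPS` are hypothetical names, not part of the PR.

```python
import operator

# Hypothetical operator table; the set of supported operators is an assumption.
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(row, filters):
    """Evaluate nested filters against a dict of column name -> value.

    Outer list: alternatives, any of which may match (OR).
    Inner list: predicates that must all hold (AND).
    """
    return any(
        all(_OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


# Example: (year == 2018 AND month > 6) OR (year == 2019)
example_filters = [
    [("year", "=", 2018), ("month", ">", 6)],
    [("year", "=", 2019)],
]
```

Here `row` could stand for the partition keys of one file or for a single record, depending on which level the filter is applied at.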
@xhochy is this still WIP? Anything I can help with?
@emkornfield I'm not actively working on this at the moment. Feel free to pick this up or just to integrate parts of the PR into arrow.
Closing this for now. I'll refer to this PR in the course of implementing Scanner logic in the Datasets project.