WIP: ARROW-1796: [Python/Parquet] RowGroup filtering on file level #2623

xhochy · 2018-09-25T09:39:26Z

No description provided.

xhochy · 2018-09-25T09:40:28Z

Just as a heads-up: this is still missing some more granular tests, and the exact filtering currently converts to pandas and back again. Changing that to work on pure Arrow tables will be a lot more work.

I will separate out some things into smaller pull requests.

xhochy · 2018-09-25T09:40:52Z

cc @rgruener, I think you were one of the people interested in this topic.

rgruener

Changes look good, thanks for this!

rgruener · 2018-09-26T17:25:26Z

python/pyarrow/parquet.py

+        if isinstance(filters[0][0], six.string_types):
+            # We have encountered the situation where we have one nesting level too few:
+            # We have [(,,), ..] instead of [[(,,), ..]]
+            filters = [filters]


This automatic fix isn't my favorite, though I see that it is necessary to avoid a breaking change in the API for ParquetDataset (with the filters argument). Perhaps it would be better to throw an error here and apply this fix only in that specific case, instead of allowing a wrong nesting level in all cases.
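
To make the suggestion concrete, here is a minimal sketch of a stricter normalization step. The helper name `normalize_filters` and its `allow_flat` flag are hypothetical and not part of pyarrow; the sketch uses `str` where the diff uses `six.string_types`, so it assumes Python 3.

```python
def normalize_filters(filters, allow_flat=False):
    """Return filters in the nested form [[(col, op, val), ...], ...].

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if not filters:
        return filters
    first = filters[0]
    # A flat list looks like [(col, op, val), ...]: its first element is a
    # predicate whose first item is the column name (a string).
    is_flat = (
        isinstance(first, (tuple, list))
        and len(first) > 0
        and isinstance(first[0], str)
    )
    if not is_flat:
        return filters
    if not allow_flat:
        raise ValueError(
            "filters must be nested as [[(col, op, val), ...], ...]"
        )
    # Only a caller that needs backwards compatibility opts in to wrapping.
    return [filters]
```

With this shape, only the code path that must stay backwards compatible (for example the `ParquetDataset` constructor) would pass `allow_flat=True`, and every other caller would get an error for a wrong nesting level.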

rgruener · 2018-09-26T17:31:32Z

python/pyarrow/parquet.py

+                inner_indexer = (ser < value) & inner_indexer
+            elif op == ">":
+                inner_indexer = (ser > value) & inner_indexer
+            elif op == "in":


What about the operator "not in"?
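
For illustration, a row-level "not in" could follow the same pattern as the branches above by negating `Series.isin`. This is a sketch only: the function name `apply_conjunction` is hypothetical, and the "in" branch is an assumption because the diff excerpt cuts off before it.

```python
import pandas as pd


def apply_conjunction(df, conjunction):
    """Return a boolean mask of rows matching every (col, op, val) predicate.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    inner_indexer = pd.Series(True, index=df.index)
    for col, op, value in conjunction:
        ser = df[col]
        if op == "<":
            inner_indexer = (ser < value) & inner_indexer
        elif op == ">":
            inner_indexer = (ser > value) & inner_indexer
        elif op == "in":
            inner_indexer = ser.isin(list(value)) & inner_indexer
        elif op == "not in":
            # Negate the membership test to keep rows outside the set.
            inner_indexer = ~ser.isin(list(value)) & inner_indexer
        else:
            raise ValueError("unsupported operator: {!r}".format(op))
    return inner_indexer
```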

rgruener · 2018-09-26T17:33:35Z

python/pyarrow/parquet.py

+        return min_value < val
+    elif op == ">":
+        return max_value > val
+    elif op == "in":


What about "not in" here as well?
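
On the statistics level, "not in" is harder to exploit than the other operators: min/max values can only rule out a row group when every row holds the same value and that value is in the excluded set. The sketch below shows one way this might look. The function name `row_group_may_match` is hypothetical, the "in" branch is an assumption (the excerpt ends before it), and null values are ignored.

```python
def row_group_may_match(min_value, max_value, op, val):
    """Return False only if the statistics prove that no row can match.

    Hypothetical helper for illustration only; ignores null values.
    """
    if op == "<":
        return min_value < val
    elif op == ">":
        return max_value > val
    elif op == "in":
        # At least one candidate must fall inside the [min, max] range.
        return any(min_value <= v <= max_value for v in val)
    elif op == "not in":
        # Skippable only when all rows share one value and it is excluded.
        return not (min_value == max_value and min_value in val)
    raise ValueError("unsupported operator: {!r}".format(op))
```

In practice this means "not in" would rarely prune row groups, so the exact row-level filter would still do most of the work for that operator.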

rgruener · 2018-09-26T17:54:27Z

python/pyarrow/parquet.py

            self.validate_schemas()

-        if filters:
+        if filters is not None:


filters can now be either List[Tuple] or List[List[Tuple]], and it filters on either partitions or row groups depending on whether the dataset is partitioned. Either way, the current state should be mentioned in the docstring (unless that will be worked out before this PR is merged).
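
As an example of what such a docstring could illustrate, the sketch below evaluates the nested form against a single record (for instance a dict of partition keys). It assumes the disjunctive normal form convention that pyarrow later documents for `filters`, where the outer list is OR-ed and each inner list is AND-ed; the names `OPS` and `matches` are hypothetical.

```python
import operator

# Operator table for illustration; the set supported in the PR may differ.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(record, filters):
    """Evaluate nested filters: OR over the outer list, AND over each inner list."""
    return any(
        all(OPS[op](record[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


# (year == 2018 AND month > 6) OR (year == 2019)
filters = [[("year", "=", 2018), ("month", ">", 6)], [("year", "=", 2019)]]
print(matches({"year": 2018, "month": 9}, filters))
```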

emkornfield · 2019-02-24T08:37:32Z

@xhochy is this still WIP? Anything I can help with?

xhochy · 2019-03-03T11:54:05Z

@emkornfield I'm not actively working on this at the moment. Feel free to pick this up or just to integrate parts of the PR into arrow.

wesm · 2019-06-14T18:38:50Z

Closing this for now. I'll refer to this PR in the course of implementing Scanner logic in the Datasets project.

ARROW-1796: [Python/Parquet] RowGroup filtering on file level

ae959e7

rgruener reviewed Sep 26, 2018

xhochy mentioned this pull request Sep 29, 2018

ARROW-3363: [C++/Python] Add helper functions to detect scalar Python types #2659

Closed

wesm force-pushed the master branch from 3088183 to 0c6b2d2 Compare February 18, 2019 19:34

wesm closed this Jun 14, 2019

asfimport mentioned this pull request Jun 11, 2020

[Python] RowGroup filtering on file level #17793

Closed
