Python: Add visitor to DNF expr into Dask/PyArrow format #6566

Fokko · 2023-01-11T17:03:38Z

After reading https://www.coiled.io/blog/parquet-file-column-pruning-predicate-pushdown I noticed that we still need to filter everything.

rdblue · 2023-01-16T17:23:09Z

python/pyiceberg/expressions/visitors.py

+        return [(term.ref().field.name, "<=", literal.value)]
+
+    def visit_true(self) -> List[Tuple[str, str, Any]]:
+        return []  # Not supported


I think this is okay. Zero filters basically do the same thing as true. The problem is converting false into the same thing. I think for that, we should throw an exception because it cannot be safely handled.
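The asymmetry between true and false comes from how a conjunction of filters is evaluated. A minimal sketch in plain Python, where `to_filters`, `ALWAYS_TRUE`, and `ALWAYS_FALSE` are hypothetical names used only for illustration and are not the pyiceberg API:

```python
from typing import Any, List, Tuple

Filter = Tuple[str, str, Any]

# Hypothetical sentinels standing in for the always-true / always-false expressions.
ALWAYS_TRUE = "always_true"
ALWAYS_FALSE = "always_false"


def to_filters(expr: Any) -> List[Filter]:
    """Convert one expression into a conjunction (AND) of filter tuples."""
    if expr == ALWAYS_TRUE:
        # An empty conjunction matches every row, which is exactly "true".
        return []
    if expr == ALWAYS_FALSE:
        # Returning [] here would silently turn "no rows" into "all rows",
        # so fail loudly instead.
        raise ValueError("Cannot express an always-false predicate as filter tuples")
    if isinstance(expr, tuple) and len(expr) == 3:
        return [expr]
    raise TypeError(f"Unsupported expression: {expr!r}")


# all() over an empty sequence is True, mirroring the empty filter list.
assert all(check for check in []) is True
```

Raising keeps the caller from reading the whole table when the predicate says nothing should match.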

I'm okay with raising an error. Keep in mind that for Dask (and also PyArrow) this is only used to skip Parquet pages; we still have to do a final pass to filter at the row level anyway:

(from the abovementioned blog)

This also allows us to rewrite an expression to read PyArrow tables with filters: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
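For reference, the linked `read_table` documentation describes `filters` as a list of tuples or a list of lists of tuples in disjunctive normal form: the outer list is OR, each inner list is AND. The sketch below shows how such a structure can be interpreted for the row-level final pass. It is plain Python with a hypothetical `matches` helper, not what Dask or PyArrow run internally, and it covers only the simple comparison operators.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Filter = Tuple[str, str, Any]

# Comparison operators supported by this sketch.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row: Dict[str, Any], dnf: List[List[Filter]]) -> bool:
    """Return True if the row satisfies any conjunction in the DNF filter."""
    return any(
        all(OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in dnf
    )


# (x <= 5 AND y == "a") OR (x > 100)
dnf = [[("x", "<=", 5), ("y", "==", "a")], [("x", ">", 100)]]
rows = [{"x": 1, "y": "a"}, {"x": 1, "y": "b"}, {"x": 200, "y": "b"}]
kept = [row for row in rows if matches(row, dnf)]
```

Here `kept` holds the first and third rows; the second fails both conjunctions.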

rdblue · 2023-01-27T18:51:17Z

python/pyiceberg/expressions/visitors.py

+        Formatter filter compatible with Dask and PyArrow
+    """
+    # In the form of expr1 ∨ expr2 ∨ ... ∨ exprN
+    return [visit(expression, ExpressionToPlainFormat()) for expression in expressions]


Should this reuse the ExpressionToPlainFormat instance?

Yes, good call! I've pulled this out of the loop
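A rough sketch of that change, assuming the visitor holds no per-expression state (the class and method names below are hypothetical stand-ins, not the pyiceberg code):

```python
from typing import Any, List, Tuple

Filter = Tuple[str, str, Any]


class PlainFormatVisitor:
    """Hypothetical stateless visitor; stands in for ExpressionToPlainFormat."""

    def convert(self, expression: Filter) -> List[Filter]:
        # Each expression becomes one conjunction of filter tuples.
        return [expression]


def to_dnf(expressions: List[Filter]) -> List[List[Filter]]:
    # Create the visitor once and reuse it for every expression in the disjunction.
    visitor = PlainFormatVisitor()
    return [visitor.convert(expression) for expression in expressions]
```

Reuse is only safe when the visitor keeps no state between calls; a stateful visitor would need a fresh instance per expression.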

rdblue · 2023-01-27T23:03:41Z

python/pyiceberg/expressions/visitors.py

+        return [(term.ref().field.name, "==", float("nan"))]
+
+    def visit_not_nan(self, term: BoundTerm[L]) -> List[Tuple[str, str, Any]]:
+        return [(term.ref().field.name, "!=", float("nan"))]


NaN is always not equal to itself, at least in Java. Are we sure that this works?
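Python floats follow the same IEEE 754 rule, which the short check below demonstrates. It says nothing about how Dask or PyArrow treat a NaN literal inside a filter tuple; that still needs a test against those libraries. The `is_nan` helper is a hypothetical name for illustration.

```python
import math

nan = float("nan")

# IEEE 754 comparison semantics: NaN never compares equal, not even to itself.
assert (nan == nan) is False
assert (nan != nan) is True


def is_nan(value: object) -> bool:
    """Explicit NaN check that does not rely on equality."""
    return isinstance(value, float) and math.isnan(value)


values = [1.0, nan, 3.5]
nan_by_equality = [v for v in values if v == float("nan")]  # always empty
nan_by_check = [v for v in values if is_nan(v)]  # contains the single NaN
```

An equality-based filter evaluated this way would select nothing for `==` and everything for `!=`, which is why the concern above matters.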

rdblue · 2023-01-27T23:04:07Z

python/pyiceberg/expressions/visitors.py

+        return [(term.ref().field.name, "==", None)]
+
+    def visit_not_null(self, term: BoundTerm[L]) -> List[Tuple[str, str, Any]]:
+        return [(term.ref().field.name, "!=", None)]


I don't see anything in the docs that indicates this is the right way to pass this, so we should make sure there are tests for it.

I agree, and I have #6398 lined up to test exactly this. I'll revive the PR tomorrow.

rdblue

Looks good, but this is undertested until we get the follow-up PR in.

github-actions bot added the python label Jan 11, 2023

Fokko force-pushed the fd-format-into-dash-format branch from 0b0ba26 to ff5fc81 Compare January 11, 2023 22:39

Python: Add visitor to DNF expr into Dask format

816cbe5

Fokko force-pushed the fd-format-into-dash-format branch from ff5fc81 to 816cbe5 Compare January 12, 2023 13:14

rdblue reviewed Jan 16, 2023

Fokko added this to the Python 0.4.0 release milestone Jan 25, 2023

Fokko added 3 commits January 25, 2023 22:27

Merge branch 'master' of github.com:apache/iceberg into fd-format-int…

ae7ee93

…o-dash-format

Raise exception

9452f00

Fix naming

b44504e

Fokko changed the title ~~Python: Add visitor to DNF expr into Dask format~~ Python: Add visitor to DNF expr into Dask/PyArrow format Jan 25, 2023

Fokko added 2 commits January 26, 2023 10:15

Merge branch 'master' of github.com:apache/iceberg into fd-format-int…

3a0f97d

…o-dash-format

Fix naming (again)

c7bbe18

Fokko force-pushed the fd-format-into-dash-format branch from cb97baa to c7bbe18 Compare January 26, 2023 17:10

Fokko mentioned this pull request Jan 27, 2023

Python: Optimize PyArrow reads #6673

Merged

rdblue reviewed Jan 27, 2023

rdblue approved these changes Jan 27, 2023

rdblue merged commit a76724f into apache:master Jan 27, 2023

Fokko deleted the fd-format-into-dash-format branch January 30, 2023 21:19

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023

Python: Add visitor to DNF expr into Dask/PyArrow format (apache#6566)

dcfc874
