-
Notifications
You must be signed in to change notification settings - Fork 3k
Python: Add visitor to DNF expr into Dask/PyArrow format #6566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0b0ba26 to
ff5fc81
Compare
ff5fc81 to
816cbe5
Compare
| return [(term.ref().field.name, "<=", literal.value)] | ||
|
|
||
| def visit_true(self) -> List[Tuple[str, str, Any]]: | ||
| return [] # Not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay. Zero filters basically do the same thing as true. The problem is converting false into the same thing. I think for that, we should throw an exception because it cannot be safely handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also allows us to rewrite an expression to read PyArrow tables with filters: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
cb97baa to
c7bbe18
Compare
| Formatter filter compatible with Dask and PyArrow | ||
| """ | ||
| # In the form of expr1 ∨ expr2 ∨ ... ∨ exprN | ||
| return [visit(expression, ExpressionToPlainFormat()) for expression in expressions] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this reuse the ExpressionToPlainFormat instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good call! I've pulled this out of the loop
| return [(term.ref().field.name, "==", float("nan"))] | ||
|
|
||
| def visit_not_nan(self, term: BoundTerm[L]) -> List[Tuple[str, str, Any]]: | ||
| return [(term.ref().field.name, "!=", float("nan"))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NaN is always not equal to itself, at least in Java. Are we sure that this works?
| return [(term.ref().field.name, "==", None)] | ||
|
|
||
| def visit_not_null(self, term: BoundTerm[L]) -> List[Tuple[str, str, Any]]: | ||
| return [(term.ref().field.name, "!=", None)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything in the docs that indicate this is the right way to pass this, so we should make sure there are tests for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and I have #6398 lined up to exactly test this. I'll revive the PR tomorrow.
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but is under tested until we get the follow up PR in.

After reading https://www.coiled.io/blog/parquet-file-column-pruning-predicate-pushdown I noticed that we still need to filter everything.