Contributor

@Fokko Fokko commented Jan 11, 2023

After reading https://www.coiled.io/blog/parquet-file-column-pruning-predicate-pushdown, I noticed that we still need to apply the filter to every row ourselves after the read.

@Fokko Fokko force-pushed the fd-format-into-dash-format branch from 0b0ba26 to ff5fc81 Compare January 11, 2023 22:39
@Fokko Fokko force-pushed the fd-format-into-dash-format branch from ff5fc81 to 816cbe5 Compare January 12, 2023 13:14
        return [(term.ref().field.name, "<=", literal.value)]

    def visit_true(self) -> List[Tuple[str, str, Any]]:
        return []  # Not supported
Contributor

I think this is okay. Zero filters basically do the same thing as true. The problem is converting false into the same thing. I think for that, we should throw an exception because it cannot be safely handled.
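
The asymmetry between the two constants can be sketched in isolation. The class below is a hypothetical stand-in (the name `PlainFilterSketch` and the `visit_false` body are assumptions for illustration, not the code in this PR): an always-true predicate maps to an empty list of filter tuples, while an always-false predicate has no tuple representation and therefore raises.

```python
from typing import Any, List, Tuple

FilterTuple = Tuple[str, str, Any]


class PlainFilterSketch:
    """Hypothetical stand-in for a visitor that emits (column, op, value) tuples."""

    def visit_true(self) -> List[FilterTuple]:
        # An empty conjunction places no constraint on a row, so it
        # behaves like an always-true predicate.
        return []

    def visit_false(self) -> List[FilterTuple]:
        # Returning [] here would silently turn "match nothing" into
        # "match everything", so fail loudly instead.
        raise ValueError("An always-false predicate cannot be expressed as filter tuples")


sketch = PlainFilterSketch()
no_constraints = sketch.visit_true()

# all() over an empty sequence is True: zero filters accept every row.
accepts_any_row = all(False for _ in no_constraints)
```

Raising keeps the conversion safe: a caller that receives a list of tuples can rely on it never widening the original predicate.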

Contributor Author

@Fokko Fokko Jan 25, 2023

I'm okay with raising an error. Keep in mind that for Dask (and also PyArrow) this is only used to skip Parquet pages; we have to do a final pass to filter at the row level anyway:

image

(from the above-mentioned blog)
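
The two-stage behaviour described in this comment can be illustrated without any Parquet library. The sketch below is a pure-Python simulation under stated assumptions: `make_chunk` and `read_with_pushdown` are hypothetical helpers, and each "chunk" stands in for a unit of Parquet data that carries min/max statistics. It is not how Dask or PyArrow are implemented.

```python
from typing import Any, Dict, List, Tuple


def make_chunk(values: List[int]) -> Dict[str, Any]:
    """Hypothetical chunk: rows plus the min/max statistics used for skipping."""
    return {"min": min(values), "max": max(values), "rows": values}


def read_with_pushdown(chunks: List[Dict[str, Any]], upper: int) -> Tuple[List[int], List[int]]:
    """Return (rows_read, rows_kept) for the predicate x <= upper."""
    # Stage 1: skip chunks whose statistics prove that no row can match.
    candidates = [c for c in chunks if c["min"] <= upper]
    rows_read = [v for c in candidates for v in c["rows"]]
    # Stage 2: surviving chunks can still hold non-matching rows,
    # so the predicate is applied again to every row.
    rows_kept = [v for v in rows_read if v <= upper]
    return rows_read, rows_kept


chunks = [make_chunk([1, 2, 3]), make_chunk([3, 8, 9]), make_chunk([10, 11])]
rows_read, rows_kept = read_with_pushdown(chunks, upper=4)
print(rows_read)  # [1, 2, 3, 3, 8, 9]
print(rows_kept)  # [1, 2, 3, 3]
```

The third chunk is never read, but the second one is read in full even though only one of its rows matches, which is why the row-level pass is still needed.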

Contributor Author

This also allows us to rewrite an expression to read PyArrow tables with filters: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
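
For readers unfamiliar with the filter shape, here is a minimal sketch of the disjunctive normal form as a list of lists of `(column, op, value)` tuples: inner lists are ANDed together and the outer list is ORed. The evaluator `row_matches` and the sample rows are assumptions made for illustration; they are not part of this PR or of PyArrow.

```python
import operator
from typing import Any, Dict, List, Tuple

FilterTuple = Tuple[str, str, Any]

# Comparison operators supported by this illustrative evaluator.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def row_matches(row: Dict[str, Any], dnf: List[List[FilterTuple]]) -> bool:
    """True when at least one inner conjunction is fully satisfied by the row."""
    return any(
        all(_OPS[op](row[col], value) for col, op, value in conjunction)
        for conjunction in dnf
    )


# (x <= 5 AND y == "a") OR (x > 100)
dnf = [[("x", "<=", 5), ("y", "==", "a")], [("x", ">", 100)]]
rows = [{"x": 3, "y": "a"}, {"x": 3, "y": "b"}, {"x": 200, "y": "b"}]
selected = [row for row in rows if row_matches(row, dnf)]
```

A list in this shape is what the `filters` parameter of `pyarrow.parquet.read_table` accepts, so a converted expression could be handed to it directly.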

@Fokko Fokko added this to the Python 0.4.0 release milestone Jan 25, 2023
@Fokko Fokko changed the title from "Python: Add visitor to DNF expr into Dask format" to "Python: Add visitor to DNF expr into Dask/PyArrow format" Jan 25, 2023
@Fokko Fokko force-pushed the fd-format-into-dash-format branch from cb97baa to c7bbe18 Compare January 26, 2023 17:10
        Formatter filter compatible with Dask and PyArrow
    """
    # In the form of expr1 ∨ expr2 ∨ ... ∨ exprN
    return [visit(expression, ExpressionToPlainFormat()) for expression in expressions]
Contributor

Should this reuse the ExpressionToPlainFormat instance?

Contributor Author

@Fokko Fokko Jan 30, 2023

Yes, good call! I've pulled this out of the loop.
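
The change amounts to hoisting the visitor construction out of the comprehension. The sketch below uses a hypothetical `CountingVisitor` with a made-up `convert` method (neither exists in the codebase) purely to show that a single instance serves every expression; it assumes the visitor keeps no per-expression state.

```python
from typing import List


class CountingVisitor:
    """Hypothetical stateless visitor that counts how many instances are built."""

    instances = 0

    def __init__(self) -> None:
        CountingVisitor.instances += 1

    def convert(self, expression: str) -> str:
        # Placeholder for the real conversion work.
        return expression.upper()


def convert_all(expressions: List[str]) -> List[str]:
    visitor = CountingVisitor()  # built once, outside the comprehension
    return [visitor.convert(expression) for expression in expressions]


converted = convert_all(["a <= 1", "b == 2", "c > 3"])
instances_built = CountingVisitor.instances
```

Sharing one instance is only safe while the visitor stays stateless; a visitor that accumulates results between calls would need a fresh instance per expression.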

        return [(term.ref().field.name, "==", float("nan"))]

    def visit_not_nan(self, term: BoundTerm[L]) -> List[Tuple[str, str, Any]]:
        return [(term.ref().field.name, "!=", float("nan"))]
Contributor

NaN is always not equal to itself, at least in Java. Are we sure that this works?
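
The same holds in plain Python, which follows IEEE 754 here. The snippet below only demonstrates Python's own comparison semantics; whether Dask or PyArrow special-case NaN inside a filter tuple is not established by it and would need a test against those libraries.

```python
import math
import operator

nan = float("nan")

# NaN compares unequal to everything, including itself.
print(nan == nan)  # False
print(nan != nan)  # True

values = [1.0, nan, 2.5]

# A plain "==" comparison against NaN selects nothing...
eq_matches = [v for v in values if operator.eq(v, nan)]
# ...and "!=" selects everything, the NaN entry included.
ne_matches = [v for v in values if operator.ne(v, nan)]

# An explicit NaN test is needed to separate the two cases.
nan_only = [v for v in values if math.isnan(v)]
not_nan = [v for v in values if not math.isnan(v)]
```

If a filter engine evaluates `("col", "==", nan)` with ordinary equality, it would behave like `eq_matches` above and return no rows.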

        return [(term.ref().field.name, "==", None)]

    def visit_not_null(self, term: BoundTerm[L]) -> List[Tuple[str, str, Any]]:
        return [(term.ref().field.name, "!=", None)]
Contributor

I don't see anything in the docs that indicates this is the right way to pass this, so we should make sure there are tests for it.

Contributor Author

I agree, and I have #6398 lined up to test exactly this. I'll revive the PR tomorrow.

Contributor

@rdblue rdblue left a comment

Looks good, but it is under-tested until we get the follow-up PR in.

@rdblue rdblue merged commit a76724f into apache:master Jan 27, 2023
@Fokko Fokko deleted the fd-format-into-dash-format branch January 30, 2023 21:19
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023