-
Notifications
You must be signed in to change notification settings - Fork 3k
[CORE] Add in a NOT_STARTS_WITH operator (including Spark 3.2) #2062
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE] Add in a NOT_STARTS_WITH operator (including Spark 3.2) #2062
Conversation
api/src/main/java/org/apache/iceberg/expressions/BoundLiteralPredicate.java
Show resolved
Hide resolved
kbendick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments on my thoughts about the existing code, as well as why I made some only tangentially related changes. Happy to make any updates requested 🙂
api/src/main/java/org/apache/iceberg/expressions/BoundLiteralPredicate.java
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
Show resolved
Hide resolved
| for (T item : dictionary) { | ||
| if (!item.toString().startsWith(lit.value().toString())) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's one more case where we're using .toString and I wonder if we should be using one of the built in Literal Comparators instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed elsewhere in this PR, we'd like to keep the code similar to the existing semantics. I'm going to resolve this comment to make it easier for others to digest this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be good to follow up with a change to use comparators, but this should be okay for now. It isn't in a tight loop (row groups are usually >= 128MB) and it short-circuits quickly in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving this as a reminder to myself to follow up on this.
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/ExpressionVisitors.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveManifestEvaluator.java
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestPredicateBinding.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/transforms/TestNotStartsWith.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/transforms/TestNotStartsWith.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Show resolved
Hide resolved
51feaeb to
8e1505e
Compare
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Show resolved
Hide resolved
| // Iceberg does not implement SQL 3-boolean logic. Therefore, for all null values, we have decided to | ||
| // return ROWS_MIGHT_MATCH in order to allow the query engine to further evaluate this partition, as | ||
| // null does not start with any non-null value. | ||
| if (fieldStats.containsNull() || fieldStats.lowerBound() == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fieldStats.lowerBound() == null is checked in the if below, so no need to duplicate that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we need the null check here. Null values do not match, but we can't tell whether all the values are null or not from these stats. So all we can do is ignore this and check the bounds.
You should be able to just remove this if block entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, this is correct as I mentioned above in the inclusive metrics evaluator. I'd probably change the comment and remove the part about 3-value logic.
|
|
||
| // notStartsWith will match unless all values must start with the prefix. this happens when the lower and upper | ||
| // bounds both start with the prefix. | ||
| if (lower != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to check both lower and upper here. I'd also move the prefix handling into the block, after we know that fieldStats.lowerBound() and fieldStats.upperBound() are both non-null. I don't think that lit.toByteBuffer() is that expensive, but it seems like a good idea to move it just in case.
| // Allow query engine to make its own decisions regarding SQL 3-valued boolean logic. | ||
| if (dictionary.contains(null)) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary will never contain null, so you can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it.
| Binary lower = colStats.genericGetMin(); | ||
| // notStartsWith will match unless all values must start with the prefix. this happens when the lower and upper | ||
| // bounds both start with the prefix. | ||
| if (lower != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, we may want to validate that both lower and upper are non-null before doing any comparison, but this is very minor.
|
Note: We'll want to add a test in here if the other PR gets approved: https://github.com/apache/iceberg/pull/3757/files |
|
Overall, the tests and implementation all look correct to me. I think there are a few minor things we could do, but I'm ready to commit this. |
I'll backport this to Spark 3.1 and 3.0 after we've merged then. |
|
Thanks, @kbendick! Great to have this in before 0.13.0. |
|
Thanks for the work @kbendick, will test it out once it is released! |
* apache/iceberg#3723 * apache/iceberg#3732 * apache/iceberg#3749 * apache/iceberg#3766 * apache/iceberg#3787 * apache/iceberg#3796 * apache/iceberg#3809 * apache/iceberg#3820 * apache/iceberg#3878 * apache/iceberg#3890 * apache/iceberg#3892 * apache/iceberg#3944 * apache/iceberg#3976 * apache/iceberg#3993 * apache/iceberg#3996 * apache/iceberg#4008 * apache/iceberg#3758 and 3856 * apache/iceberg#3761 * apache/iceberg#2062 * apache/iceberg#3422 * remove restriction related to legacy parquet file list
Adds support for a
NOT_STARTS_WITHoperator and closes #1952.This also ensures that pushdown happens when evaluating Parquet dictionaries as well as Parquet row groups. It also ensures that Spark will push this filter down, which is particularly important for queries to remove string partition columns, especially in the case of the identity partition spec or in the truncation partition spec when truncation length is less than or equal to the notStartsWith predicate term.
I've added quite a number of tests. Admittedly, many of them were added in order to aide my own understanding of the codebase so that I could better contribute in the future. So please feel free to suggest any that should be removed in order to spare CI running time and cut down on potential code rot.
I also added a few tests around
startsWith, which I'd be happy to factor out into their own PR. I'm adding some comments to explain my reasoning for the changes.cc @shardulm94 @RussellSpitzer @rdblue