
Conversation

@kbendick
Contributor

Adds support for a NOT_STARTS_WITH operator and closes #1952.

This also ensures that pushdown happens when evaluating Parquet dictionaries as well as Parquet row groups. It also ensures that Spark will push this filter down, which is particularly important for pruning partitions on string columns, especially with the identity partition transform, or with the truncate transform when the truncation length is less than or equal to the length of the notStartsWith predicate term.
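
For illustration, here's a minimal sketch of how the new predicate can be applied through the Java scan API. The table variable, column name, and prefix below are made up, and this assumes notStartsWith mirrors the existing Expressions.startsWith(name, value) signature:

import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;

class NotStartsWithScanExample {
  // Plans a scan that keeps only rows whose "device_id" does not start with "sensor_".
  // Manifests, Parquet row groups, and dictionary pages whose bounds show that every
  // value starts with the prefix can now be skipped during planning and filtering.
  static TableScan planScan(Table table) {
    return table.newScan()
        .filter(Expressions.notStartsWith("device_id", "sensor_"));
  }
}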

I've added quite a number of tests. Admittedly, many of them were added in order to aid my own understanding of the codebase so that I could better contribute in the future. So please feel free to suggest any that should be removed in order to spare CI running time and cut down on potential code rot.

I also added a few tests around startsWith, which I'd be happy to factor out into their own PR. I'm adding some comments to explain my reasoning for the changes.

cc @shardulm94 @RussellSpitzer @rdblue

Contributor Author

@kbendick kbendick left a comment

Left some comments on my thoughts about the existing code, as well as why I made some only tangentially related changes. Happy to make any updates requested 🙂

for (T item : dictionary) {
  if (!item.toString().startsWith(lit.value().toString())) {
    return ROWS_MIGHT_MATCH;
  }
}
Contributor Author

Here's one more case where we're using .toString, and I wonder if we should be using one of the built-in Literal comparators instead.
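
Something like this rough sketch is what I have in mind, assuming Comparators.charSequences() from org.apache.iceberg.types is a reasonable fit here (the class and helper names below are hypothetical and not part of this PR):

import java.util.Comparator;
import org.apache.iceberg.types.Comparators;

class PrefixCheckSketch {
  private static final Comparator<CharSequence> CMP = Comparators.charSequences();

  // Hypothetical helper: checks "item starts with prefix" by comparing char sequences
  // directly instead of round-tripping through toString().startsWith().
  static boolean startsWith(CharSequence item, CharSequence prefix) {
    if (item.length() < prefix.length()) {
      return false; // a shorter value can never start with the prefix
    }
    return CMP.compare(item.subSequence(0, prefix.length()), prefix) == 0;
  }
}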

Contributor Author

As discussed elsewhere in this PR, we'd like to keep this code consistent with the existing semantics. I'm going to resolve this comment to make it easier for others to digest this PR.

Contributor

It would probably be good to follow up with a change to use comparators, but this should be okay for now. It isn't in a tight loop (row groups are usually >= 128MB) and it short-circuits quickly in most cases.

Contributor Author

Unresolving this as a reminder to myself to follow up.

@kbendick kbendick force-pushed the add-notstartswith-operator-and-push-down branch from 51feaeb to 8e1505e on December 20, 2021, 20:40
// Iceberg does not implement SQL 3-boolean logic. Therefore, for all null values, we have decided to
// return ROWS_MIGHT_MATCH in order to allow the query engine to further evaluate this partition, as
// null does not start with any non-null value.
if (fieldStats.containsNull() || fieldStats.lowerBound() == null) {
Contributor

fieldStats.lowerBound() == null is checked in the if below, so no need to duplicate that here.

Contributor

I don't think that we need the null check here. Null values do not match, but we can't tell whether all the values are null or not from these stats. So all we can do is ignore this and check the bounds.

You should be able to just remove this if block entirely.

Contributor

Never mind, this is correct, as I mentioned above in the inclusive metrics evaluator. I'd probably change the comment and remove the part about 3-value logic.


// notStartsWith will match unless all values must start with the prefix. this happens when the lower and upper
// bounds both start with the prefix.
if (lower != null) {
Contributor

We may want to check both lower and upper here. I'd also move the prefix handling into the block, after we know that fieldStats.lowerBound() and fieldStats.upperBound() are both non-null. I don't think that lit.toByteBuffer() is that expensive, but it seems like a good idea to move it just in case.
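
Roughly something like this, as a sketch only (hasPrefix is a hypothetical placeholder for the prefix comparison the evaluator already performs over ByteBuffers):

ByteBuffer lower = fieldStats.lowerBound();
ByteBuffer upper = fieldStats.upperBound();

if (lower != null && upper != null) {
  // only convert the literal once we know both bounds are present
  ByteBuffer prefixAsBytes = lit.toByteBuffer();

  // if both bounds start with the prefix, every value must start with it,
  // so notStartsWith can never be satisfied here
  if (hasPrefix(lower, prefixAsBytes) && hasPrefix(upper, prefixAsBytes)) {
    return ROWS_CANNOT_MATCH;
  }
}

return ROWS_MIGHT_MATCH;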

// Allow query engine to make its own decisions regarding SQL 3-valued boolean logic.
if (dictionary.contains(null)) {
  return ROWS_MIGHT_MATCH;
}
Contributor

The dictionary will never contain null, so you can remove this.

Contributor Author

Removed it.

Binary lower = colStats.genericGetMin();
// notStartsWith will match unless all values must start with the prefix. this happens when the lower and upper
// bounds both start with the prefix.
if (lower != null) {
Contributor

Here as well, we may want to validate that both lower and upper are non-null before doing any comparison, but this is very minor.

@kbendick
Contributor Author

Note: We'll want to add a test in here if the other PR gets approved: https://github.com/apache/iceberg/pull/3757/files

@rdblue
Contributor

rdblue commented Dec 20, 2021

Overall, the tests and implementation all look correct to me. I think there are a few minor things we could do, but I'm ready to commit this.

@kbendick
Contributor Author

> Overall, the tests and implementation all look correct to me. I think there are a few minor things we could do, but I'm ready to commit this.

I'll backport this to Spark 3.1 and 3.0 after we've merged, then.

@kbendick kbendick changed the title from "[CORE] Add in a NOT_STARTS_WITH operator" to "[CORE] Add in a NOT_STARTS_WITH operator (including Spark 3.2)" on Dec 20, 2021
@rdblue rdblue merged commit bf9a227 into apache:master Dec 21, 2021
@rdblue
Contributor

rdblue commented Dec 21, 2021

Thanks, @kbendick! Great to have this in before 0.13.0.

@cccs-eric
Contributor

Thanks for the work @kbendick, will test it out once it is released!


Development

Successfully merging this pull request may close these issues.

Failure evaluating expressions for NOT + STARTS_WITH predicate

6 participants