
Conversation

@kbendick
Contributor

Adds support for a NOT_STARTS_WITH operator and closes #1952.

This also ensures that pushdown happens when evaluating Parquet dictionaries as well as Parquet row groups. It also ensures that Spark will push this filter down, which is particularly important for pruning partitions on string columns, especially with the identity partition transform, or with the truncate transform when the truncation length is less than or equal to the length of the notStartsWith predicate term.
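
For illustration, here's a minimal sketch of how the new predicate can be applied through the Java scan API. The table variable, column name, and prefix below are made up, and this assumes notStartsWith mirrors the existing Expressions.startsWith(name, value) signature:

import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;

class NotStartsWithScanExample {
  // Plans a scan that keeps only rows whose "device_id" does not start with "sensor_".
  // Manifests, Parquet row groups, and dictionary pages whose bounds show that every
  // value starts with the prefix can now be skipped during planning and filtering.
  static TableScan planScan(Table table) {
    return table.newScan()
        .filter(Expressions.notStartsWith("device_id", "sensor_"));
  }
}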

I've added quite a number of tests. Admittedly, many of them were added in order to aid my own understanding of the codebase so that I could better contribute in the future. So please feel free to suggest any that should be removed in order to spare CI running time and cut down on potential code rot.

I also added a few tests around startsWith, which I'd be happy to factor out into their own PR. I'm adding some comments to explain my reasoning for the changes.

cc @shardulm94 @RussellSpitzer @rdblue

Contributor Author

@kbendick kbendick left a comment

Left some comments on my thoughts about the existing code, as well as why I made some only tangentially related changes. Happy to make any updates requested 🙂

for (T item : dictionary) {
  if (!item.toString().startsWith(lit.value().toString())) {
    return ROWS_MIGHT_MATCH;
  }
}
Contributor Author

Here's one more case where we're using .toString, and I wonder if we should be using one of the built-in Literal comparators instead.
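
Something like this rough sketch is what I have in mind, assuming Comparators.charSequences() from org.apache.iceberg.types is a reasonable fit here (the class and helper names below are hypothetical and not part of this PR):

import java.util.Comparator;
import org.apache.iceberg.types.Comparators;

class PrefixCheckSketch {
  private static final Comparator<CharSequence> CMP = Comparators.charSequences();

  // Hypothetical helper: checks "item starts with prefix" by comparing char sequences
  // directly instead of round-tripping through toString().startsWith().
  static boolean startsWith(CharSequence item, CharSequence prefix) {
    if (item.length() < prefix.length()) {
      return false; // a shorter value can never start with the prefix
    }
    return CMP.compare(item.subSequence(0, prefix.length()), prefix) == 0;
  }
}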

Contributor Author

As discussed elsewhere in this PR, we'd like to keep this code consistent with the existing semantics. I'm going to resolve this comment to make it easier for others to digest this PR.

Contributor

It would probably be good to follow up with a change to use comparators, but this should be okay for now. It isn't in a tight loop (row groups are usually >= 128MB) and it short-circuits quickly in most cases.

Contributor Author

Unresolving this as a reminder to myself to follow up.

@kbendick kbendick force-pushed the add-notstartswith-operator-and-push-down branch from 51feaeb to 8e1505e on December 20, 2021, 20:40
// Iceberg does not implement SQL 3-boolean logic. Therefore, for all null values, we have decided to
// return ROWS_MIGHT_MATCH in order to allow the query engine to further evaluate this partition, as
// null does not start with any non-null value.
if (fieldStats.containsNull() || fieldStats.lowerBound() == null) {
Contributor

fieldStats.lowerBound() == null is checked in the if below, so no need to duplicate that here.

Contributor

I don't think that we need the null check here. Null values do not match, but we can't tell whether all the values are null or not from these stats. So all we can do is ignore this and check the bounds.

You should be able to just remove this if block entirely.

Contributor

Never mind, this is correct, as I mentioned above in the inclusive metrics evaluator. I'd probably change the comment and remove the part about 3-value logic.


// notStartsWith will match unless all values must start with the prefix. this happens when the lower and upper
// bounds both start with the prefix.
if (lower != null) {
Contributor

We may want to check both lower and upper here. I'd also move the prefix handling into the block, after we know that fieldStats.lowerBound() and fieldStats.upperBound() are both non-null. I don't think that lit.toByteBuffer() is that expensive, but it seems like a good idea to move it just in case.
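
Roughly something like this, as a sketch only (hasPrefix is a hypothetical placeholder for the prefix comparison the evaluator already performs over ByteBuffers):

ByteBuffer lower = fieldStats.lowerBound();
ByteBuffer upper = fieldStats.upperBound();

if (lower != null && upper != null) {
  // only convert the literal once we know both bounds are present
  ByteBuffer prefixAsBytes = lit.toByteBuffer();

  // if both bounds start with the prefix, every value must start with it,
  // so notStartsWith can never be satisfied here
  if (hasPrefix(lower, prefixAsBytes) && hasPrefix(upper, prefixAsBytes)) {
    return ROWS_CANNOT_MATCH;
  }
}

return ROWS_MIGHT_MATCH;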

// Allow query engine to make its own decisions regarding SQL 3-valued boolean logic.
if (dictionary.contains(null)) {
  return ROWS_MIGHT_MATCH;
}
Contributor

The dictionary will never contain null, so you can remove this.

Contributor Author

Removed it.

Binary lower = colStats.genericGetMin();
// notStartsWith will match unless all values must start with the prefix. this happens when the lower and upper
// bounds both start with the prefix.
if (lower != null) {
Contributor

Here as well, we may want to validate that both lower and upper are non-null before doing any comparison, but this is very minor.

@kbendick
Contributor Author

Note: We'll want to add a test in here if the other PR gets approved: https://github.com/apache/iceberg/pull/3757/files

@rdblue
Contributor

rdblue commented Dec 20, 2021

Overall, the tests and implementation all look correct to me. I think there are a few minor things we could do, but I'm ready to commit this.

@kbendick
Contributor Author

> Overall, the tests and implementation all look correct to me. I think there are a few minor things we could do, but I'm ready to commit this.

I'll backport this to Spark 3.1 and 3.0 after we've merged, then.

@kbendick kbendick changed the title from "[CORE] Add in a NOT_STARTS_WITH operator" to "[CORE] Add in a NOT_STARTS_WITH operator (including Spark 3.2)" on Dec 20, 2021
@rdblue rdblue merged commit bf9a227 into apache:master Dec 21, 2021
@rdblue
Contributor

rdblue commented Dec 21, 2021

Thanks, @kbendick! Great to have this in before 0.13.0.

@cccs-eric
Contributor

Thanks for the work @kbendick, will test it out once it is released!


Development

Successfully merging this pull request may close these issues.

Failure evaluating expressions for NOT + STARTS_WITH predicate

6 participants