Refactoring and bug fixes on top of unnest. The allowList now is not passed …#13922
clintropolis merged 7 commits into apache:master
…inside the unnest cursors. Added tests for scenarios such as:
1. filter on unnested column which involves a left filter rewrite
2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
3. not filters
4. SQL functions applied on top of unnested column
5. null present in first row of the column to be unnested

Removed a set of tests that involved the allowSet and we now have the ability to pass a filter on the unnest cursor. Added extra tests in
paul-rogers left a comment
Did a superficial review. LGTM except for a few nits.
advance();
}
}
if (allowFilter != null) {
Nit: reverse polarity.
if (allowFilter == null) {
  this.valueMatcher = BooleanValueMatcher.of(true);
} else {
  this.valueMatcher = allowFilter.makeMatcher(getColumnSelectorFactory());
}

if (match || baseCursor.isDone()) {
return;
}
}
Nit:
if (valueMatcher.matches() || baseCursor.isDone()) {

}
}
index = 0;
if (allowFilter != null) {

retVal,
virtualColumns,
filterPair.rhs
null
This isn't right, since post-unnest filters may refer to virtual columns. So anything from filterPair.rhs that refers to a virtual column needs to be placed here, or else it won't properly see them.
What is the rationale for moving the filter into the UnnestColumnValueSelectorCursor? It seems like the logic there is doing something similar to what PostJoinCursor does: creating a ValueMatcher that wraps the column selector factory. I wonder what's wrong with letting PostJoinCursor do that, which would keep the code in UnnestColumnValueSelectorCursor simpler.
The rationale was that, going forward, the filter on the right will be available on top of the uncollect, and Eric and I were discussing whether we should pull it into the UnnestDataSource. I agree that the filter can be on the PostJoinCursor. I was also planning on moving the virtualColumns into the cursors. If keeping it in PostJoinCursor is simpler and we are doing the same amount of work, I'll be happy to move it back.
Also, with the valueMatcher inside the UnnestDimensionCursor, I was thinking about using a value matcher to lazily build a bitset and return getValueCardinality() correctly. It currently returns the getValueCardinality of the dimension, which is incorrect in the presence of a filter on the unnested column.
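To make the lazy-bitset idea concrete, here is a minimal, self-contained sketch. It is not Druid code: `FilteredCardinality`, its `dictionary` list, and the `Predicate` standing in for a value matcher are all hypothetical names chosen for illustration. It only shows the general technique of building the set of matching dictionary ids on first use and reporting its size as the cardinality.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in, not a Druid class: a dictionary-encoded column
// whose filtered cardinality is computed lazily on first request.
public class FilteredCardinality {
    private final List<String> dictionary;    // dictionary id -> value
    private final Predicate<String> matcher;  // stands in for a value matcher
    private BitSet matchingIds;               // null until first needed

    public FilteredCardinality(List<String> dictionary, Predicate<String> matcher) {
        this.dictionary = dictionary;
        this.matcher = matcher;
    }

    // Cardinality of the underlying dimension, ignoring the filter.
    public int getDictionaryCardinality() {
        return dictionary.size();
    }

    // Cardinality after the filter: the number of dictionary ids that match.
    public int getValueCardinality() {
        if (matchingIds == null) {
            BitSet ids = new BitSet(dictionary.size());
            for (int id = 0; id < dictionary.size(); id++) {
                if (matcher.test(dictionary.get(id))) {
                    ids.set(id);
                }
            }
            matchingIds = ids;
        }
        return matchingIds.cardinality();
    }
}
```

With a dictionary of four values and a filter that keeps two of them, the unfiltered cardinality stays at four while the filtered one is two, which is the discrepancy described above.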
I would keep it happening in PostJoinCursor for a couple reasons:
- It may never be useful to push filters into the unnest cursor, because with rewriteFilterOnUnnestColumnIfPossible we are pushing them even further: all the way to the underlying StorageAdapter.
- Even if it does end up being useful to push filters into the unnest cursor, if you aren't planning to do these optimizations immediately, it's IMO better to keep the code simpler.
The implementation that we had talked about would be to push the filter down all the way to the StorageAdapter without having to figure out which filters to push down (Calcite has already figured that out for us by separating the filters on the two sides of the LogicalCorrelate).
The intent of the implementation (maybe it wasn't done that way though) was that, for applying the filter on the read, it would just be a ValueMatcher happening at read-time which can reuse other code like the PostJoinCursor.
The point of attaching the Filter on the UnnestColumn is to make ordering of things more explicit, with the intention that native queries are supposed to be explicit "you told me to do X, so I do X" things.
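For readers following the thread, the "ValueMatcher at read time" pattern can be sketched roughly as below. This is an illustrative sketch only and makes no claim about how PostJoinCursor is actually written: `PostFilterCursor`, `drain`, and the use of `Iterator` and `Predicate` in place of a cursor and a value matcher are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not a Druid class: a cursor that wraps a base
// cursor and only stops on positions accepted by the matcher.
public class PostFilterCursor<T> {
    private final Iterator<T> base;      // stands in for the base cursor
    private final Predicate<T> matcher;  // stands in for a value matcher
    private T current;
    private boolean done;

    public PostFilterCursor(Iterator<T> base, Predicate<T> matcher) {
        this.base = base;
        this.matcher = matcher;
        advance(); // position on the first matching element, if any
    }

    public boolean isDone() {
        return done;
    }

    public T get() {
        return current;
    }

    // Skip forward until the matcher accepts an element or the base is exhausted.
    public void advance() {
        while (base.hasNext()) {
            T candidate = base.next();
            if (matcher.test(candidate)) {
                current = candidate;
                return;
            }
        }
        current = null;
        done = true;
    }

    // Convenience helper: collect everything the filtered cursor yields.
    public static <T> List<T> drain(Iterator<T> base, Predicate<T> matcher) {
        PostFilterCursor<T> cursor = new PostFilterCursor<>(base, matcher);
        List<T> out = new ArrayList<>();
        while (!cursor.isDone()) {
            out.add(cursor.get());
            cursor.advance();
        }
        return out;
    }
}
```

The wrapped cursor stays unaware of the filter, which is the simplicity argument made earlier in the thread.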
@@ -310,12 +314,15 @@ private void getNextRow()
private void initialize()
The javadoc on this method seems out of date (it refers to allowList).
} else {
postFilters.add(filter);
}
// This is needed as a filter on an MV String Dimension returns the entire row matching the filter
fwiw, it's not just about MV string columns. When we support doing this for arrays, the same thing applies to arrays. The pre-unnest filter is an array_contains and the post-unnest filter is a regular =.
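A small sketch of why the filter is needed on both sides of the unnest, using plain Java lists instead of Druid segments. `UnnestFilterSketch` and `unnestWhereEquals` are hypothetical names for this example; the row-level `contains` check plays the role of the pre-unnest filter and the per-value `equals` check plays the role of the post-unnest filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Druid code: unnest multi-value rows with a
// row-level pre-filter and an optional value-level post-filter.
public class UnnestFilterSketch {
    public static List<String> unnestWhereEquals(
            List<List<String>> rows, String target, boolean applyPostFilter) {
        List<String> out = new ArrayList<>();
        for (List<String> row : rows) {
            // Pre-unnest filter: keeps the entire row if any element matches.
            if (!row.contains(target)) {
                continue;
            }
            for (String value : row) {
                // Post-unnest filter: keeps only the matching unnested values.
                if (!applyPostFilter || target.equals(value)) {
                    out.add(value);
                }
            }
        }
        return out;
    }
}
```

For rows `[a, b]`, `[b, c]`, `[a]` and target `a`, the pre-filter alone lets `b` through because it rides along in the first row; applying the post-filter as well leaves only the two `a` values.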
Above comment is still relevant.
…ate the filter inside the data source
@gianm I have rolled back and have made the code simpler. The only change is to remove the allowList. Please review when you have time.
gianm left a comment
In addition to the line comments, please update querying/datasource.md. It refers to allowList, which no longer exists.
);
}
// This is needed at this moment for nested queries
// Future developer would want to move the virtual columns
Why would a future developer want to do this? (Not a rhetorical question: I really don't know.) Please add some rationale to the comment so people know what you have in mind.
My bad. Removed the stale comments and updated the docs.
} else {
postFilters.add(filter);
}
// This is needed as a filter on an MV String Dimension returns the entire row matching the filter
Above comment is still relevant.
Something I don't understand with this structuring of the code. When we look at the actions taken in planning and running these queries we get
It seems weird to me that we would explicitly undo the thing that Calcite figured out for us so that we can attempt to re-do it in the native query. I'd propose that we take a Is there some reason that we have to throw away the work that Calcite already did for us only to redo it?
…of project before the filter on top of uncollect
A query like Added an additional rule that removes this Project step (which does a CAST). Additional tests introduced for selector filters on STRING and NUMERIC values after unnest
@gianm @cheddar this PR already contains Gian's changes where the filter after rewrite is added to both pre and post filters. Lack of that change is causing queries to give incorrect results. While I work on the changes suggested by Eric, can we get that part merged, if not through this PR then through a separate PR? Or maybe just merge #13919 to get correct results on master?
clintropolis left a comment
👍 going to ignore coverage bot and merge since I will be adding coverage for the missing cases to tests I am in the process of adding to #13803 and I need this since it fixes one of the correctness issues
I vote we do additional refactoring as a follow-up PR