Skip to content

Refactoring and bug fixes on top of unnest. The allowList now is not passed …#13922

Merged
clintropolis merged 7 commits intoapache:masterfrom
somu-imply:unnest_changes
Mar 14, 2023
Merged

Refactoring and bug fixes on top of unnest. The allowList now is not passed …#13922
clintropolis merged 7 commits intoapache:masterfrom
somu-imply:unnest_changes

Conversation

@somu-imply
Copy link
Copy Markdown
Contributor

@somu-imply somu-imply commented Mar 10, 2023

…inside the unnest cursors. Added tests for scenarios such as

  1. filter on unnested column which involves a left filter rewrite
  2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
  3. not filters
  4. SQL functions applied on top of unnested column
  5. null present in first row of the column to be unnested

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

…inside the unnest cursors. Added tests for scenarios such as

1. filter on unnested column which involves a left filter rewrite
2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
3. not filters
4. SQL functions applied on top of unnested column
5. null present in first row of the column to be unnested
@somu-imply
Copy link
Copy Markdown
Contributor Author

somu-imply commented Mar 10, 2023

Removed a set of tests that involved the allowSet and we now have the ability to pass a filter on the unnest cursor. Added extra tests in CalciteArraysQueryTest

Copy link
Copy Markdown
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a superficial review. LGTM except for a few nits.

advance();
}
}
if (allowFilter != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: reverse polarity.

    if (allowFilter = null) {
      this.valueMatcher = BooleanValueMatcher.of(true);
    } else {
      this.valueMatcher = allowFilter.makeMatcher(getColumnSelectorFactory());
    }

if (match || baseCursor.isDone()) {
return;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

      if (valueMatcher.matches() || baseCursor.isDone()) {

}
}
index = 0;
if (allowFilter != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: reverse priority

retVal,
virtualColumns,
filterPair.rhs
null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't right, since post-unnest filters may refer to virtual columns. So anything from filterPair.rhs that refers to a virtual column needs to be placed here, or else it won't properly see them.

What is the rationale for moving the filter into the UnnestColumnValueSelectorCursor? It seems like the logic there is doing something similar to what PostJoinCursor does: creating a ValueMatcher that wraps the column selector factory. I wonder what's wrong with letting PostJoinCursor do that, which would keep the code in UnnestColumnValueSelectorCursor simpler.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale was that going forward the filter on the right will be available on top of the uncollect and Eric and I were discussing if we should pull it into the UnnestDataSource. I agree that the filter can be on the PostJoinCursor. I was also planning of moving in the virtualColumns into the cursors. If keeping it in PostJoin cursor is simpler and we are doing the same amount of work, I'll be happy moving it back

Copy link
Copy Markdown
Contributor Author

@somu-imply somu-imply Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also with the valueMatcher inside the UnnestDimensionCursor I was thinking about using a value matcher to lazily build a bitset and return getValueCardinality() correctly which currently returns the getValueCardinality of the dimension which is incorrect in presence of a filter on the unnested column

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep it happening in PostJoinCursor for a couple reasons:

  1. It may never be useful to push filters into the unnest cursor, because with rewriteFilterOnUnnestColumnIfPossible we are pushing them even further: all the way to the underlying StorageAdapter.
  2. Even if it does end up being useful to push filters into the unnest cursor, if you aren't planning to do these optimizations immediately, it's IMO better to keep the code simpler.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation that we had talked about would be push the filter down all the way to the StorageAdapter without any figuring out of which filters to maybe push down (Calcite has already figured that out for us by separating the filters on the two sides of the LogicalCorrelate).

The intent of the implementation (maybe it wasn't done that way though) was that, for applying the filter on the read, it would just be a ValueMatcher happening at read-time which can reuse other code like the PostJoinCursor.

The point of attaching the Filter on the UnnestColumn is to make ordering of things more explicit, with the intention that native queries are supposed to be explicit "you told me to do X, so I do X" things.

@@ -310,12 +314,15 @@ private void getNextRow()
private void initialize()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc on this method seems out of date (it refers to allowList).

} else {
postFilters.add(filter);
}
// This is needed as a filter on an MV String Dimension returns the entire row matching the filter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, it's not just about MV string columns. When we support doing this for arrays, the same thing applies to arrays. The pre-unnest filter is an array_contains and the post-unnest filter is a regular =.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above comment is still relevant.

@somu-imply
Copy link
Copy Markdown
Contributor Author

@gianm I have rolled back and have made the code simpler. The only change is to remove the allowList. Please review when you have time

@somu-imply somu-imply changed the title Refactoring and bug fixes on top of unnest. The filter now is passed … Refactoring and bug fixes on top of unnest. The allowList now is not passed … Mar 11, 2023
Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition the line comments, please update querying/datasource.md. It refers to allowList which no longer exists.

);
}
// This is needed at this moment for nested queries
// Future developer would want to move the virtual columns
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would a future developer want to do this? (Not a rhetorical question: I really don't know.) Please add some rationale to the comment so people know what you have in mind.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad removed the stale comments and updated the docs

} else {
postFilters.add(filter);
}
// This is needed as a filter on an MV String Dimension returns the entire row matching the filter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above comment is still relevant.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@imply-cheddar
Copy link
Copy Markdown
Contributor

imply-cheddar commented Mar 14, 2023

Something I don't understand with this structuring of the code. When we look at the actions taken in planning and running these queries we get

  1. SQL is parsed into parse tree and converted to logical DAG
  2. Logical DAG is optimized such that filters are applied to each side of the UNNEST correlate. That is, Calcite figures out which filters apply to the unnested column (rhs of the LogicalCorrelate with the Uncollect) and which filters apply to the base query (lhs of the LogicalCorrelate with the Uncollect)
  3. We have rules that push all of the filters that Calcite already figured out for us such that they are above theLogicalCorrelate.
  4. We build a native query with the filters all meshed together
  5. The native query then has code that figures out, once again, whether some of the filters can be rewritten to be running against the underlying columns

It seems weird to me that we would explicitly undo the thing that Calcite figured out for us so that we can attempt to re-do it in the native query.

I'd propose that we take a Filter object on the UnnestDatasource. The UnnestCursor can pretty easily attempt the re-write and pushdown of that RHS filter and also attach it as a ValueMatcher on the read. This also seems like a much more natural way to plan the query, no?

Is there some reason that we have to throw away the work that Calcite already did for us only to redo it?

@somu-imply
Copy link
Copy Markdown
Contributor Author

somu-imply commented Mar 14, 2023

A query like SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3='b' as Calcite was adding an additional level of project before the filter causing our filter transpose rules to not fire. The calcite plan was

126:LogicalProject(d3=[$17])
  124:LogicalCorrelate(subset=[rel#125:Subset#6.NONE.[]], correlation=[$cor0], joinType=[inner], requiredColumns=[{3}])
    8:LogicalTableScan(subset=[rel#114:Subset#0.NONE.[]], table=[[druid, numfoo]])
    122:LogicalProject(subset=[rel#123:Subset#5.NONE.[]], d3=[CAST('d':VARCHAR):VARCHAR])
      120:LogicalFilter(subset=[rel#121:Subset#4.NONE.[]], condition=[=($0, 'd')])
        118:Uncollect(subset=[rel#119:Subset#3.NONE.[]])
          116:LogicalProject(subset=[rel#117:Subset#2.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
            9:LogicalValues(subset=[rel#115:Subset#1.NONE.[0]], tuples=[[{ 0 }]])

Added an additional rule that removes this Project step (which does a CAST). Additional tests introduced for selector filters on STRING and NUMERIC values after unnest

@somu-imply
Copy link
Copy Markdown
Contributor Author

somu-imply commented Mar 14, 2023

@gianm @cheddar this PR already contains Gian's changes where the filter after rewrite is added to both pre and post filters. Lack of that change is causing queries to give incorrect results. While I work on the changes as suggested by Eric can we get that part merged, if not through this PR but a separate PR ? Or maybe just merge in #13919 to have right results on the master ?

…a layer of project before the filter on top of uncollect"

This reverts commit 112fb54.
Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 going to ignore coverage bot and merge since I will be adding coverage for the missing cases to tests I am in the process of adding to #13803 and I need this since it fixes one of the correctness issues

I vote we do additional refactoring as a follow-up PR

@clintropolis clintropolis merged commit a7ba361 into apache:master Mar 14, 2023
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants