Refactoring and bug fixes on top of unnest. The allowList now is not passed … by somu-imply · Pull Request #13922 · apache/druid

somu-imply · 2023-03-10T19:26:26Z

…inside the unnest cursors. Added tests for scenarios such as

filter on unnested column which involves a left filter rewrite
filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
not filters
SQL functions applied on top of unnested column
null present in first row of the column to be unnested


somu-imply · 2023-03-10T19:28:09Z

Removed a set of tests that involved the allowSet and we now have the ability to pass a filter on the unnest cursor. Added extra tests in CalciteArraysQueryTest

paul-rogers

Did a superficial review. LGTM except for a few nits.

paul-rogers · 2023-03-10T20:56:14Z

-          advance();
-        }
-      }
+    if (allowFilter != null) {


Nit: reverse polarity.

if (allowFilter == null) { this.valueMatcher = BooleanValueMatcher.of(true); } else { this.valueMatcher = allowFilter.makeMatcher(getColumnSelectorFactory()); }
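
A minimal, self-contained sketch of the suggested shape: handle the `null` case first and fall back to an always-true matcher. The `Matcher` and `RowFilter` interfaces and the `MatcherFactory` class are hypothetical stand-ins for illustration only; they are not Druid's `ValueMatcher`, `Filter`, or `BooleanValueMatcher`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a value matcher: answers whether the current row matches.
interface Matcher {
  boolean matches();
}

// Hypothetical stand-in for a filter that builds a matcher over the current value.
interface RowFilter {
  Matcher makeMatcher(Supplier<String> currentValue);
}

final class MatcherFactory {
  // Shared matcher used when no filter is supplied.
  private static final Matcher ALWAYS_TRUE = () -> true;

  // Null case first: with no filter, every row matches.
  static Matcher create(RowFilter allowFilter, Supplier<String> currentValue) {
    if (allowFilter == null) {
      return ALWAYS_TRUE;
    } else {
      return allowFilter.makeMatcher(currentValue);
    }
  }
}
```

With this shape the caller never has to null-check the matcher before calling `matches()`.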

paul-rogers · 2023-03-10T20:57:54Z

+      if (match || baseCursor.isDone()) {
+        return;
+      }
+    }


Nit:

if (valueMatcher.matches() || baseCursor.isDone()) {

paul-rogers · 2023-03-10T20:58:33Z

-        }
-      }
+    index = 0;
+    if (allowFilter != null) {


Nit: reverse polarity.

gianm · 2023-03-11T00:17:41Z

              retVal,
              virtualColumns,
-              filterPair.rhs
+              null


This isn't right, since post-unnest filters may refer to virtual columns. So anything from filterPair.rhs that refers to a virtual column needs to be placed here, or else it won't properly see them.

What is the rationale for moving the filter into the UnnestColumnValueSelectorCursor? It seems like the logic there is doing something similar to what PostJoinCursor does: creating a ValueMatcher that wraps the column selector factory. I wonder what's wrong with letting PostJoinCursor do that, which would keep the code in UnnestColumnValueSelectorCursor simpler.

The rationale was that, going forward, the filter on the right will be available on top of the uncollect, and Eric and I were discussing whether we should pull it into the UnnestDataSource. I agree that the filter can be on the PostJoinCursor. I was also planning on moving the virtualColumns into the cursors. If keeping it in the PostJoinCursor is simpler and we are doing the same amount of work, I'll be happy to move it back.

Also, with the valueMatcher inside the UnnestDimensionCursor, I was thinking about using a value matcher to lazily build a bitset and return getValueCardinality() correctly. It currently returns the getValueCardinality of the dimension, which is incorrect in the presence of a filter on the unnested column.

I would keep it happening in PostJoinCursor for a couple reasons:

It may never be useful to push filters into the unnest cursor, because with rewriteFilterOnUnnestColumnIfPossible we are pushing them even further: all the way to the underlying StorageAdapter.

Even if it does end up being useful to push filters into the unnest cursor, if you aren't planning to do these optimizations immediately, it's IMO better to keep the code simpler.

The implementation that we had talked about would be to push the filter down all the way to the StorageAdapter without having to figure out which filters to push down (Calcite has already figured that out for us by separating the filters on the two sides of the LogicalCorrelate).

The intent of the implementation (maybe it wasn't done that way though) was that, for applying the filter on the read, it would just be a ValueMatcher happening at read-time which can reuse other code like the PostJoinCursor.

The point of attaching the Filter on the UnnestColumn is to make ordering of things more explicit, with the intention that native queries are supposed to be explicit "you told me to do X, so I do X" things.
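
The read-time filtering discussed in this thread (a matcher wrapped around a base cursor, advancing until a row matches or the base is done) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `PostJoinCursor` or `UnnestColumnValueSelectorCursor`: `SimpleCursor`, `ListCursor`, and `FilteringCursor` are hypothetical names, and a plain `Predicate<String>` stands in for a `ValueMatcher`.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical minimal cursor over string values; not Druid's Cursor interface.
interface SimpleCursor {
  String current();
  void advance();
  boolean isDone();
}

// Base cursor that walks an in-memory list.
final class ListCursor implements SimpleCursor {
  private final List<String> values;
  private int index = 0;

  ListCursor(List<String> values) {
    this.values = values;
  }

  public String current() { return values.get(index); }
  public void advance() { index++; }
  public boolean isDone() { return index >= values.size(); }
}

// Wraps a base cursor and exposes only the rows that pass the matcher.
final class FilteringCursor implements SimpleCursor {
  private final SimpleCursor base;
  private final Predicate<String> matcher;

  FilteringCursor(SimpleCursor base, Predicate<String> matcher) {
    this.base = base;
    this.matcher = matcher;
    skipToMatch(); // position on the first matching row, if any
  }

  public String current() { return base.current(); }

  public void advance() {
    base.advance();
    skipToMatch();
  }

  public boolean isDone() { return base.isDone(); }

  // Advance the base cursor until it is done or the current row matches.
  private void skipToMatch() {
    while (!base.isDone() && !matcher.test(base.current())) {
      base.advance();
    }
  }
}
```

Keeping the filter in a wrapper like this leaves the wrapped cursor free of filter logic, which is the simplicity argument made above for leaving it in the PostJoinCursor.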

gianm · 2023-03-11T00:18:49Z

@@ -310,12 +314,15 @@ private void getNextRow()
  private void initialize()


The javadoc on this method seems out of date (it refers to allowList).

gianm · 2023-03-11T00:22:08Z

-          } else {
-            postFilters.add(filter);
          }
+          // This is needed as a filter on an MV String Dimension returns the entire row matching the filter


fwiw, it's not just about MV string columns. When we support doing this for arrays, the same thing applies to arrays. The pre-unnest filter is an array_contains and the post-unnest filter is a regular =.

Above comment is still relevant.
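
To illustrate why the rewritten filter has to be applied on both sides, here is a small sketch over in-memory lists. It assumes each row's multi-value column is modeled as a `List<String>`; `FilterTwiceSketch` and its methods are hypothetical names and do not correspond to Druid classes.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterTwiceSketch {
  // Pre-unnest filter: keeps whole rows whose multi-value column contains the target
  // (the "contains"-style filter pushed down to the base).
  static List<List<String>> preFilter(List<List<String>> rows, String target) {
    List<List<String>> kept = new ArrayList<>();
    for (List<String> row : rows) {
      if (row.contains(target)) {
        kept.add(row);
      }
    }
    return kept;
  }

  // Unnest: emits one output value per element of each row.
  static List<String> unnest(List<List<String>> rows) {
    List<String> out = new ArrayList<>();
    for (List<String> row : rows) {
      out.addAll(row);
    }
    return out;
  }

  // Post-unnest filter: a regular equality check on each unnested value.
  static List<String> postFilter(List<String> values, String target) {
    List<String> out = new ArrayList<>();
    for (String value : values) {
      if (target.equals(value)) {
        out.add(value);
      }
    }
    return out;
  }

  // Both filters together: prune rows early, then drop non-matching elements.
  static List<String> query(List<List<String>> rows, String target) {
    return postFilter(unnest(preFilter(rows, target)), target);
  }
}
```

The pre-filter alone returns every element of the matching rows, including values other than the target, so the equality check after the unnest is still required for correct results.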

somu-imply · 2023-03-11T01:49:08Z

@gianm I have rolled back and have made the code simpler. The only change is to remove the allowList. Please review when you have time

gianm

In addition to the line comments, please update querying/datasource.md. It refers to allowList, which no longer exists.

gianm · 2023-03-13T18:25:10Z

            );
          }
+          // This is needed at this moment for nested queries
+          // Future developer would want to move the virtual columns


Why would a future developer want to do this? (Not a rhetorical question: I really don't know.) Please add some rationale to the comment so people know what you have in mind.

My bad. Removed the stale comments and updated the docs.

gianm · 2023-03-13T18:25:32Z

-          } else {
-            postFilters.add(filter);
          }
+          // This is needed as a filter on an MV String Dimension returns the entire row matching the filter


Above comment is still relevant.

gianm

LGTM

imply-cheddar · 2023-03-14T02:12:43Z

Something I don't understand with this structuring of the code. When we look at the actions taken in planning and running these queries we get

1. SQL is parsed into a parse tree and converted to a logical DAG.
2. The logical DAG is optimized such that filters are applied to each side of the UNNEST correlate. That is, Calcite figures out which filters apply to the unnested column (rhs of the LogicalCorrelate with the Uncollect) and which filters apply to the base query (lhs of the LogicalCorrelate with the Uncollect).
3. We have rules that push all of the filters that Calcite already figured out for us such that they are above the LogicalCorrelate.
4. We build a native query with the filters all meshed together.
5. The native query then has code that figures out, once again, whether some of the filters can be rewritten to run against the underlying columns.

It seems weird to me that we would explicitly undo the thing that Calcite figured out for us so that we can attempt to re-do it in the native query.

I'd propose that we take a Filter object on the UnnestDatasource. The UnnestCursor can pretty easily attempt the re-write and pushdown of that RHS filter and also attach it as a ValueMatcher on the read. This also seems like a much more natural way to plan the query, no?

Is there some reason that we have to throw away the work that Calcite already did for us only to redo it?

somu-imply · 2023-03-14T05:40:22Z

For a query like SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3='b', Calcite was adding an additional level of project before the filter, causing our filter transpose rules to not fire. The Calcite plan was:

126:LogicalProject(d3=[$17])
  124:LogicalCorrelate(subset=[rel#125:Subset#6.NONE.[]], correlation=[$cor0], joinType=[inner], requiredColumns=[{3}])
    8:LogicalTableScan(subset=[rel#114:Subset#0.NONE.[]], table=[[druid, numfoo]])
    122:LogicalProject(subset=[rel#123:Subset#5.NONE.[]], d3=[CAST('d':VARCHAR):VARCHAR])
      120:LogicalFilter(subset=[rel#121:Subset#4.NONE.[]], condition=[=($0, 'd')])
        118:Uncollect(subset=[rel#119:Subset#3.NONE.[]])
          116:LogicalProject(subset=[rel#117:Subset#2.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
            9:LogicalValues(subset=[rel#115:Subset#1.NONE.[0]], tuples=[[{ 0 }]])

Added an additional rule that removes this Project step (which does a CAST). Additional tests introduced for selector filters on STRING and NUMERIC values after unnest

somu-imply · 2023-03-14T17:55:16Z

@gianm @cheddar this PR already contains Gian's changes where the filter after rewrite is added to both pre and post filters. The lack of that change is causing queries to give incorrect results. While I work on the changes suggested by Eric, can we get that part merged, if not through this PR then through a separate PR? Or maybe just merge #13919 to have the right results on master?

clintropolis

👍 going to ignore coverage bot and merge since I will be adding coverage for the missing cases to tests I am in the process of adding to #13803 and I need this since it fixes one of the correctness issues

I vote we do additional refactoring as a follow-up PR

paul-rogers approved these changes Mar 10, 2023

gianm mentioned this pull request Mar 10, 2023

Fix filter pushdown on unnest column: need to filter twice. #13919

Closed

gianm reviewed Mar 11, 2023

Not pushing filters in now, will be done if needed later when we migr…

cf0f6a2

…ate the filter inside the data source

somu-imply changed the title ~~Refactoring and bug fixes on top of unnest. The filter now is passed …~~ Refactoring and bug fixes on top of unnest. The allowList now is not passed … Mar 11, 2023

gianm reviewed Mar 13, 2023

Removing stale comments and updating docs

a0309f6

github-actions Bot added the Area - Documentation label Mar 13, 2023

gianm approved these changes Mar 13, 2023

somu-imply added 2 commits March 13, 2023 21:32

Temp changes for selector filter

06f7cfb

Handling rules for a case where selector filters adds an extra layer …

112fb54

…of project before the filter on top of uncollect

github-advanced-security AI found potential problems Mar 14, 2023

Comment thread .../main/java/org/apache/druid/sql/calcite/rule/CorrelateProjectOnFIlterRightTransposeRule.java Fixed

somu-imply added 2 commits March 14, 2023 13:22

Revert "Handling rules for a case where selector filters adds an extr…

8919043

…a layer of project before the filter on top of uncollect" This reverts commit 112fb54.

Revert "Temp changes for selector filter"

10b9776

This reverts commit 06f7cfb.

clintropolis approved these changes Mar 14, 2023

clintropolis merged commit a7ba361 into apache:master Mar 14, 2023

clintropolis added this to the 26.0 milestone Apr 10, 2023

techdocsmith mentioned this pull request Apr 17, 2023

[DRAFT] 26.0.0 release notes #14064

Closed
