Unnest changes for moving the filter on right side of correlate to inside the unnest datasource by somu-imply · Pull Request #13934 · apache/druid

somu-imply · 2023-03-14T22:09:56Z

Addresses the following:

Removed the right filter transform, since Calcite has already identified the filter on top of the uncollect that is pushed inside the data source
Rest of the filters on the left behave as earlier
Slight change in logic in filter rewrite as we not have to iterate over filter list to find filters on unnested column
Fix for selector filters by removing the intermediate project above the DruidUnnestRel only when it is doing a CAST or LITERAL
Handling special case for OR filters involving unnested column and regular left data source columns which are planned on top of Correlate

This PR has:

…inside the unnest cursors. Added tests for scenarios such as 1. filter on unnested column which involves a left filter rewrite 2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite 3. not filters 4. SQL functions applied on top of unnested column 5. null present in first row of the column to be unnested

imply-cheddar

I think there are some logic bugs still. Likely, we need a data set that has more rows and matches/skips various different rows based on different conditions.

imply-cheddar · 2023-03-14T22:55:21Z

        final Set<String> requiredColumns = filter.getRequiredColumns();

-        // Run filter post-correlate if it refers to any virtual columns.
+        // Run filter post-unnest if it refers to any virtual columns.


Make this

// Run filter post-unnest if it refers to any virtual columns. This is a conservative judgement call
// that perhaps forces the code to use a ValueMatcher where an index would've been available,
// which can have real performance implications. This is an interim choice made to value correctness
// over performance. When we need to optimize this performance, we should be able to
// create a VirtualColumnDatasource that contains all of the virtual columns, in which case the query
// itself would stop carrying them and everything should be able to be pushed down.

imply-cheddar · 2023-03-14T22:57:41Z

+    if (unnestFilter instanceof AndFilter) {
+      for (Filter filter : ((AndFilter) unnestFilter).getFilters()) {
+        filterSplitter.addPostFilterWithPreFilterIfRewritePossible(filter);
      }
    } else {
-      filterSplitter.add(queryFilter);
+      filterSplitter.addPostFilterWithPreFilterIfRewritePossible(unnestFilter);
    }


Is there a specific reason that we are breaking up the unnestFilter here? Or is this a reflection of a C&P of the logic for the normal filters?

We break it up in the case of AND so that we can find out whether any of the conjuncts can be added to the pre-filters through a rewrite. Adding to the pre-filters takes place inside the addPostFilterWithPreFilterIfRewritePossible method.

This can be done without breaking it up; refactored.
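
The per-conjunct handling discussed in this thread can be sketched as follows. This is a toy model and not the Druid implementation: `Selector`, `Split`, and the column `v0` are hypothetical names, and filters are reduced to "column = value" selectors. Every conjunct stays in the post-unnest list, and a conjunct on the unnested output column additionally yields a rewritten pre-filter on the input column. That extra pre-filter is safe under AND because a base row can only produce a matching unnested value if its input column contains that value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model, not Druid code: filters are simple "column = value" selectors.
final class AndSplitSketch {
  static final class Selector {
    final String column;
    final String value;

    Selector(String column, String value) {
      this.column = column;
      this.value = value;
    }

    @Override
    public String toString() {
      return column + " = '" + value + "'";
    }
  }

  // Result of the split: filters for the base cursor and filters applied after unnesting.
  static final class Split {
    final List<String> pre = new ArrayList<>();
    final List<String> post = new ArrayList<>();
  }

  static Split split(List<Selector> conjuncts, String outputColumn, String inputColumn) {
    Split result = new Split();
    for (Selector conjunct : conjuncts) {
      // Every conjunct is still evaluated after unnesting.
      result.post.add(conjunct.toString());
      // A selector on the unnested output can also be rewritten onto the input column.
      if (conjunct.column.equals(outputColumn)) {
        result.pre.add(new Selector(inputColumn, conjunct.value).toString());
      }
    }
    return result;
  }

  // Example: one conjunct on the unnested column, one on a hypothetical virtual column v0.
  static Split example() {
    return split(
        Arrays.asList(new Selector("j0.unnest", "b"), new Selector("v0", "x")),
        "j0.unnest",
        "dim3"
    );
  }
}
```

In `example()`, only the conjunct on `j0.unnest` produces a pre-filter (on `dim3`); both conjuncts remain in the post-unnest list.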

imply-cheddar · 2023-03-14T23:03:31Z

 * Allows subquery elimination.
 *
- * @see CorrelateFilterRTransposeRule similar, but for right-hand side filters
+ * @see DruidFilterUnnestRule similar, but for right-hand side filters


Is this comment legitimate?

The filter unnest rule does handle the filters on the right, which sit on top of the uncollect, though.

Sorry, is this comment telling me that I should expect the DruidFilterUnnestRule to be the same code as this? I'm not sure why this is telling me to see that other one?

I think I removed it

imply-cheddar · 2023-03-15T07:25:44Z

+    if (unnestFilter != null) {
+      return virtualColumn.equals(that.virtualColumn) && unnestFilter.equals(that.unnestFilter)
+             && base.equals(that.base);
+    } else {
+      return virtualColumn.equals(that.virtualColumn) && base.equals(that.base);
+    }


This is an odd equals implementation. If the current object's unnestFilter is null, then the that object can have any unnest filter it wants?

Generally speaking, I suggest letting IntelliJ generate your equals and hashCode. (I'm assuming that perhaps you adjusted it yourself?)
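
A null-safe comparison can be sketched with `java.util.Objects`. The class below is a hypothetical stand-in (`UnnestSpec` and its `String` fields are assumptions for illustration, not the actual Druid class); it shows the shape that IDE-generated `equals` and `hashCode` typically take when one field is nullable.

```java
import java.util.Objects;

// Hypothetical stand-in for a data source with a nullable filter field.
final class UnnestSpec {
  private final String base;
  private final String virtualColumn;
  private final String unnestFilter; // may be null

  UnnestSpec(String base, String virtualColumn, String unnestFilter) {
    this.base = base;
    this.virtualColumn = virtualColumn;
    this.unnestFilter = unnestFilter;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    UnnestSpec that = (UnnestSpec) o;
    // Objects.equals treats (null, null) as equal and (null, non-null) as unequal,
    // so a null filter on one side no longer matches an arbitrary filter on the other.
    return Objects.equals(base, that.base)
           && Objects.equals(virtualColumn, that.virtualColumn)
           && Objects.equals(unnestFilter, that.unnestFilter);
  }

  @Override
  public int hashCode() {
    // Uses the same fields as equals, so equal objects share a hash code.
    return Objects.hash(base, virtualColumn, unnestFilter);
  }
}
```

With this shape, an instance whose filter is null is equal only to another instance whose filter is also null, and the comparison is symmetric.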

imply-cheddar · 2023-03-15T07:44:06Z

+    List<Filter> preFilterList = new ArrayList<>();
+    List<Filter> postFilterList = new ArrayList<>();


The preFilterList, postFilterList here is very confusing and easy to mix up with the filterSplitter... I'm not sure how to separate it exactly, but I almost thought that these were getting AND'd together because I confused them with the FilterSplitter filters.

imply-cheddar · 2023-03-15T07:45:43Z

+      for (Filter filter : ((OrFilter) queryFilter).getFilters()) {
+        if (filter.getRequiredColumns().contains(outputColumnName)) {
          final Filter newFilter = rewriteFilterOnUnnestColumnIfPossible(filter, inputColumn, inputColumnCapabilites);
          if (newFilter != null) {


You are doing an if/else with a negative condition. Please invert it. If you are doing an if/else, please always make the condition carry as few negations as possible.

imply-cheddar · 2023-03-15T07:46:55Z

-            // any rows that do not match this filter at all.
-            preFilters.add(newFilter);
+            preFilterList.add(newFilter);
+            postFilterList.add(newFilter);


postFilter should be the original filter, I think?

This has been refactored with comments

imply-cheddar · 2023-03-15T07:54:21Z

+            preFilterList.add(newFilter);
+            postFilterList.add(newFilter);
+          } else {
+            postFilterList.add(filter);


If it's not rewritable, then I don't think the OR can be pushed down at all. If you only push down the things that don't refer to the output of the unnest, you run the risk of not including some rows that should've been included. It should be relatively easy to test for this. If there's a query with MV_TO_STRING(unnested_dim, 'a') = 'b' OR other_dim = 'a', I don't think it'll believe that it can remap it. If it pushes down the other_dim = 'a' without attempting to push down the first part of the OR, it will not include a row where nested_dim is ["b"] and other_dim is "c".

Added test cases to mimic all the different cases
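
The hazard described in this thread can be reproduced with a small toy model. Nothing here is Druid code: `Row`, `run`, and the sample data are assumptions made for illustration, and a plain equality on the unnested value stands in for the filter on the unnested column. The sketch compares three strategies for `unnested = 'b' OR other = 'a'`: pushing nothing down, pushing only the branch on the regular column, and pushing an OR in which every branch has been rewritten onto base columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

// Toy model (not Druid code): a base row has a multi-value column and a regular column.
final class OrPushdownSketch {
  static final class Row {
    final List<String> multi;
    final String other;

    Row(List<String> multi, String other) {
      this.multi = multi;
      this.other = other;
    }
  }

  // Apply preFilter to base rows, unnest the survivors, then apply postFilter
  // to each (row, unnested value) pair.
  static List<String> run(List<Row> rows, Predicate<Row> preFilter, BiPredicate<Row, String> postFilter) {
    List<String> out = new ArrayList<>();
    for (Row row : rows) {
      if (preFilter.test(row)) {
        for (String value : row.multi) {
          if (postFilter.test(row, value)) {
            out.add(value);
          }
        }
      }
    }
    return out;
  }

  static List<Row> sampleRows() {
    return Arrays.asList(
        new Row(Arrays.asList("b"), "c"), // matches only through the unnested branch
        new Row(Arrays.asList("x"), "a"), // matches only through the regular branch
        new Row(Arrays.asList("y"), "z")  // matches neither branch
    );
  }

  // The full OR: unnested = 'b' OR other = 'a'.
  static final BiPredicate<Row, String> FULL_OR =
      (row, value) -> "b".equals(value) || "a".equals(row.other);

  // Correct but unoptimized: nothing is pushed down.
  static List<String> noPushdown() {
    return run(sampleRows(), row -> true, FULL_OR);
  }

  // Incorrect: only the branch on the regular column is pushed down.
  static List<String> partialPushdown() {
    return run(sampleRows(), row -> "a".equals(row.other), FULL_OR);
  }

  // Safe only when every branch can be rewritten onto base columns:
  // the pre-filter is an OR over all rewritten branches.
  static List<String> fullRewrite() {
    return run(sampleRows(), row -> row.multi.contains("b") || "a".equals(row.other), FULL_OR);
  }
}
```

`noPushdown()` and `fullRewrite()` both return `[b, x]`, while `partialPushdown()` returns only `[x]`: the row whose multi-value column is `["b"]` and whose regular column is `"c"` is discarded before it is ever unnested.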

imply-cheddar · 2023-03-15T07:55:34Z

        }
      }
    }
+    if (!preFilterList.isEmpty()) {


Another if/else on negative logic. Keep it positive please.

imply-cheddar · 2023-03-15T07:56:34Z

+    }
+
+    if (!postFilterList.isEmpty()) {
+      AndFilter andFilter = new AndFilter(postFilterList);


Why are these AND'd? Didn't they come from an OR?

somu-imply · 2023-03-15T21:13:43Z

Simplified the code for adding filters, added comments on how the rewriting is done and changed some variable names to make things easier to read

imply-cheddar · 2023-03-16T05:28:23Z

+  public void test_unnest_adapters_with_no_base_filter_active_unnest_filter()
+  {
+
+    Sequence<Cursor> cursorSequence = UNNEST_STORAGE_ADAPTER2.makeCursors(
+        null,
+        UNNEST_STORAGE_ADAPTER2.getInterval(),
+        VirtualColumns.EMPTY,
+        Granularities.ALL,
+        false,
+        null
+    );
+
+    cursorSequence.accumulate(null, (accumulated, cursor) -> {
+      ColumnSelectorFactory factory = cursor.getColumnSelectorFactory();
+
+      DimensionSelector dimSelector = factory.makeDimensionSelector(DefaultDimensionSpec.of(OUTPUT_COLUMN_NAME));
+      int count = 0;
+      while (!cursor.isDone()) {
+        Object dimSelectorVal = dimSelector.getObject();
+        if (dimSelectorVal == null) {
+          Assert.assertNull(dimSelectorVal);
+        }
+        cursor.advance();
+        count++;
+      }
+      Assert.assertEquals(1, count);
+      Filter unnestFilter = new SelectorDimFilter(OUTPUT_COLUMN_NAME, "1", null).toFilter();
+      VirtualColumn vc = new ExpressionVirtualColumn(
+          OUTPUT_COLUMN_NAME,
+          "\"" + COLUMNNAME + "\"",
+          null,
+          ExprMacroTable.nil()
+      );
+      final String inputColumn = UNNEST_STORAGE_ADAPTER2.getUnnestInputIfDirectAccess(vc);
+      Pair<Filter, Filter> filterPair = UNNEST_STORAGE_ADAPTER2.computeBaseAndPostUnnestFilters(
+          null,
+          unnestFilter,
+          VirtualColumns.EMPTY,
+          inputColumn,
+          INCREMENTAL_INDEX_STORAGE_ADAPTER.getColumnCapabilities(inputColumn)
+      );
+      SelectorFilter left = ((SelectorFilter) filterPair.lhs);
+      SelectorFilter right = ((SelectorFilter) filterPair.rhs);
+      Assert.assertEquals(inputColumn, left.getDimension());
+      Assert.assertEquals(OUTPUT_COLUMN_NAME, right.getDimension());
+      Assert.assertEquals(right.getValue(), left.getValue());
+      return null;
+    });
+  }
+


The structuring of this test for the purposes of validating that the filters are being managed properly seems a bit off. All of the decisions about what should be pushed down and what shouldn't will have been made when you call makeCursors, and you should be able to see it from there.

Specifically, we want to validate that it chose to push down specific filters and that it attached other filters to a PostJoinCursor.

You can have a TestStorageAdapter that exists just to have makeCursors called on it. You can store the arguments it was called with and then return a Sequence<Cursor> where the Cursor objects are also just empty implementations.

Then, in the test, you can ask your TestStorageAdapter for which filters it was called with and validate that it's the ones you expect to be pushed down. You then take the cursors and validate that they are PostJoinCursor objects, then reach in and grab their filters (or it might need to be the ValueMatchers) and ensure that those filters are what you expect.

My goal is to check that the lhs and rhs of the filter splitter are correct, because those are the parts that get pushed down or attached to the PostJoinCursor, and this test more or less does that. I can also validate that the cursor I get after makeCursors is actually a PostJoinCursor. At the moment we do not have a method on the cursor to get the filter passed to it, so I was calling the method that modifies the filters in order to validate them. Thanks for the feedback; I'll try to refactor the tests based on the suggestions.

Yeah, this test is calling the methods that the implementation should call. Essentially trying to mimic the actual implementation and making sure that it does the right thing. That's a fine test, if you assume that the base implementation will never change outside of these methods. But as soon as someone changes the implementation outside of one of these methods, the tests will no longer validate what we want to validate.

It's better to focus on testing the contract of the object. The contract in this case is that it is given some filters on makeCursors and some of those should be rewritten and passed down (to a delegate makeCursors call) and some of them should be placed on the return cursor for filtering. If you write the test to validate that contract, then it will be a cleaner test that is more helpful in the face of code changes. Fwiw, there are two reasons to have tests

1. To give you confidence that the changes you are making work

2. To give everybody on the project confidence that their changes work

Goal 2 is infinitely more important than number 1 and is best served by testing to contract and not implementation as much as possible.

Makes sense, I'm working on the refactoring
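
A test written against the contract, as suggested in this thread, might look roughly like the sketch below. All types are hypothetical miniatures (`Adapter`, `Cursor`, `RecordingAdapter`, `PostFilterCursor`, `UnnestingAdapter`), not Druid's StorageAdapter, Cursor, or PostJoinCursor, and filters are reduced to strings with a naive textual rewrite purely to keep the example short. The point is the shape: the test double records what was pushed down, and the returned cursor exposes what was kept for post-unnest filtering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// All types here are hypothetical miniatures, not Druid's StorageAdapter or Cursor.
final class ContractTestSketch {
  interface Cursor {
  }

  interface Adapter {
    List<Cursor> makeCursors(String filter);
  }

  // Test double: records the filter it was called with and returns one empty cursor.
  static final class RecordingAdapter implements Adapter {
    final List<String> seenFilters = new ArrayList<>();

    @Override
    public List<Cursor> makeCursors(String filter) {
      seenFilters.add(filter);
      return Collections.<Cursor>singletonList(new Cursor() { });
    }
  }

  // Wrapper cursor that carries the filter to be evaluated after unnesting.
  static final class PostFilterCursor implements Cursor {
    final Cursor delegate;
    final String postFilter;

    PostFilterCursor(Cursor delegate, String postFilter) {
      this.delegate = delegate;
      this.postFilter = postFilter;
    }
  }

  // Object under test: pushes a rewritten filter to the delegate and keeps the
  // original filter on the returned cursor.
  static final class UnnestingAdapter implements Adapter {
    private final Adapter base;
    private final String inputColumn;
    private final String outputColumn;

    UnnestingAdapter(Adapter base, String inputColumn, String outputColumn) {
      this.base = base;
      this.inputColumn = inputColumn;
      this.outputColumn = outputColumn;
    }

    @Override
    public List<Cursor> makeCursors(String filter) {
      // Naive textual rewrite, for illustration only.
      String pushedDown = filter == null ? null : filter.replace(outputColumn, inputColumn);
      List<Cursor> wrapped = new ArrayList<>();
      for (Cursor cursor : base.makeCursors(pushedDown)) {
        wrapped.add(new PostFilterCursor(cursor, filter));
      }
      return wrapped;
    }
  }
}
```

A test would construct `UnnestingAdapter` over a `RecordingAdapter`, call `makeCursors` once, and then assert on two things only: the contents of `seenFilters` and the `postFilter` carried by each returned `PostFilterCursor`. It never calls the internal splitting method directly, so it keeps validating the behavior if that method is renamed or restructured.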

imply-cheddar · 2023-03-16T05:31:33Z

+      final Project rightP = call.rel(0);
+      final SqlKind rightProjectKind = rightP.getChildExps().get(0).getKind();
+      final DruidUnnestRel unnestDatasourceRel = call.rel(1);
+
+      if (rightP.getProjects().size() == 1 && (rightProjectKind == SqlKind.CAST || rightProjectKind == SqlKind.LITERAL)) {
+        call.transformTo(unnestDatasourceRel);
+      }


Should we do the check of whether or not this actually does a rewrite in the matches call instead of the onMatch call?

Yes, will follow up on this and add the matches code.

Added the matches part along with new examples
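
The split between deciding whether a rule applies and performing the rewrite can be sketched as below. This is a self-contained miniature that mirrors the matches/onMatch pattern; `Rule`, `Project`, `Kind`, and `RemoveTrivialProjectRule` are hypothetical stand-ins and do not use Calcite classes. The applicability check (a single projected expression that is a CAST or LITERAL) lives in `matches`, so `onMatch` can transform unconditionally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of a planner rule; it mirrors the matches/onMatch split
// but does not use Calcite classes.
final class RuleSketch {
  enum Kind { CAST, LITERAL, INPUT_REF }

  // A project node: the kinds of its expressions plus a label for its input node.
  static final class Project {
    final List<Kind> exprKinds;
    final String input;

    Project(String input, Kind... exprKinds) {
      this.input = input;
      this.exprKinds = Arrays.asList(exprKinds);
    }
  }

  interface Rule {
    boolean matches(Project project);

    String onMatch(Project project);
  }

  static final class RemoveTrivialProjectRule implements Rule {
    @Override
    public boolean matches(Project project) {
      // Check the size first so that get(0) is only reached for a single expression.
      if (project.exprKinds.size() != 1) {
        return false;
      }
      Kind kind = project.exprKinds.get(0);
      return kind == Kind.CAST || kind == Kind.LITERAL;
    }

    @Override
    public String onMatch(Project project) {
      // No condition here: matches() has already established that the rewrite applies.
      return project.input;
    }
  }

  // Minimal driver: returns the rewritten node, or null when the rule does not fire.
  static String apply(Rule rule, Project project) {
    return rule.matches(project) ? rule.onMatch(project) : null;
  }
}
```

Keeping the condition in `matches` means a rule that does not apply is rejected before any transformation is attempted, rather than firing and silently doing nothing.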

imply-cheddar · 2023-03-16T05:34:18Z

+  public void testUnnestWithMultipleOrFiltersOnUnnestedColumnsAndOnOriginalColumn()
+  {
+    skipVectorize();
+    cannotVectorize();
+    testQuery(
+        "SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3='b' or dim3='d' ",
+        QUERY_CONTEXT_UNNEST,
+        ImmutableList.of(
+            Druids.newScanQueryBuilder()
+                  .dataSource(UnnestDataSource.create(
+                      new TableDataSource(CalciteTests.DATASOURCE3),
+                      expressionVirtualColumn("j0.unnest", "\"dim3\"", ColumnType.STRING),
+                      null
+                  ))
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                  .filters(
+                      or(
+                          selector("j0.unnest", "b", null),
+                          selector("dim3", "d", null)
+                      )
+                  )
+                  .legacy(false)
+                  .context(QUERY_CONTEXT_UNNEST)
+                  .columns(ImmutableList.of("j0.unnest"))
+                  .build()
+        ),
+        ImmutableList.of(
+            new Object[]{"b"},
+            new Object[]{"b"},
+            new Object[]{"d"}
+        )
+    );
+  }


This test has got me wondering what will happen if you run

"SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3='b' or dim3='a' "

Instead. I expect that it should match all of the ["a", "b"] rows and thus be 4 results of a, b, a, b. But I'm curious.

Should be the rows that are a and b. The table has 3 such rows, so would be 3 results a,b,b.

Oh, I thought there were 2 rows with ["a", "b"], my bad... Does it work the same if you invert it? If you say dim3='b' OR d3 = 'a'?

Btw, you might as well add these as tests as well. They might not be testing any new corners, but anything that's like "I wonder what happens if" is worth including.
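
The expected results in this exchange can be checked by hand with a small model. The data is an assumption: three rows whose dim3 values are ["a","b"], ["b","c"], and ["d"], chosen so that the query in the test above yields b, b, d. The sketch also assumes that a comparison on the original multi-value column matches a row when any of its values matches. `UnnestOrSketch` and `query` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy evaluation of "d3 = ? OR dim3 = ?" over an unnested multi-value column.
final class UnnestOrSketch {
  // Assumed sample data: each inner list is one row's multi-value dim3.
  static final List<List<String>> DIM3 = Arrays.asList(
      Arrays.asList("a", "b"),
      Arrays.asList("b", "c"),
      Arrays.asList("d")
  );

  // Evaluates: d3 = unnestedMatch OR dim3 = rowMatch, where the comparison on dim3
  // is assumed to match when any value in the row equals rowMatch.
  static List<String> query(String unnestedMatch, String rowMatch) {
    List<String> out = new ArrayList<>();
    for (List<String> row : DIM3) {
      boolean rowHit = row.contains(rowMatch);
      for (String d3 : row) {
        if (rowHit || d3.equals(unnestedMatch)) {
          out.add(d3);
        }
      }
    }
    return out;
  }
}
```

Under these assumptions, `query("b", "d")` gives `[b, b, d]`, `query("b", "a")` gives `[a, b, b]`, and the inverted form `query("a", "b")` (that is, `dim3 = 'b' OR d3 = 'a'`) gives `[a, b, b, c]`, because both rows containing "b" pass in full.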

somu-imply added 6 commits March 14, 2023 15:05

Not pushing filters in now, will be done if needed later when we migrate the filter inside the data source

586f3ee

Removing stale comments and updating docs

75e8b77

Temp changes for selector filter

dda7872

Handling rules for a case where selector filters adds an extra layer of project before the filter on top of uncollect

3cf5e72

Trying to move filter inside unnest part 1

76aaed1

github-actions Bot added the Area - Documentation label Mar 14, 2023

Merge remote-tracking branch 'upstream/master' into unnest_changes1

dd4529f

github-actions Bot removed the Area - Documentation label Mar 14, 2023

somu-imply added 4 commits March 14, 2023 16:27

Some cleanup after merging with master

6227595

checkstyle fix and 1 test case with selector on virtual column through the project LITERAL path

65a6397

Adding support for OR filters on unnested column and other columns

7e68d4c

Redesigning or filters for unnest, now an or will be optimized by creating a new or on input dimension with the entire filter appearing post unnest also

1bd289b

imply-cheddar reviewed Mar 15, 2023

Refactoring for OR filter case and adding more comments and examples

f930890

somu-imply added 2 commits March 15, 2023 14:23

A change in the comments to be in sync with code

606396d

New test cases with or filters and slight refactoring by removing AndFilter instanceof checks

59ce1ba

somu-imply marked this pull request as ready for review March 16, 2023 01:45

somu-imply added 2 commits March 15, 2023 19:02

Minor nits in comment + new test

9f5b276

More tests now in storage adapter to check filters and improve coverage

3838d6c

imply-cheddar reviewed Mar 16, 2023

somu-imply added 3 commits March 17, 2023 10:26

Using matches in the rule now

b3c02e7

Fixing a NPE

96524df

Refactoring in tests to allow validating the filters pushed down to the base cursor while using unnest with or and regular filters

b446228

somu-imply requested a review from imply-cheddar March 22, 2023 17:34

imply-cheddar approved these changes Mar 23, 2023

cheddar merged commit 2ad133c into apache:master Mar 23, 2023

somu-imply deleted the unnest_changes1 branch March 23, 2023 03:22

somu-imply mentioned this pull request Mar 23, 2023

Now unnest allows bound, in and selector filters on the unnested column #13799

Closed

10 tasks

somu-imply mentioned this pull request Mar 23, 2023

Fixing a data correctness issue in unnest when first row of an MVD is null #13764

Closed

10 tasks

clintropolis added this to the 26.0 milestone Apr 10, 2023

techdocsmith mentioned this pull request Apr 17, 2023

[DRAFT] 26.0.0 release notes #14064

Closed
