Now unnest allows bound, in and selector filters on the unnested column #13799

Closed
somu-imply wants to merge 17 commits into apache:master from somu-imply:filtersOnUnnest

Conversation

@somu-imply
Contributor

@somu-imply somu-imply commented Feb 14, 2023

Previously, even in the native query, Druid did not support filtering on the unnested column. This PR introduces the following:

  1. Added an allowFilter to use value matchers and advance cursors accordingly
  2. Removed the allowList, since the same behavior can be supported with filters

Unit tests have been added to cover the following filters on the unnested column, for both actual dimensions and virtual columns:

  1. Selector filters
  2. Bound filters
  3. In filters

Unit tests have also been added for cases where multiple filters on the unnested dimension and on other dimensions of the input table are used together.

TBD: Pushing the filter on the unnested column as a filter on the dimension to be unnested on the left data source.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply somu-imply marked this pull request as ready for review February 21, 2023 17:45
@somu-imply somu-imply marked this pull request as draft February 21, 2023 18:26
@somu-imply somu-imply marked this pull request as ready for review February 22, 2023 02:47
@somu-imply
Contributor Author

The filter on the unnested dimension or dimensions (in the case of an expression using multiple columns) also needs to be pushed down to the data source to limit the number of rows the cursor goes over. This will be done in a follow-up PR.

@somu-imply
Contributor Author

@317brian @techdocsmith the docs check is being flaky with

> spellcheck
> mdspell --en-us --ignore-numbers --report '../docs/**/*.md' || (./script/notify-spellcheck-issues && false)

    ../docs/tutorials/tutorial-jupyter-index.md
       41 | Jupyter both try to use port `8888`, so start Jupyter on a different por 

>> 1 spelling error found in 227 files

The document has `port`, not `por`, so this error should not be thrown. I tried running spellcheck locally and it passes.

} else {
return;
}
}
Contributor

Suggestion:

  while (true) {
    boolean match = valueMatcher.matches();
    advanceAndUpdate();
    if (match || baseCursor.isDone()) {
        return;
    }
  } 

Contributor Author

Thanks, this should be:

while (true) {
  advanceAndUpdate();
  boolean match = valueMatcher.matches();
  if (match || baseCursor.isDone()) {
    return;
  }
}

The match changes after the update, so the matches() call should come after advanceAndUpdate(). I'll add it.

// If the first value the index is pointing to does not match the filter
// advance the index to the first value which will match
if (!valueMatcher.matches()) {
advance();
Contributor

Redundant. advanceUninterruptibly() already does the needed checks.

this.needInitialization = true;
this.allowSet = allowSet;
this.allowedBitSet = new BitSet();
this.allowFilter = allowFilter;
Contributor

Here we allow a null filter. In the code above, we provided an "accept all" filter in this case. Should we do so here to avoid if null checks?

Contributor Author

I did not understand this comment; can you please elaborate? On another note, I'll remove the bitset; it is not needed anymore.

Contributor

Paul's comment is asking you to re-evaluate if you need to store the reference to the Filter. In the ColumnValue version of the cursor, you do not store the reference to the Filter and instead generate a ValueMatcher that matches everything. But this code is keeping the reference, it would appear so that it can do an instanceof check.

I think the answer to Paul's comment is that you still need to do the logic to add the BitSet back to this implementation and, once you do that, you won't keep the reference to the Filter object and things will clean up a bit.

Contributor Author
@somu-imply somu-imply Mar 6, 2023

Yes, I am working on the logic to add the bitset back. This will be a one-time operation where I iterate from 0 to baseDimensionSelector.getValueCardinality(), check whether each dictionary value matches the filter, and add it to the bitset if it matches.
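That one-time scan could look roughly like this (a minimal sketch, not Druid's actual classes; the `IntPredicate` stands in for applying the filter's ValueMatcher to the value at a given dictionary id):

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

public class AllowBitSetBuilder {
    // One-time scan over the dictionary: mark every dictionary id whose value
    // passes the filter. `matchesId` is a stand-in for applying the filter's
    // ValueMatcher to the value at a given dictionary id.
    public static BitSet build(int valueCardinality, IntPredicate matchesId) {
        BitSet allowed = new BitSet(valueCardinality);
        for (int id = 0; id < valueCardinality; id++) {
            if (matchesId.test(id)) {
                allowed.set(id);
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Toy dictionary of 3 values; keep only even dictionary ids.
        System.out.println(build(3, id -> id % 2 == 0)); // {0, 2}
    }
}
```

After this, cursor advancement only needs a bitset lookup per dictionary id rather than re-evaluating the filter per row.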

if (allowFilter instanceof InDimFilter) {
return ((InDimFilter) allowFilter).getValues().size();
} else if (allowFilter instanceof AndFilter) {
return ((AndFilter) allowFilter).getFilters().size();
Contributor

Is this right? For an in filter, we know that the value has to be one of those in the list. But, for an and, I can have anything, can't I? x > 0 AND x < 1000 would have a cardinality of 1000 for a long type.

Contributor Author

Valid point, we can only estimate it correctly for in filters (the number of values in the in filter) and selector filters (always 1). This can get complicated for a filter such as d3='a' OR d3='b', where the cardinality will be equal to the number of filters. I think Calcite would change that to d3 IN (a,b), so we should only handle the 2 cases.
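That two-case estimate could be sketched as follows (hypothetical stand-in types, not Druid's actual filter classes: a non-null `inValues` models an InDimFilter, a non-null `selectorValue` models a selector filter):

```java
import java.util.Set;

public class AllowFilterCardinality {
    public static final int UNKNOWN = -1;

    // Only IN filters (size of the value set) and selector filters (exactly 1)
    // admit a reliable cardinality estimate; anything else (bounds, arbitrary
    // ANDs) is unknown.
    public static int estimate(Set<String> inValues, String selectorValue) {
        if (inValues != null) {
            return inValues.size(); // IN: at most one match per listed value
        }
        if (selectorValue != null) {
            return 1;               // selector: exactly one value
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(estimate(Set.of("a", "b"), null)); // 2
        System.out.println(estimate(null, "a"));              // 1
        System.out.println(estimate(null, null));             // -1
    }
}
```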

Contributor

We cannot eliminate the bitset here. You need to be producing the bitset by applying the value matcher to the dictionary of values.

return;
}
}
}
Contributor

See suggestion above.

// if there is a filter as d2 > 1 and unnest-d2 < 10
// Calcite would push the filter on d2 into the data source
// and only the filter on unnest-d2 < 10 will appear here
forBaseFilter = filter.getRequiredColumns().contains(outputColumnName) ? null : filter;
Contributor

Nit:

final Filter forBaseFilter = filter == null || filter.getRequiredColumns().contains(outputColumnName) ? null : filter;

Contributor

This code seems to assume a single filter. In the example, unnest-d2 < 10. But, what if my condition is unnest-d2 < 10 AND unnest_d2 > 5. Will this code work? And if only the unnest-d2 appears here, do we need to check the output column name?

To help understand this, it would be good to give a bit more of the query. What, exactly, is d2 here?

Contributor Author

I'll add more comments. d2 is the column to be unnested. What I mean is that all filters on columns that exist in the datasource are pushed down to the data source; only the filters on the unnested column appear here.

Member

The filter passed in here is all of the filters that have been pushed to the segment, including filters on the unnest column and other columns

It seems like you could potentially decompose the filter to partially push down anything that is not filtering on the output column name into the base cursor, so that the query can still use indexes and such. Otherwise the implication here is that if any filter is on the unnest column, we aren't going to use any indexes, since we aren't pushing the filter down to the base cursor.

QueryableIndexCursorSequenceBuilder does something like this with FilterAnalysis.analyzeFilter to partition the Filter into 'pre' and 'post' filters; the former have bitmap indexes and go into computing the underlying Offset, and the latter do the value matcher thing we are doing here.
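The pre/post split described above could be sketched roughly like this. This is an illustration only, not Druid's actual FilterAnalysis code: each conjunct of an AND filter is modeled by its set of required columns, mirroring what Filter#getRequiredColumns() reports.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FilterPartitioner {
    // Partition the conjuncts of an AND filter: anything that does not touch
    // the unnested output column can be pushed into the base cursor (and use
    // indexes); anything that does must stay as a post (value-matcher) filter.
    public static Map<String, List<Set<String>>> partition(
            List<Set<String>> conjunctRequiredColumns,
            String outputColumnName
    ) {
        List<Set<String>> base = new ArrayList<>(); // push down to base cursor
        List<Set<String>> post = new ArrayList<>(); // apply via value matcher
        for (Set<String> required : conjunctRequiredColumns) {
            if (required.contains(outputColumnName)) {
                post.add(required);
            } else {
                base.add(required);
            }
        }
        Map<String, List<Set<String>>> parts = new HashMap<>();
        parts.put("base", base);
        parts.put("post", post);
        return parts;
    }
}
```

For example, with a filter like d2 > 1 AND unnest-d2 < 10, the d2 conjunct lands in the base partition while the unnest-d2 conjunct stays post.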

dimensionToUnnest,
outputColumnName,
allowSet
filter
Contributor

Above, we may have passed the filter into the "base". If we did, do we want to reapply the same filter here? Or, should the above code actually be:

    final Filter forBaseFilter;
    final Filter localFilter;
    if (filter == null || filter.getRequiredColumns().contains(outputColumnName)) {
      forBaseFilter = null;
      localFilter = filter;
    } else {
      forBaseFilter = filter;
      localFilter = null;
    }

Then pass localFilter where we're passing filter?

Contributor Author

The filter that appears here will be on the unnested column only. For example, if you are unnesting dim3 to ud3, any filter on ud3 will appear here. The filter on ud3 cannot be pushed into the base, but can only be sent to the unnest cursor as an allow filter. There is scope for optimization here to rewrite the same filter on dim3 and push it to the base.

"vc",
QueryRunnerTestHelper.PLACEMENTISH_DIMENSION_UNNEST,
null
QueryRunnerTestHelper.PLACEMENTISH_DIMENSION_UNNEST
Contributor

Sorry, what is a PLACEMENTISH?

Contributor Author

It is a dimension in our test setup that is of the format placement,<some_string>

Assert.assertEquals(k, 10);
}

@Test
Contributor

Should these tests be retained, but updated to whatever the new format is? Presumably these addressed some earlier concern, and we should ensure that they still work. Or, are there new tests that cover the same cases that the deleted tests cover?

Contributor Author

There are new tests that cover these. They are added in the CalciteArraysQueryTest

Member

ideally there should be tests added here too... I would expect tests here to be checking stuff like the cursor spits out the expected filtered number of rows, etc.

I guess the storage adapter test partially covers this, but the more tests the better

"dummy",
OUTPUT_NAME,
IGNORE_SET
null
Contributor

Do these tests cover a wide range of predicates (filters) on the unnested value? I worry that the code above handles only simple cases: unnested_value op constant. Do they also handle complex cases such as fn(unnested_value) > some_other_col? fn(unnested_value) > 10 AND unnested_value * 3 > 12? It would be good to have tests that exercise these non-trivial cases.

@somu-imply
Contributor Author

The one failing test here will be resolved once #13860 gets merged

skipVectorize();
cannotVectorize();
testQuery(
"SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3 IN ('d','c') ",
Member

to clarify, is the d3 in these tests referring to unnested.d3 or druid.numfoo.d3? It seems maybe a bit confusing to the casual reader to name the unnested thing the same as the underlying column that was transformed with MV_TO_ARRAY (though it may be nice to intentionally cover this in at least one test, with a comment clarifying that the confusing alias is actually filtering on the unnested output column and not d3).

Also, should there be tests filtering on d3 where the native query plans to filter on d3 directly instead of on the unnested column, similar to the tests with filters directly on the numeric array elements in other tests? That said, that doesn't really go through the new code paths here, so it is more of a bonus coverage.

Member

oh, i got confused from all the repetitive patterns in this query, d3 isn't dim3, another instance of me not being able to read 😭

ScanQueryRunnerTest.verify(ascendingExpectedResults, results);
}

@Test
Member

this should probably be replaced with scan queries which filter on the unnested output column name



@Test
public void test_unnest_adapters_with_allowList()
public void test_unnest_adapters_basic_with_in_filter()
Member

imo please add more tests here, including tests with filters on other columns if possible, since the unnest cursor wraps a base cursor, just to make sure no funny stuff is happening.

Contributor

If there is a test suite for "normal" cursor stuff, it would be great to be able to plug this into one of those. Those should all be able to pass without any errors as long as they aren't referencing the actual unnested column.

I haven't looked to see how easy it will be to reuse that test suite, but it's worth it to explore what can be done to be able to reuse it.

skipVectorize();
cannotVectorize();
testQuery(
"SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3 IN ('foo','bar') ",
Member

please add a test where there are filters on other columns too (a mix of filter on unnested column and filter on some other column)

Contributor Author

There is one already: SELECT d12 FROM druid.numfoo, UNNEST(ARRAY[m1, m2]) as unnested (d12) where d12 IN ('1','2') AND m1 < 10

Comment on lines +232 to +240
} else if (partialQuery.getUnnestFilter() != null) {
filter = Preconditions.checkNotNull(
computeUnnestFilter(
partialQuery,
plannerContext,
sourceRowSignature,
virtualColumnRegistry
)
);
Member

can you not have both a where filter and an unnest filter? is an unnest filter basically a where filter that got picked up somewhere else?

Contributor Author

Yes, an unnest filter is the same as a where or having filter that gets picked up somewhere else.

@somu-imply
Contributor Author

somu-imply commented Feb 28, 2023

It seems like you could potentially decompose the filter to partially push down anything that is not filtering on the output column name into the base cursor so that the query can still use indexes and such, since otherwise the implication here is that if any filter is on the unnest column then it means we aren't going to use any indexes since we aren't pushing the filter down to the base cursor.

Calcite is smart enough to push the filter on an existing dimension to the underlying datasource. For example, a query like

with t as (select * from druid.numfoo,UNNEST(MV_TO_ARRAY(dim3)) as foo(d3))
select d3 from t where d3 >1 AND dim2 IN ('a','b')

Calcite plans it as

157:LogicalProject(d3=[$17])
  155:LogicalCorrelate(subset=[rel#156:Subset#7.NONE.[]], correlation=[$cor0], joinType=[inner], requiredColumns=[{3}])
    144:LogicalFilter(subset=[rel#145:Subset#1.NONE.[]], condition=[OR(=($2, 'a'), =($2, 'b'))])
      9:LogicalTableScan(subset=[rel#143:Subset#0.NONE.[]], table=[[druid, numfoo]])
    151:LogicalFilter(subset=[rel#152:Subset#5.NONE.[]], condition=[>(CAST($0):INTEGER, 1)])
      149:Uncollect(subset=[rel#150:Subset#4.NONE.[]])
        147:LogicalProject(subset=[rel#148:Subset#3.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
          10:LogicalValues(subset=[rel#146:Subset#2.NONE.[0]], tuples=[[{ 0 }]])

Contributor
@imply-cheddar imply-cheddar left a comment

The way that this is planning is going to create very sub-optimal runs and limitations on query running that shouldn't exist.

The code strategy should've been to replace the allow list with a Filter object and then work with that. We cannot merge this code as is; it must change to replace the allow list with a Filter object and then apply that.

}
} else {
if (index >= indexedIntsForCurrentRow.size() - 1) {
if (indexForRow >= indexedIntsForCurrentRow.size() - 1) {
Contributor

IndexedInts is a class in the Druid code base. It's logically equivalent to an int[]. The values are references to the dictionary. There is some Javadoc that attempts to explain these things on interfaces like DimensionSelector and DimensionDictionarySelector.

Comment on lines 58 to 56
String columnName,
String outputName,
LinkedHashSet<String> allowList
String outputName
)
Contributor

I had expected you to change the allowList to a Filter but keep it on this object. It looks like you are assuming that the filter passed into the call to make the cursor is good enough. In the "best" case, you are planning things to attach the filter on the unnested thing to the other filters being included, in which case, as Clint mentions, the entire filter is moving to a value matcher with absolutely nothing being pushed down. In the worst case, some of the filters are being lost in planning and incorrect results can occur. I haven't validated which one of these is likely occurring yet.

Given that Calcite very clearly puts the filter on the UNCOLLECT, you should be able to just attach a Filter object here instead of putting it on the query as a whole. The logic inside of this should remain more or less the same as what it was before with the allow list, just, you are applying a value predicate to the dictionaries instead of building a bitset.

The filter used to make the cursor should continue to be passed down (though it should also have the filter that is attached here adjusted and pushed down as well).

Comment on lines +3460 to +3522
public void testUnnestWithFilteringOnUnnestedVirtualCol()
{
skipVectorize();
cannotVectorize();
testQuery(
"SELECT d12 FROM druid.numfoo, UNNEST(ARRAY[m1, m2]) as unnested (d12) where d12 IN ('1','2') AND m1 < 10",
ImmutableList.of(
Druids.newScanQueryBuilder()
.dataSource(UnnestDataSource.create(
new QueryDataSource(
newScanQueryBuilder()
.dataSource(
new TableDataSource(CalciteTests.DATASOURCE3)
)
.intervals(querySegmentSpec(Filtration.eternity()))
.resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
.legacy(false)
.filters(bound("m1", null, "10", false, true, null, StringComparators.NUMERIC))
.columns(
"__time",
"cnt",
"d1",
"d2",
"dim1",
"dim2",
"dim3",
"dim4",
"dim5",
"dim6",
"f1",
"f2",
"l1",
"l2",
"m1",
"m2",
"unique_dim1"
)
.context(QUERY_CONTEXT_DEFAULT)
.build()
),
"v0",
"EXPR$0"
))
.intervals(querySegmentSpec(Filtration.eternity()))
.resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
.legacy(false)
.context(QUERY_CONTEXT_DEFAULT)
.virtualColumns(expressionVirtualColumn("v0", "array(\"m1\",\"m2\")", ColumnType.FLOAT_ARRAY))
.filters(new InDimFilter("EXPR$0", ImmutableList.of("1.0", "2.0"), null))
.columns(ImmutableList.of(
"EXPR$0"
))
.build()
),

ImmutableList.of(
new Object[]{1.0f},
new Object[]{1.0f},
new Object[]{2.0f},
new Object[]{2.0f}
)
);
}
Contributor

So, from looking at how things are planned, I thought that we were throwing away the actual where clause. Instead, this test showed me what is actually happening: when we have both filters, the unnest is being planned on top of a scan query which carries the original filter.

This is a bad plan: it is going to needlessly force things to run on the Broker instead of being pushed down. The unnest should be pushed down onto the actual TableDataSource, and the native query here should be a single scan on top of an unnest data source of a table reference.

@somu-imply somu-imply marked this pull request as draft March 2, 2023 16:54
@somu-imply
Contributor Author

The following things have been addressed:

  1. Filter moved inside unnest data source
  2. A transformation is introduced that creates the partial query on the correlate from the partial query on the left
  3. We only create a query data source if there is a LIMIT as the limit needs to be honored before unnest
  4. Updated test cases.

Things to be done:

  1. Add more tests
  2. Modify getValueCardinality for the dimension cursor
  3. Transform the filter on the unnested output column to filters on the left and pass them to the base cursor


UNNEST_PROJECT
UNNEST_PROJECT,
UNNEST_FILTER
Contributor Author

This needs to be removed

Contributor
@imply-cheddar imply-cheddar left a comment

Leaving some comments here. As I did the review, I think that you are likely still in the process of adjusting things as some of the stuff that we've discussed directly isn't yet reflected here.

this.needInitialization = true;
this.allowSet = allowSet;
if (filter != null) {
this.valueMatcher = filter.makeMatcher(getColumnSelectorFactory());
Contributor

Given that this is called in the constructor and then also returned as a public method, maybe switch it around to have the constructor create the ColumnSelectorFactory, store the reference and then use that here/return it from the getColumnSelectorFactory method.


Comment on lines +146 to +148
// If there is a LIMIT in the left query
// It should be honored before unnest
// Create a query data source in that case
Contributor

What kind of query is that? Do we have a test covering it? I'm not sure I believe this comment is true, but I also am not able to imagine a query that can do this.

Contributor Author

Yes, we have a test case. The query is like:
SELECT d3 FROM (select * from druid.numfoo where dim2 IN ('a','b','ab','abc') LIMIT 2), UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3)

Here the limit from the left cannot be applied outside
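A toy illustration (plain Java streams, not Druid code) of why the LIMIT must be honored before the unnest: the two orders produce different results.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LimitBeforeUnnest {
    // Toy input: three rows, each with a multi-value column to unnest.
    static final List<List<String>> ROWS = List.of(
        List.of("a", "b"), List.of("c"), List.of("d", "e"));

    // LIMIT applied to the left (input) rows first, then unnest.
    static List<String> limitThenUnnest(int limit) {
        return ROWS.stream().limit(limit)
                   .flatMap(List::stream).collect(Collectors.toList());
    }

    // Unnest first, then LIMIT: yields a different result, which is why the
    // left query's LIMIT cannot simply be applied outside the unnest.
    static List<String> unnestThenLimit(int limit) {
        return ROWS.stream().flatMap(List::stream)
                   .limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(limitThenUnnest(2)); // [a, b, c]
        System.out.println(unnestThenLimit(2)); // [a, b]
    }
}
```

This is why a query data source is created when the left query has a LIMIT: the limit has to apply to the pre-unnest rows.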

Comment on lines +257 to +276
private PartialDruidQuery updateCorrPartialQueryFromLeft(PartialDruidQuery partialQueryFromLeft)
{
// The DruidCorrelateRule already creates the project and pushes it on the top level
// So get select project from partialQuery
// The filters are present on the partial query of the left
// The group by and having clauses would be on the top level
// Same for the sort
PartialDruidQuery corrQuery = PartialDruidQuery.create(correlateRel);
corrQuery = corrQuery.withWhereFilter(partialQueryFromLeft.getWhereFilter())
.withSelectProject(partialQuery.getSelectProject());
if (partialQuery.getAggregate() != null) {
corrQuery = corrQuery.withAggregate(partialQuery.getAggregate())
.withHavingFilter(partialQuery.getHavingFilter());
}
if (partialQuery.getSort() != null || partialQuery.getSortProject() != null) {
corrQuery = corrQuery.withSort(partialQuery.getSort());
}
return corrQuery;
}

Contributor

This feels like we are throwing away the left-hand side and instead trying to build a query from the unnest part? That seems wrong to me; we should be using the unnest on the RHS to augment the query from the LHS (attach it to the datasource). The LHS is the actual working query; the RHS is just the addition of the unnest portion.

Contributor Author
@somu-imply somu-imply Mar 7, 2023

The DruidCorrelateUnnestRule already puts the select project on top of the correlateRel by adding all the columns from the right to the left. Also, the group by and order by on the query are already present on the correlateRel. So we create the query by adding the select project (the one already projected on top with left + right), the filters from the left, plus aggregates (if present) and sorts (if present) at the top level.

@gianm
Contributor

gianm commented Mar 7, 2023

@somu-imply could you please have a look at #13892? That has some changes that I started doing as part of looking into upgrading Calcite. The upgrade borked a bunch of the existing UNNEST tests and exposed some issues with how the planning works, which led to invalid plans. The changes in #13892 should fix up the planning. IMO the logic is also cleaner.

final DruidRel<?> newDruidRelFilter;
final DruidRel<?> newDruidRel;
final List<RexNode> newProjectExprs = new ArrayList<>();
final List<RexNode> newWheres = new ArrayList<>();

Check notice — Code scanning / CodeQL: Unread local variable. Variable 'List<RexNode> newWheres' is never read.
@Override
public double getDouble()
{
return Double.parseDouble(dimSelector.lookupName(idRef.get()));

Check notice — Code scanning / CodeQL: Missing catch of NumberFormatException. Potential uncaught 'java.lang.NumberFormatException'.
@somu-imply
Contributor Author

Closed in favor of #13934

@somu-imply somu-imply closed this Mar 23, 2023
@somu-imply somu-imply deleted the filtersOnUnnest branch March 23, 2023 23:36