Speed up SQL IN using SCALAR_IN_ARRAY.#16388
Conversation
Main changes: 1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of the IN is above inFunctionThreshold. The default value of inFunctionThreshold is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE. 2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular expression, when the size of the SEARCH is above inFunctionExprThreshold. The default value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting it to Integer.MAX_VALUE. 3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is greater than inFunctionThreshold.
asdf2014
left a comment
There was a problem hiding this comment.
Overall LGTM, also need to replace || with <code>||</code> in Markdown table to display it correctly
Co-authored-by: Benedict Jin <asdf2014@apache.org>
| if (valuesNode.size() > plannerContext.queryContext().getInFunctionThreshold() | ||
| && valuesNode.stream().allMatch(node -> node.getKind() == SqlKind.LITERAL && !SqlUtil.isNull(node))) { |
There was a problem hiding this comment.
why not handle mixed versions as well? literals could be handled with this - but leave the other problematic stuff outside in an OR
the NULL case would be also less problematic - as those will be left outside as well...
or there is something wrong with:
x IN (1,2,3,y,null) => DRUID_IN(x,[1,2,3]) OR x = y OR x = null
There was a problem hiding this comment.
Extending this to split the call up into multiple calls would add complexity, and I was trying to keep the logic simple. I figured it would not be common to include NULL or nonliterals in the IN.
There was a problem hiding this comment.
sure - it can be added later if needed!
| reverseLookupKey.negate, | ||
|
|
||
| // Use regular equals, or SCALAR_IN_ARRAY, depending on inFunctionThreshold. | ||
| reversedMatchValues.size() >= plannerContext.queryContext().getInFunctionThreshold(), |
There was a problem hiding this comment.
I wonder if it would look simpler to pass plannerContext instead or inFunctionThreshold - and let this logic live inside makeIn
There was a problem hiding this comment.
Ah, it's like this because different usages of this method use different thresholds. Sometimes it's the inFunctionThreshold, sometimes it's the inFunctionExprThreshold.
| cannotVectorize(); | ||
|
|
||
| testQuery( | ||
| "SELECT dim1 NOT IN ('abc', 'def', 'ghi') AND dim1 < 'zzz', COUNT(*)\n" |
There was a problem hiding this comment.
I think it would be more interesting have these tests apply inequality which could have filtered out some IN literal(s)
There was a problem hiding this comment.
Cool idea. I added a test for this as well: testNotInOrEqualToOneOfThemExpression.
kgyrtkirk
left a comment
There was a problem hiding this comment.
it seems like something odd have happened to you branch; there are some pom.xml changes
6674c35 to
e44a389
Compare
|
I think something got messed up when I pulled the commit from github itself: 492c80c. I just fixed it up and force pushed. The only change since the original patch is the new test |
|
@asdf2014 thanks for reviewing! @kgyrtkirk thanks as well! please let me know if you have any additional comments; if not I'll merge this. |
kgyrtkirk
left a comment
There was a problem hiding this comment.
no more comments/questions etc :)
Main changes:
DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
the IN is above inFunctionThreshold. The default value of inFunctionThreshold
is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
it to Integer.MAX_VALUE.
ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
greater than inFunctionThreshold.
Benchmarks follow. Overall planning for large
INis much faster. Two new ones are marked withDNFonmaster, where I gave up and canceled them after they ran for a few minutes. Those same test cases completed in 100ms each with the patch.