Speed up SQL IN using SCALAR_IN_ARRAY. by gianm · Pull Request #16388 · apache/druid

gianm · 2024-05-03T21:54:57Z

Main changes:

- DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of the IN is above inFunctionThreshold. The default value of inFunctionThreshold is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
- SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular expression, when the size of the SEARCH is above inFunctionExprThreshold. The default value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
- ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is greater than inFunctionThreshold.
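
To illustrate the threshold behavior described above, here is a minimal sketch in Java. It is not the code in DruidSqlValidator: the class `InRewriteSketch` and the method `renderIn` are hypothetical names invented for this example, and it works on rendered SQL strings rather than on Calcite nodes. It only shows the decision of switching to the function form once the list of non-null literals is larger than the threshold.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of the threshold decision; not Druid's actual validator code. */
final class InRewriteSketch {
  /** Mirrors the default of inFunctionThreshold stated in the description. */
  static final int DEFAULT_IN_FUNCTION_THRESHOLD = 100;

  /**
   * Renders a membership test over non-null integer literals. The function form is chosen
   * only when the number of literals is above the given threshold.
   */
  static String renderIn(String column, List<Integer> literals, int inFunctionThreshold) {
    String joined = literals.stream().map(String::valueOf).collect(Collectors.joining(", "));
    if (literals.size() > inFunctionThreshold) {
      return "SCALAR_IN_ARRAY(" + column + ", ARRAY[" + joined + "])";
    }
    return column + " IN (" + joined + ")";
  }

  public static void main(String[] args) {
    List<Integer> values = List.of(1, 2, 3);
    // A low threshold triggers the rewrite.
    System.out.println(renderIn("x", values, 2));
    // Integer.MAX_VALUE keeps the plain IN form, matching "restore the prior behavior".
    System.out.println(renderIn("x", values, Integer.MAX_VALUE));
    // Three values are far below the default of 100, so the plain IN form is kept.
    System.out.println(renderIn("x", values, DEFAULT_IN_FUNCTION_THRESHOLD));
  }
}
```

With a threshold of `Integer.MAX_VALUE` no list can ever be "above" the threshold, which is why that setting restores the prior behavior.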

Benchmarks follow. Overall, planning for large IN is much faster. Two new benchmarks are marked DNF on master: I gave up and canceled them after they ran for a few minutes. With the patch, those same test cases completed in about 100 ms each.

InPlanningBenchmark
===================

inClauseLiteralsCount = 1000
inSubQueryThreshold = 2147483647
rowsPerSegment = 100

## master

Benchmark                                                        Score    Error  Units
InPlanningBenchmark.queryEqualOrInSql                          734.148 ± 46.319  ms/op
InPlanningBenchmark.queryInSql                                 243.272 ± 59.093  ms/op
InPlanningBenchmark.queryJoinEqualOrInSql                      758.371 ± 60.192  ms/op
InPlanningBenchmark.queryMultiEqualOrInSql                     739.495 ± 21.526  ms/op
InPlanningBenchmark.queryStringFunctionInSql                   484.096 ± 46.358  ms/op
InPlanningBenchmark.queryStringFunctionIsNotNullAndNotInSql        DNF
InPlanningBenchmark.queryStringFunctionIsNullOrInSql               DNF

## patch

Benchmark                                                        Score    Error  Units
InPlanningBenchmark.queryEqualOrInSql                           27.063 ±  2.291  ms/op
InPlanningBenchmark.queryInSql                                  24.686 ±  2.113  ms/op
InPlanningBenchmark.queryJoinEqualOrInSql                       29.158 ±  4.165  ms/op
InPlanningBenchmark.queryMultiEqualOrInSql                      29.845 ±  2.914  ms/op
InPlanningBenchmark.queryStringFunctionInSql                    92.489 ±  6.070  ms/op
InPlanningBenchmark.queryStringFunctionIsNotNullAndNotInSql    104.064 ± 31.440  ms/op
InPlanningBenchmark.queryStringFunctionIsNullOrInSql           100.475 ±  9.404  ms/op

SqlReverseLookupBenchmark
=========================

numKeys = 5000000
keysPerValue = 5000
lookupType = immutable

## master

Benchmark                                                        Score     Error  Units
SqlReverseLookupBenchmark.planEquals                           214.932 ±   5.827  ms/op
SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase      1613.542 ± 182.853  ms/op
SqlReverseLookupBenchmark.planNotEquals                        224.494 ±  19.920  ms/op

## patch

Benchmark                                                        Score     Error  Units
SqlReverseLookupBenchmark.planEquals                            26.214 ±   1.315  ms/op
SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase       317.464 ±  19.836  ms/op
SqlReverseLookupBenchmark.planNotEquals                         27.020 ±   1.694  ms/op
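
For a quick read of the tables above, the following sketch computes the master-to-patch ratio for each benchmark that finished on both branches. The scores are copied from the tables; the class name `PlanningSpeedupSketch` is made up for this example, and the error bars are ignored, so the ratios are only rough indications.

```java
/** Hypothetical helper that turns the ms/op scores above into speedup ratios. */
final class PlanningSpeedupSketch {
  /** Ratio of master time to patch time; values above 1 mean the patch is faster. */
  static double speedup(double masterMsPerOp, double patchMsPerOp) {
    return masterMsPerOp / patchMsPerOp;
  }

  public static void main(String[] args) {
    String[] names = {
        "queryEqualOrInSql", "queryInSql", "queryJoinEqualOrInSql",
        "queryMultiEqualOrInSql", "queryStringFunctionInSql",
        "planEquals", "planEqualsInsideAndOutsideCase", "planNotEquals"
    };
    // Scores in ms/op, in the same order as the names above.
    double[] master = {734.148, 243.272, 758.371, 739.495, 484.096, 214.932, 1613.542, 224.494};
    double[] patch = {27.063, 24.686, 29.158, 29.845, 92.489, 26.214, 317.464, 27.020};

    for (int i = 0; i < names.length; i++) {
      System.out.printf("%s: %.1fx%n", names[i], speedup(master[i], patch[i]));
    }
  }
}
```

The two DNF benchmarks are left out because master has no score to compare against.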


asdf2014

Overall LGTM. Also, || needs to be replaced with <code>||</code> in the Markdown table so that it displays correctly.

Co-authored-by: Benedict Jin <asdf2014@apache.org>

kgyrtkirk · 2024-05-09T15:11:56Z

+      if (valuesNode.size() > plannerContext.queryContext().getInFunctionThreshold()
+          && valuesNode.stream().allMatch(node -> node.getKind() == SqlKind.LITERAL && !SqlUtil.isNull(node))) {


Why not handle mixed lists as well? The literals could be handled with this, leaving the other problematic items outside in an OR.
The NULL case would also be less problematic, since NULLs would be left outside as well.

Or is there something wrong with:
x IN (1,2,3,y,null) => DRUID_IN(x,[1,2,3]) OR x = y OR x = null

gianm · 2024-05-10

Extending this to split the call up into multiple calls would add complexity, and I was trying to keep the logic simple. I figured it would not be common to include NULL or nonliterals in the IN.

kgyrtkirk · 2024-05-13

sure - it can be added later if needed!
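
For reference, the split proposed in this thread (which was not implemented in this patch) could look roughly like the sketch below. Everything here is an assumption made for illustration: the class `MixedInSplitSketch`, its methods, and the simplification that only integer literals count as non-null literals. It uses SCALAR_IN_ARRAY where the reviewer's shorthand says DRUID_IN.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the reviewer's suggestion; this is not part of the patch. */
final class MixedInSplitSketch {
  /** Illustrative stand-in for "is a non-null literal": only integer literals qualify. */
  static boolean isNonNullLiteral(String operand) {
    return operand.matches("-?\\d+");
  }

  /**
   * Puts the non-null literals into one SCALAR_IN_ARRAY call and leaves every other operand
   * (column references, NULL) outside as a separate equality joined with OR.
   */
  static String split(String column, List<String> operands) {
    List<String> literals = new ArrayList<>();
    List<String> disjuncts = new ArrayList<>();
    for (String operand : operands) {
      if (isNonNullLiteral(operand)) {
        literals.add(operand);
      } else {
        disjuncts.add(column + " = " + operand);
      }
    }
    if (!literals.isEmpty()) {
      // Keep the literal group first, as in the reviewer's example.
      disjuncts.add(0, "SCALAR_IN_ARRAY(" + column + ", ARRAY[" + String.join(", ", literals) + "])");
    }
    return String.join(" OR ", disjuncts);
  }

  public static void main(String[] args) {
    // Mirrors the example in the thread: x IN (1,2,3,y,null).
    System.out.println(split("x", List.of("1", "2", "3", "y", "NULL")));
  }
}
```

Note that `x = NULL` never evaluates to true under SQL three-valued logic, just as a NULL inside an IN list never produces a match on its own, so the split form keeps the NULL handling of the original IN.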

kgyrtkirk · 2024-05-09T15:19:21Z

              reverseLookupKey.negate,
+
+              // Use regular equals, or SCALAR_IN_ARRAY, depending on inFunctionThreshold.
+              reversedMatchValues.size() >= plannerContext.queryContext().getInFunctionThreshold(),


I wonder if it would be simpler to pass plannerContext, or inFunctionThreshold, instead, and let this logic live inside makeIn.

gianm · 2024-05-10

Ah, it's like this because different usages of this method use different thresholds. Sometimes it's the inFunctionThreshold, sometimes it's the inFunctionExprThreshold.
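
The design point in this reply can be sketched as follows. The real signature of makeIn is not shown in this thread, so the method below is a hypothetical stand-in that works on strings. It only illustrates why the caller passes a boolean: each call site decides which threshold applies, and the helper stays unaware of both.

```java
import java.util.List;

/** Hypothetical stand-in for a makeIn-style helper; the real signature is not shown here. */
final class MakeInSketch {
  /** Builds either the function form or the plain IN form; the caller decides which. */
  static String makeIn(String column, List<String> values, boolean negate, boolean useInFunction) {
    String joined = String.join(", ", values);
    String body = useInFunction
        ? "SCALAR_IN_ARRAY(" + column + ", ARRAY[" + joined + "])"
        : column + " IN (" + joined + ")";
    return negate ? "NOT (" + body + ")" : body;
  }

  public static void main(String[] args) {
    int inFunctionThreshold = 100;   // default stated in the description
    int inFunctionExprThreshold = 2; // default stated in the description
    List<String> values = List.of("'a'", "'b'", "'c'");

    // Reverse-lookup style call site: compares against inFunctionThreshold (>= as in the diff).
    System.out.println(makeIn("dim1", values, false, values.size() >= inFunctionThreshold));
    // Expression-conversion style call site: compares against inFunctionExprThreshold.
    System.out.println(makeIn("dim1", values, false, values.size() > inFunctionExprThreshold));
  }
}
```

With three values, the first call stays a plain IN and the second switches to the function form, even though both go through the same helper.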

kgyrtkirk · 2024-05-09T15:27:12Z

+    cannotVectorize();
+
+    testQuery(
+        "SELECT dim1 NOT IN ('abc', 'def', 'ghi') AND dim1 < 'zzz', COUNT(*)\n"


I think it would be more interesting to have these tests apply an inequality that could filter out some of the IN literal(s).

gianm · 2024-05-10

Cool idea. I added a test for this as well: testNotInOrEqualToOneOfThemExpression.

kgyrtkirk

It seems like something odd has happened to your branch; there are some pom.xml changes.

gianm · 2024-05-13T17:10:29Z

I think something got messed up when I pulled the commit from github itself: 492c80c.

I just fixed it up and force pushed. The only change since the original patch is the new test testNotInOrEqualToOneOfThemExpression.

gianm · 2024-05-14T08:11:22Z

@asdf2014 thanks for reviewing!

@kgyrtkirk thanks as well! please let me know if you have any additional comments; if not I'll merge this.

kgyrtkirk

no more comments/questions etc :)

github-actions Bot added the Area - Documentation and Area - Querying labels May 3, 2024

gianm added 3 commits May 3, 2024 15:20

Revert test.

11940ae

Merge branch 'master' into sql-use-scalar-in-array

86bff63

Additional coverage.

d95ef2c

asdf2014 added the Performance label May 9, 2024

asdf2014 approved these changes May 9, 2024


Comment thread docs/querying/sql-query-context.md Outdated

Update docs/querying/sql-query-context.md

492c80c

Co-authored-by: Benedict Jin <asdf2014@apache.org>

kgyrtkirk reviewed May 9, 2024


github-actions Bot added the Area - Batch Ingestion, Area - Dependencies, Area - Ingestion, and Area - MSQ (For multi stage queries - https://github.com/apache/druid/issues/12262) labels May 10, 2024

kgyrtkirk reviewed May 13, 2024


gianm added 2 commits May 13, 2024 10:08

New test.

7d63105

Merge branch 'master' into sql-use-scalar-in-array

e44a389

gianm force-pushed the sql-use-scalar-in-array branch from 6674c35 to e44a389 Compare May 13, 2024 17:09

kgyrtkirk approved these changes May 14, 2024


gianm merged commit 72432c2 into apache:master May 14, 2024

gianm deleted the sql-use-scalar-in-array branch May 14, 2024 15:09

kfaraz added this to the 31.0.0 milestone Oct 4, 2024

kfaraz mentioned this pull request Oct 11, 2024

[DRAFT] 31.0.0 Release Notes #17332

Closed

gianm mentioned this pull request Apr 11, 2025

Query performance significantly degrades after upgrading from 22 to 27 #17891

Closed

gianm merged 7 commits into apache:master from gianm:sql-use-scalar-in-array
