vectorize logical operators and boolean functions#11184
clintropolis merged 21 commits into apache:master from
Conversation
I tested what Calcite does if you just issue a select statement in the web console, and its behavior is similar to what is spelled out in the PR description. I'm in agreement with the behavior change in the description. The one thing I did notice is that calcite treats The feature flag does seem like the only option, since this change does affect query results. I am reading through the code now.
suneet-s
left a comment
Overall looks good. One question and a bunch of nits. I'm going to experiment with this patch and then I'll be happy to approve
```md
* other: `parse_long` is supported for numeric and string types
```
```md
## Legacy logical operator mode

In earlier releases of Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
```
nit: Perhaps we can be more specific about which version.

Suggested change:

```diff
- In earlier releases of Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
+ Prior to the 0.22 release of Apache Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
```
```java
  }
  allSame &= argType == currentType;
}
return allSame;
```
If I'm understanding this correctly, a constant expression of a string and a constant expression of null will not be considered the same type. Is this the behavior we want?

I was thinking that a null could in theory be any type, so `null && String` could be the same type.
As I read it, the intent is more like areDefinitelyTheSameType (i.e. it should return false if it cannot prove that the exprs are all the same type).
@clintropolis if that's correct, clarifying javadoc would be useful.
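A minimal sketch of the `areDefinitelyTheSameType` reading described above, using `String` as a stand-in for Druid's expression type (names and shapes here are illustrative, not the PR's actual code): an unknown (null) type means sameness cannot be proven, so the method returns false.

```java
import java.util.List;
import java.util.Objects;

class TypeCheck
{
  // Returns true only if it can PROVE all argument types are the same;
  // a null type (e.g. a null constant whose type is unknown) disproves it.
  static boolean areDefinitelyTheSameType(List<String> argTypes)
  {
    String currentType = null;
    for (String argType : argTypes) {
      if (argType == null) {
        // unknown type: cannot prove the exprs are all the same type
        return false;
      }
      if (currentType == null) {
        currentType = argType;
      } else if (!Objects.equals(argType, currentType)) {
        return false;
      }
    }
    return true;
  }
}
```

Under this reading, `STRING` plus a null-typed constant is correctly rejected, which matches the conservative behavior the reviewers describe.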
```java
Assert.assertEquals(0L, eval("0 && null", bindings).value());
Assert.assertEquals(null, eval("null && null", bindings).value());
// reset
NullHandling.initializeForTests();
```
I think you will want this in a try/finally block, so that if one test fails here it doesn't throw off any other tests that run in the same JVM.
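A sketch of the try/finally pattern being suggested, so a failing assertion cannot leave mutated global state behind for later tests in the same JVM. `GlobalConfig` is a hypothetical stand-in for the static `NullHandling` state, not the actual class:

```java
class GlobalConfig
{
  static boolean replaceNullsWithDefault = true;

  static void initializeForTests()
  {
    replaceNullsWithDefault = true;
  }
}

class LegacyModeTest
{
  void testLegacyMode()
  {
    GlobalConfig.replaceNullsWithDefault = false;
    try {
      // assertions that depend on the modified global state go here
      assert !GlobalConfig.replaceNullsWithDefault;
    }
    finally {
      // always restore, even if an assertion above throws
      GlobalConfig.initializeForTests();
    }
  }
}
```

The finally block guarantees the reset runs whether or not the assertions pass.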
```java
for (int i = 0; i < currentSize; i++) {
  if (nulls != null && nulls[i]) {
```

nit: is the optimizer smart enough to optimize this to

```java
for (int i = 0; nulls != null && i < currentSize; i++) {
  if (nulls[i]) {
```

since if `nulls == null` we don't have to iterate at all, because long arrays default to `0L`, IIRC.

A similar comment applies elsewhere this pattern is used.
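The hoisting being suggested can also be written with the loop-invariant check lifted fully outside the loop. This is an illustrative sketch (the `countNulls` helper is hypothetical, not Druid code):

```java
class NullScan
{
  // Counts set bits in the null vector; skips the scan entirely when there
  // is no null vector, since the nulls != null check is loop-invariant.
  static long countNulls(boolean[] nulls, int currentSize)
  {
    long count = 0;
    if (nulls != null) {
      for (int i = 0; i < currentSize; i++) {
        if (nulls[i]) {
          count++;
        }
      }
    }
    return count;
  }
}
```

Lifting the check out makes the intent explicit rather than relying on the JIT to hoist it.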
```java
// true/null, null/true, null/null -> true
// false/null, null/false -> null
```

It seems like this comment doesn't match the implementation: `null/null -> null`, which is what's described in docs/misc/math-expr.md, but this comment says it should be true.
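For reference, the documented three-valued "or"/"and" behavior can be sketched with boxed `Boolean` values, where `null` stands in for SQL NULL ("unknown"). This is a hypothetical illustration of the truth tables, not Druid's actual implementation:

```java
class ThreeValued
{
  static Boolean and(Boolean l, Boolean r)
  {
    if (Boolean.FALSE.equals(l) || Boolean.FALSE.equals(r)) {
      return false; // false && anything -> false, even null
    }
    if (l == null || r == null) {
      return null;  // true && null, null && null -> null
    }
    return true;
  }

  static Boolean or(Boolean l, Boolean r)
  {
    if (Boolean.TRUE.equals(l) || Boolean.TRUE.equals(r)) {
      return true;  // true || anything -> true, even null
    }
    if (l == null || r == null) {
      return null;  // false || null, null || null -> null
    }
    return false;
  }
}
```

Under these tables `null || null` is indeed `null`, matching math-expr.md rather than the quoted code comment.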
gianm
left a comment
This LGTM except for the default behavior (which is the only reason I didn't ✅). The existing behavior, while not very SQL-y and something that I agree we should move away from, may have people who depend on it. How do you feel about defaulting to legacy behavior, but updating the bundled common.runtime.properties files to set legacy = false? That way, most new users would get the new behavior, but people upgrading would retain existing behavior. In a future release, we could then change the default to legacy = false, maybe at the same time as we swap the null handling default. The idea would be to minimize the number of releases that change default behaviors, by bundling up those changes.
What do you think?
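A sketch of what the bundled `common.runtime.properties` change being proposed might look like, assuming the flag name used at this point in the patch (it is later renamed):

```properties
# New installs get the SQL-like behavior; upgraded clusters that keep their
# existing config retain the legacy default.
druid.expressions.useLegacyLogicalOperators=false
```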
```md
* `true && null`, `null && true`, `null && null` -> `null`
* `false && null`, `null && false` -> `0`

To revert to the behavior of previous Druid versions, `druid.expressions.useLegacyLogicalOperators` can be set to `true` in your Druid configuration.
```

This should also go in configuration/index.md, so all the configuration things are in one place.
```java
/**
 * whether nulls should be replaced with default value.
 * is expression support for nested arrays enabled?
```
```java
{
  // this should only be null in a unit test context, in production this will be injected by the null handling module
  if (INSTANCE == null) {
    throw new IllegalStateException("NullHandling module not initialized, call NullHandling.initializeForTests()");
```
Module name should be ExpressionProcessing.
```java
 * common machinery for processing two-input operators and functions, which should always treat null inputs as null
 * output, and are backed by primitive values instead of object values (and need to use the null vectors instead of
 * checking the vectors themselves for nulls)
 * Basic vector processor that processes 2 inputs and works for both primitive value vectors and object vectors.
```
```java
    : allowNestedArrays;
if (useLegacyLogicalOperators == null) {
  this.useLegacyLogicalOperators = Boolean.parseBoolean(
      System.getProperty(NULL_HANDLING_LEGACY_LOGICAL_OPS_STRING, "false")
```
Can we set the default to "true" for some time? (to minimize sudden disruption)
heh, the current behavior reminds me a lot of https://www.destroyallsoftware.com/talks/wat 😅
I don't love it, but I guess it would be OK to swap the default whenever we swap to SQL compatible null handling (which I also hope isn't so far from now). Vectorization for virtual columns is also not currently on by default, so unless that is also explicitly set, people wouldn't get the benefit from the new behavior other than saner results. The performance increase from these expressions being vectorized would maybe change my stance to be a bit more in favor of turning it on by default, and I do think the current behavior is not good for SQL, but I guess not having disruptions of running clusters in an upgrade is nice. I guess I should write some more docs to encourage people to enable this new mode, and we should call it out in the release notes so that operators who do want SQL compatible behavior know to turn on this setting; the vectorization is a bit of a motivator to make the switch. (I don't think the current behavior should be vectorized, or maybe even could be, because the output type potentially varies row to row depending on the truthy/falsy values of the inputs.)
Thanks for considering the suggestion to be more compatible. I think it's good that the bundled configs set it to false, and I agree that the release notes and docs should push people towards considering setting it to false for existing deployments. IMO, it'd make sense to switch this behavior and a few others (like SQL compatible null handling) at the same time in a future release.
I like the new "strict booleans" property name.
@clintropolis, the SQL standard also applies NULL handling to all functions and operators in general. Does the change also cover the negation operator? Did a prior (or will a future) change handle nulls in math operators?
```java
{
  ExprEval leftVal = left.eval(bindings);
  return leftVal.asBoolean() ? right.eval(bindings) : leftVal;
  if (!ExpressionProcessing.useStrictBooleans()) {
```
This is an inner loop. Can the `!ExpressionProcessing.useStrictBooleans()` be evaluated on setup and reused here, rather than making this call every time? The value can't change. Maybe the JVM will inline the call, but better to just eliminate it in the inner-loop code path.
Yeah, this is how most of the non-vectorized processing currently works. It would be possible, with some refactoring, to make specialized implementations for various cases, but we've mainly been focusing on doing that where possible in the vectorized expression processing, since operating on batches is significantly faster and has less overhead.
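The hoisting the reviewer suggests could look roughly like the following sketch: the mode flag cannot change mid-query, so it is read once at construction time into a final field instead of calling the static method per row. Class and method names here are illustrative, not Druid's actual API:

```java
class AndOperator
{
  private final boolean strictBooleans;

  AndOperator(boolean strictBooleans)
  {
    // evaluated once at setup, e.g. from ExpressionProcessing.useStrictBooleans()
    this.strictBooleans = strictBooleans;
  }

  long eval(long left, long right)
  {
    if (!strictBooleans) {
      // legacy pass-through behavior: return the deciding operand's value
      return left > 0 ? right : left;
    }
    // strict behavior: homogenize output to 1 or 0
    return (left > 0 && right > 0) ? 1L : 0L;
  }
}
```

The per-row path then branches on a final field the JIT can fold, rather than a static call.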
```java
// if left is false, always false
if (leftVal.value() != null && !leftVal.asBoolean()) {
  return ExprEval.ofLongBoolean(false);
```
This looks like a constant. Can it be defined as such, e.g. `static final ExprEval FALSE = ExprEval.ofLongBoolean(false)`, to avoid recomputing the value in an inner loop? Here and below.
Yeah, it seems reasonable to make true and false constants. I may make this change in this PR, or in a follow-up later if nothing else comes up that needs immediate change.
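The constant suggestion can be sketched like this, with `Val` as a hypothetical stand-in for `ExprEval`: the two boolean values are built once and shared, so the hot path never allocates:

```java
class Val
{
  static final Val TRUE = new Val(1L);
  static final Val FALSE = new Val(0L);

  final long value;

  private Val(long value)
  {
    this.value = value;
  }

  static Val ofLongBoolean(boolean b)
  {
    // return the shared instance rather than allocating a new one per row
    return b ? TRUE : FALSE;
  }
}
```

Callers keep using the factory method; only the allocation disappears.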
```java
  return ExprEval.ofLongBoolean(false);
}
ExprEval rightVal;
if (NullHandling.sqlCompatible() || Types.is(leftVal.type(), ExprType.STRING)) {
```
Same comment: can the `NullHandling.sqlCompatible()` call be cached? Better, since the value never changes, can there be separate implementations for the two cases, so we pick the one we want up front and never have to check again? ("Up front" would be the start of this part of the query, if the value can change per query.) Picking the exact right code is not quite code gen, but is pretty close in terms of performance.
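The "pick the implementation up front" idea can be sketched with a strategy selected once at setup time, so the per-row path carries no mode branch. The flag and the two lambdas here are illustrative stand-ins, not Druid's actual code:

```java
import java.util.function.LongBinaryOperator;

class EvalStrategy
{
  // Called once at setup; the returned operator is then used per row.
  static LongBinaryOperator choose(boolean sqlCompatible)
  {
    if (sqlCompatible) {
      // strict boolean output: 1 or 0
      return (l, r) -> (l > 0 && r > 0) ? 1L : 0L;
    }
    // hypothetical legacy path: pass the deciding operand's value through
    return (l, r) -> l > 0 ? r : l;
  }
}
```

This is essentially how the vectorized processors in this PR are structured: the branch is paid once per query, not once per row.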
```java
ExpressionType type = ExpressionTypeConversion.autoDetect(leftVal, rightVal);
boolean result;
switch (type.getType()) {
```
Even better is to have different implementations for each type so the switch is done at setup time, not in the inner loop.
What you describe is the way vector expression processing works, but the non-vectorized implementations of eval are filled with all sorts of branches (in part because the input types are not always knowable at setup time), so this implementation is just being consistent with the other non-vectorized eval implementations. Using vector processors with a vector size of 1 for the non-vectorized engine does seem to offer a slight performance increase, since the majority of branches can be eliminated at setup time, and stronger typing lets numeric primitives avoid boxing/unboxing (but there are still a lot of wrapper objects whose overhead shows when there is one per row instead of one per batch).

This whole area is ripe for code generation of some sort. I've been trying to get the base vectorized implementation in place so we have a baseline to compare different strategies against; maybe in that world we can have better implementations for non-vectorized expression processing as well, but we're not quite there yet, I think.
```java
if (currentType == null) {
  currentType = argType;
}
allSame &= Objects.equals(argType, currentType);
```
Simpler to do a short-circuit AND:

```java
for (...) {
  if (!Objects.equals(argType, currentType)) {
    return false;
  }
}
return true;
```
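Filling out the reviewer's sketch into a complete method might look like this (again with `String` standing in for the expression type): the loop returns false as soon as one argument's type differs from the first, instead of accumulating into an `allSame` flag.

```java
import java.util.List;
import java.util.Objects;

class SameType
{
  static boolean allSameType(List<String> argTypes)
  {
    String currentType = null;
    for (String argType : argTypes) {
      if (currentType == null) {
        currentType = argType;
      } else if (!Objects.equals(argType, currentType)) {
        // short-circuit: one mismatch is enough to decide
        return false;
      }
    }
    return true;
  }
}
```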
This was already behaving correctly in SQL compatible null handling mode, so no changes were needed
As far as I know, these should also already behave correctly prior to this PR
I didn't do a complete survey, but I believe many functions are doing the correct thing with regards to null handling (though it would probably be worth validating and correcting any behavior that isn't consistent).
Description
This PR adds vectorization support for the Druid native expression logical operators `!`, `&&`, `||`, as well as the boolean functions `isnull`, `notnull`, and `nvl`.

The `&&`, `||`, and `nvl` implementations are not quite as optimal as they could be, since they will evaluate all arguments up front, but I will fix this up in another branch I have as a follow-up (split from this one because it was starting to get big). The follow-up will add vectorized conditional expressions, and introduces a filtered vector binding that uses vector matches to allow processing only a subset of input rows. This filtered binding will allow slightly modifying these implementations so that `||` only needs to evaluate the rhs rows for which the lhs is false or null, `&&` where the lhs is true or null, and `nvl` only when the lhs is null.

In the process of doing this PR and writing tests, I came to the conclusion that our logical operators behave very strangely all the time, and I think quite wrong in SQL compatible null handling mode, since null was always treated as "false" instead of "unknown", so I've proposed changing the behavior in this PR.
The first change is around output. Previously the logical operators would pass values through. For Druid numeric values, any value greater than 0 is `true`, so passing through values would result in some strange but technically correct outputs. The new behavior homogenizes output to always be `LONG` boolean values, i.e. `1` or `0`, for boolean operations involving any types.

Previous behavior:
* `100 && 11` -> `11`
* `0.7 || 0.3` -> `0.7`
* `100 && 0` -> `0`
* `'troo' && 'true'` -> `'troo'`
* `'troo' || 'true'` -> `'true'`

New behavior:
* `100 && 11` -> `1`
* `0.7 || 0.3` -> `1`
* `100 && 0` -> `0`
* `'troo' && 'true'` -> `0`
* `'troo' || 'true'` -> `1`

etc.

The implicit conversion of `STRING`, `DOUBLE`, and `LONG` values to booleans remains in effect:
* `LONG` or `DOUBLE` - any value greater than 0 is considered `true`, else `false`
* `STRING` - the value `'true'` (case insensitive) is considered `true`, everything else is `false`

and has been documented.
The second change is that the logical operators in SQL compatible mode will now treat `null` as "unknown" when `druid.generic.useDefaultValueForNull` is set to `false` (SQL compatible null handling mode).

For the "or" operator:
* `true || null`, `null || true` -> `1`
* `false || null`, `null || false`, `null || null` -> `null`

For the "and" operator:
* `true && null`, `null && true`, `null && null` -> `null`
* `false && null`, `null && false` -> `0`

Since this new behavior changes query results (subtly in default mode, since the results will differ but be equivalent in terms of true or false, and fairly significantly in SQL compatible mode, since it will now respect `null` values and treat them differently than always-false), this PR also adds a new configuration, `druid.expressions.useStrictBooleans`, which defaults to false to use the legacy behavior. I encourage everyone to switch to the new behavior to get better performance and SQL compatible semantics. I don't really love this flag existing, but I don't see a way around it unless we are OK with changing query results to use the new behavior only. Note that `&&` and `||` will only be vectorized when `druid.expressions.useStrictBooleans=true`.

benchmarks:
This PR has: