vectorize logical operators and boolean functions#11184

Merged
clintropolis merged 21 commits into apache:master from clintropolis:vectorize-logical-expr
Dec 3, 2021

@clintropolis
Member

@clintropolis clintropolis commented Apr 29, 2021

Description

This PR adds vectorization support for the Druid native expression logical operators, !, &&, ||, as well as boolean functions isnull, notnull, and nvl.

The &&, ||, and nvl implementations are not quite as optimal as they could be, since they will evaluate all arguments up front, but I will fix this up in another branch I have as a follow-up (split from this one because it was starting to get big). The follow-up will add vectorized conditional expressions, and introduces a filtered vector binding that uses vector matches to allow processing only a subset of input rows. This filtered binding will allow slightly modifying these implementations so that || only needs to evaluate the rhs rows for which the lhs is false or null, && where lhs is true or null, and nvl only when lhs is null.
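
As a rough illustration of the row selection described above (not the follow-up branch's actual filtered binding), the sketch below computes which rows would still need the right-hand side evaluated for each operator. The class name, the `Op` enum, and the use of a nullable `Boolean[]` for the left-hand side are all assumptions made for the example; in particular, `nvl` inputs are not really booleans, only their null-ness matters here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: which rows need the rhs evaluated, given the lhs results.
final class FilteredRhsSketch {
  enum Op { OR, AND, NVL }

  static List<Integer> rowsNeedingRhs(Op op, Boolean[] lhs) {
    List<Integer> rows = new ArrayList<>();
    for (int i = 0; i < lhs.length; i++) {
      Boolean v = lhs[i];
      boolean needed;
      switch (op) {
        case OR:
          // rhs only matters where lhs is false or null
          needed = v == null || !v;
          break;
        case AND:
          // rhs only matters where lhs is true or null
          needed = v == null || v;
          break;
        default:
          // NVL: rhs only matters where lhs is null
          needed = v == null;
          break;
      }
      if (needed) {
        rows.add(i);
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    Boolean[] lhs = {true, false, null};
    for (Op op : Op.values()) {
      System.out.println(op + " -> " + rowsNeedingRhs(op, lhs));
    }
  }
}
```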

In the process of doing this PR and writing tests, I came to the conclusion that our logical operators behave very strangely all the time and, I think, quite wrongly in SQL compatible null handling mode, since null was always treated as "false" instead of "unknown". I've therefore proposed changing the behavior in this PR.

The first change is around output. Previously the logical operators would pass values through. For Druid numeric values, any value greater than 0 is true, so passing through values would result in some strange but technically correct outputs. The new behavior homogenizes output to always be LONG boolean values, e.g. 1 or 0 for boolean operations involving any types.

Previous behavior:

  • 100 && 11 -> 11
  • 0.7 || 0.3 -> 0.7
  • 100 && 0 -> 0
  • 'troo' && 'true' -> 'troo'
  • 'troo' || 'true' -> 'true'

New behavior:

  • 100 && 11 -> 1
  • 0.7 || 0.3 -> 1
  • 100 && 0 -> 0
  • 'troo' && 'true' -> 0
  • 'troo' || 'true' -> 1

etc.

The implicit conversion of STRING, DOUBLE, and LONG values to booleans remains in effect:

  • LONG or DOUBLE - any value greater than 0 is considered true, else false
  • STRING - the value 'true' (case insensitive) is considered true, everything else is false.

and has been documented.
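
A minimal sketch of the conversion rules and the homogenized output listed above, using plain Java objects rather than Druid's expression classes. The class and method names are hypothetical, and null is simply treated as false here, which ignores SQL compatible null handling.

```java
// Hypothetical sketch of the implicit boolean conversion and LONG boolean output.
final class TruthinessSketch {
  static boolean asBoolean(Object value) {
    if (value instanceof Number) {
      // LONG or DOUBLE: any value greater than 0 is true
      return ((Number) value).doubleValue() > 0;
    }
    if (value instanceof String) {
      // STRING: only 'true' (case insensitive) is true
      return "true".equalsIgnoreCase((String) value);
    }
    // assumption for this sketch: null and anything else is false
    return false;
  }

  // New behavior: always produce 1 or 0, never pass an input value through.
  static long and(Object left, Object right) {
    return asBoolean(left) && asBoolean(right) ? 1L : 0L;
  }

  static long or(Object left, Object right) {
    return asBoolean(left) || asBoolean(right) ? 1L : 0L;
  }

  public static void main(String[] args) {
    System.out.println(and(100L, 11L));
    System.out.println(or(0.7, 0.3));
    System.out.println(and("troo", "true"));
  }
}
```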

The second change is that the logical operators now treat null as "unknown" in SQL compatible null handling mode, i.e. when druid.generic.useDefaultValueForNull is set to false.

For the "or" operator:

  • true || null, null || true -> 1
  • false || null, null || false, null || null -> null

For the "and" operator:

  • true && null, null && true, null && null -> null
  • false && null, null && false -> 0
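
The two truth tables above can be sketched as follows, with a nullable `Boolean` standing in for an input and a nullable `Long` (1, 0, or null) for the output. The class and method names are made up for the illustration; this is not the PR's implementation.

```java
// Hypothetical sketch of three-valued "and" / "or" with null as "unknown".
final class ThreeValuedLogicSketch {
  static Long or(Boolean left, Boolean right) {
    if (Boolean.TRUE.equals(left) || Boolean.TRUE.equals(right)) {
      return 1L; // any true input decides the result
    }
    if (left == null || right == null) {
      return null; // no true input and at least one unknown
    }
    return 0L; // both false
  }

  static Long and(Boolean left, Boolean right) {
    if (Boolean.FALSE.equals(left) || Boolean.FALSE.equals(right)) {
      return 0L; // any false input decides the result
    }
    if (left == null || right == null) {
      return null; // no false input and at least one unknown
    }
    return 1L; // both true
  }

  public static void main(String[] args) {
    System.out.println("true || null -> " + or(true, null));
    System.out.println("false || null -> " + or(false, null));
    System.out.println("true && null -> " + and(true, null));
    System.out.println("false && null -> " + and(false, null));
  }
}
```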

Since this new behavior changes query results (subtly in default mode, where the results differ but are equivalent in terms of true or false, and fairly significantly in SQL compatible mode, where null values are now respected instead of always being treated as false), this PR also adds a new configuration, druid.expressions.useStrictBooleans. It defaults to false to use the legacy behavior, but I encourage everyone to switch to the new behavior to get better performance and SQL compatible behavior. I don't really love this flag existing, but don't see a way around it unless we are ok with changing query results to use the new behavior only. && and || will only be vectorized when druid.expressions.useStrictBooleans=true.

benchmarks:

      // 30: logical and operator
      "SELECT CAST(long1 as BOOLEAN) AND CAST (long2 as BOOLEAN), COUNT(*) FROM foo GROUP BY 1 ORDER BY 2",
      // 31: isnull, notnull
      "SELECT long5 IS NULL, long3 IS NOT NULL, count(*) FROM foo GROUP BY 1,2 ORDER BY 3"
Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       30           5000000        false  avgt    5  761.667 ± 35.745  ms/op
SqlExpressionBenchmark.querySql       30           5000000        force  avgt    5  152.102 ±  9.390  ms/op
SqlExpressionBenchmark.querySql       31           5000000        false  avgt    5  431.104 ± 32.951  ms/op
SqlExpressionBenchmark.querySql       31           5000000        force  avgt    5  100.884 ±  8.919  ms/op

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@suneet-s
Contributor

I tested what calcite does if you just issue a select statement in the web-console, and its behavior is similar to what is spelt out in the PR description. I'm in agreement with the behavior change in the description. The one thing I did notice is that calcite treats 'true' and true differently. NOTE to self: check for this subtle difference
SELECT 'true' || null returns 'true'
SELECT true || null returns 1

The feature flag does seem like the only option since this change does affect query results. I am reading through the code now.

Contributor

@suneet-s suneet-s left a comment

Overall looks good. One question and a bunch of nits. I'm going to experiment with this patch and then I'll be happy to approve

Comment thread docs/misc/math-expr.md Outdated
* other: `parse_long` is supported for numeric and string types

## Legacy logical operator mode
In earlier releases of Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
Contributor

nit: Perhaps we can be more specific about which version.

Suggested change
In earlier releases of Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
Prior to the 0.22 release of Apache Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.

}
allSame &= argType == currentType;
}
return allSame;
Contributor

if I'm understanding this correctly - a constant expression of a string and a constant expression of null will not be considered the same type. Is this the behavior we want?

I was thinking that a null could in theory be any type, so null && String could be the same type

Contributor

As I read it, the intent is more like areDefinitelyTheSameType (i.e. it should return false if it cannot prove that the exprs are all the same type).

@clintropolis if that's correct, clarifying javadoc would be useful.

Assert.assertEquals(0L, eval("0 && null", bindings).value());
Assert.assertEquals(null, eval("null && null", bindings).value());
// reset
NullHandling.initializeForTests();
Contributor

I think you will want this in a try - finally block so that if 1 test fails here, it doesn't throw off any other tests that run in the JVM
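
The pattern being asked for looks roughly like the sketch below. The static `sqlCompatible` flag is a hypothetical stand-in for whatever process-wide setting the test flips; it is not Druid's `NullHandling` API.

```java
// Hypothetical sketch: restore shared state in finally so a failing test cannot leak it.
final class ResetInFinallySketch {
  // Stand-in for a JVM-wide setting shared by all tests.
  static boolean sqlCompatible = false;

  static void runWithSqlCompatible(Runnable testBody) {
    boolean previous = sqlCompatible;
    sqlCompatible = true;
    try {
      testBody.run();
    } finally {
      // runs even when an assertion in testBody throws
      sqlCompatible = previous;
    }
  }

  public static void main(String[] args) {
    try {
      runWithSqlCompatible(() -> {
        throw new IllegalStateException("simulated test failure");
      });
    } catch (IllegalStateException expected) {
      System.out.println("after failure, sqlCompatible = " + sqlCompatible);
    }
  }
}
```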

Comment on lines +230 to +231
for (int i = 0; i < currentSize; i++) {
if (nulls != null && nulls[i]) {
Contributor

nit: is the optimizer smart enough to optimize this to

Suggested change
for (int i = 0; i < currentSize; i++) {
if (nulls != null && nulls[i]) {
for (int i = 0; nulls !=null && i < currentSize; i++) {
if (nulls[i]) {

since if nulls == null we don't have to iterate over the list because long arrays default to 0L iirc.
Similar comment elsewhere this pattern is used
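
A sketch of the hoisted check, written as a standalone `isNull` over a null vector. It assumes the output array is freshly allocated on every call (so it starts zeroed) and that nothing needs to be written for non-null rows; if the real processor reuses its output buffer or does work in an else branch, the early return would not be safe. All names here are hypothetical.

```java
// Hypothetical sketch: skip the loop entirely when there is no null vector.
final class NullVectorLoopSketch {
  static long[] isNull(boolean[] nulls, int currentSize) {
    // Java zero-initializes new long arrays, so every row starts as 0 (false).
    long[] output = new long[currentSize];
    if (nulls == null) {
      return output; // no null vector means no nulls, nothing to mark
    }
    for (int i = 0; i < currentSize; i++) {
      if (nulls[i]) {
        output[i] = 1L;
      }
    }
    return output;
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(isNull(null, 3)));
    System.out.println(java.util.Arrays.toString(isNull(new boolean[]{true, false, true}, 3)));
  }
}
```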

Comment on lines +558 to +559
// true/null, null/true, null/null -> true
// false/null, null/false -> null
Contributor

It seems like this comment doesn't match the implementation

null/null -> null, which is what's described in docs/misc/math-expr.md, but this comment says it should be true.

@suneet-s suneet-s closed this Nov 11, 2021
@suneet-s suneet-s reopened this Nov 11, 2021
Contributor

@gianm gianm left a comment

This LGTM except for the default behavior (which is the only reason I didn't ✅). The existing behavior, while not very SQLy and something that I agree we should move away from, may have people that depend on it. How do you feel about defaulting to legacy behavior, but updating the bundled common.runtime.properties files to set legacy = false? That way, most new users would get the new behavior, but people upgrading will retain existing behavior. In a future release, we could then change the default to legacy = false. Maybe at the same time as we swap the null handling default? The idea would be to minimize the number of releases that change default behaviors, by bundling up those changes.

What do you think?

Comment thread docs/misc/math-expr.md Outdated
* `true && null`, `null && true`, `null && null` -> `null`
* `false && null`, `null && false` -> `0`

To revert to the behavior of previous Druid versions, `druid.expressions.useLegacyLogicalOperators` can be set to `true` in your Druid configuration. No newline at end of file
Contributor

This should also go in configuration/index.md, so all the configuration things are in one place.


/**
* whether nulls should be replaced with default value.
* [['is expression support for'],['nested arrays'],['enabled?']]
Contributor

🥸

{
// this should only be null in a unit test context, in production this will be injected by the null handling module
if (INSTANCE == null) {
throw new IllegalStateException("NullHandling module not initialized, call NullHandling.initializeForTests()");
Contributor

Module name should be ExpressionProcessing.

* common machinery for processing two input operators and functions, which should always treat null inputs as null
* output, and are backed by a primitive values instead of an object values (and need to use the null vectors instead of
* checking the vector themselves for nulls)
* Basic vector processor that processes 2 inputs and works for both primtive value vectors and object vectors.
Contributor

primitive (spelling)

: allowNestedArrays;
if (useLegacyLogicalOperators == null) {
this.useLegacyLogicalOperators = Boolean.parseBoolean(
System.getProperty(NULL_HANDLING_LEGACY_LOGICAL_OPS_STRING, "false")
Contributor

Can we set the default to "true" for some time? (to minimize sudden disruption)

@clintropolis
Member Author

The existing behavior, while not very SQLy and something that I agree we should move away from, may have people that depend on it.

heh, the current behavior reminds me a lot of https://www.destroyallsoftware.com/talks/wat 😅

How do you feel about defaulting to legacy behavior, but updating the bundled common.runtime.properties files to set legacy = false? That way, most new users would get the new behavior, but people upgrading will retain existing behavior. In a future release, we could then change the default to legacy = false. Maybe at the same time as we swap the null handling default?

I don't love it, but I guess it would be ok to swap the default whenever we swap to SQL compatible null handling (which I also hope isn't so far from now). Vectorization for virtual columns is also not currently on by default, so unless that is also explicitly set people wouldn't get the benefit from the new behavior... other than saner results. The performance increase for these expressions being vectorized would maybe change my stance to be a bit more in favor of turning it on by default, and I do think the current behavior is ... not good for SQL, but I guess not having disruptions of running clusters in an upgrade is nice.

I guess I should write some more docs to try to encourage people to enable this new mode and we should shout it out in the release notes so that operators who do want SQL compatible behavior know to turn on this setting, and the vectorization is a bit of a motivator to make the switch (I don't think the current behavior should be vectorized or maybe even could be vectorized because the output type is potentially varying row to row depending on the truthy/falsy values of inputs)

@gianm
Contributor

gianm commented Nov 23, 2021

Thanks for considering the suggestion to be more compatible. I think it's good that the bundled configs set it to false, and I agree that the release notes and docs should push people towards considering setting it to false for existing deployments.

IMO, it'd make sense to switch this behavior and a few others (like SQL compatible null handling) at the same time in a future release.

@gianm
Contributor

gianm commented Nov 24, 2021

I like the new "strict booleans" property name.

@paul-rogers
Contributor

@clintropolis, the NULL handling proposed is standard SQL trinary logic, so +1 on that change.

SQL also applies NULL handling to all functions and operators: in general, f(NULL) -> NULL for every function f, unless special cased, as in the logical operators.

Does the change also cover the negation operator? NOT a where a is NULL results in NULL. (Note that a IS NOT NULL is entirely different!)

Did a prior (or will a future) change handle nulls in math operators? 10 + NULL -> NULL?

{
ExprEval leftVal = left.eval(bindings);
return leftVal.asBoolean() ? right.eval(bindings) : leftVal;
if (!ExpressionProcessing.useStrictBooleans()) {
Contributor

This is an inner loop. Can the !ExpressionProcessing.useStrictBooleans() be evaluated on setup and reused here rather than making this call every time? The value can't change. Maybe the JVM will inline the call, but better to just eliminate it in the inner-loop code path.

Member Author

yeah, this is how most of the non-vectorized processing is currently, it would be possible with some refactoring I think to make specialized implementations for various cases, but we've been mainly focusing on doing that where possible in the vectorized expression processing, since operating on batches is significantly faster and has less overhead


// if left is false, always false
if (leftVal.value() != null && !leftVal.asBoolean()) {
return ExprEval.ofLongBoolean(false);
Contributor

This looks like a constant. Can it be defined as such? static final ExprEval.FALSE = ExprEval.ofLongBoolean(false) to avoid recomputing the value in an inner loop? Here and below.

Member Author

yeah, it seems reasonable to make true and false constants, I may make this change in this PR or in a follow-up later if nothing else comes up that needs immediate change

return ExprEval.ofLongBoolean(false);
}
ExprEval rightVal;
if (NullHandling.sqlCompatible() || Types.is(leftVal.type(), ExprType.STRING)) {
Contributor

Same comment, can the NullHandling.sqlCompatible() be cached?

Better, since the value never changes, can there be separate implementations for the two cases so we pick the one we want up front and never have to check again? ("Up-front" would be the start of this part of a query, if the value can change per query.)

Picking the exact right code is not quite code gen, but is pretty close in terms of performance.
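
One way to read this suggestion, sketched with plain `long` inputs: check the flag once while building the operator and hand back a specialized implementation, so the per-row path contains no flag check. `makeAnd` and the class name are hypothetical, nulls are ignored, and the legacy branch only mirrors the pass-through rule quoted above (`left` if it is falsy, otherwise `right`).

```java
import java.util.function.LongBinaryOperator;

// Hypothetical sketch: pick the "and" implementation once, at setup time.
final class SetupTimeSelectionSketch {
  static LongBinaryOperator makeAnd(boolean useStrictBooleans) {
    if (useStrictBooleans) {
      // strict mode: always a LONG boolean, 1 or 0
      return (left, right) -> (left > 0 && right > 0) ? 1L : 0L;
    }
    // legacy mode: pass a value through instead of homogenizing
    return (left, right) -> left > 0 ? right : left;
  }

  public static void main(String[] args) {
    LongBinaryOperator legacy = makeAnd(false);
    LongBinaryOperator strict = makeAnd(true);
    // the flag was consulted twice above and is never consulted per row
    System.out.println("legacy 100 && 11 -> " + legacy.applyAsLong(100L, 11L));
    System.out.println("strict 100 && 11 -> " + strict.applyAsLong(100L, 11L));
  }
}
```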


ExpressionType type = ExpressionTypeConversion.autoDetect(leftVal, rightVal);
boolean result;
switch (type.getType()) {
Contributor

Even better is to have different implementations for each type so the switch is done at setup time, not in the inner loop.

Member Author

What you describe is the way vector expression processing works, but the non-vectorized implementations of eval are filled with all sorts of branches (in part due to not always being able to know the input types at setup time), so this implementation is just being consistent with other non-vectorized eval implementations. Using vector processors with a vector size of 1 for the non-vectorized engine does seem to offer a light performance increase, since the majority of branches can be eliminated at setup time and the stronger typing lets numeric primitives avoid boxing/unboxing (but there is still a lot of overhead from wrapper objects, which show their weight when there is one per row instead of one per batch).

This whole area is ripe for code generation of some sort. I've been trying to get the base vectorized implementation in place so we have a baseline to compare different strategies against, so maybe in that world we can have better implementations for non-vectorized expression processing as well, but we're not quite there yet I think.

if (currentType == null) {
currentType = argType;
}
allSame &= Objects.equals(argType, currentType);
Contributor

Simpler to do a short-circuit AND:

    for ... {
      if (!Objects.equals(argType, currentType)) {
        return false;
      }
    }
    return true;
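
Filled out into something runnable, the short-circuit version might look like the sketch below. The `Type` enum and the method name are placeholders, and treating a null type as "unknown, so not provably the same" is an assumption borrowed from the areDefinitelyTheSameType reading earlier in this thread, not something confirmed by the PR code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: return as soon as two argument types are known to differ.
final class SameTypeSketch {
  enum Type { LONG, DOUBLE, STRING }

  static boolean areDefinitelySameType(List<Type> argTypes) {
    Type currentType = null;
    for (Type argType : argTypes) {
      if (argType == null) {
        return false; // assumption: unknown type, cannot prove equality
      }
      if (currentType == null) {
        currentType = argType; // first argument sets the type to compare against
      } else if (argType != currentType) {
        return false; // short-circuit on the first mismatch
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(areDefinitelySameType(Arrays.asList(Type.LONG, Type.LONG)));
    System.out.println(areDefinitelySameType(Arrays.asList(Type.LONG, Type.STRING)));
  }
}
```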

@clintropolis
Member Author

clintropolis commented Dec 1, 2021

Does the change also cover the negation operator? NOT a where a is NULL results in NULL. (Note that a IS NOT NULL is entirely different!)

This was already behaving correctly in SQL compatible null handling mode, so no changes were needed

Did a prior (or will a future) change handle nulls in math operators? 10 + NULL -> NULL?

As far as I know, these should also already behave correctly prior to this PR

SQL also applies NULL handling to all functions and operators: in general, f(NULL) -> NULL for every function f, unless special cased, as in the logical operators.

I didn't do a complete survey, but I believe many functions are doing the correct thing with regard to null handling (though it would probably be worth validating and correcting any behavior that isn't consistent)
