vectorize logical operators and boolean functions#11184
clintropolis merged 21 commits into apache:master from
Conversation
I tested what Calcite does if you just issue a select statement in the web console, and its behavior is similar to what is spelled out in the PR description. I'm in agreement with the behavior change in the description. The one thing I did notice is that calcite treats The feature flag does seem like the only option, since this change does affect query results. I am reading through the code now.
suneet-s
left a comment
Overall looks good. One question and a bunch of nits. I'm going to experiment with this patch and then I'll be happy to approve
```md
* other: `parse_long` is supported for numeric and string types
```
```md
## Legacy logical operator mode

In earlier releases of Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
```
nit: Perhaps we can be more specific about which version.

Suggested change:

```diff
- In earlier releases of Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
+ Prior to the 0.22 release of Apache Druid, the logical 'and' and 'or' operators behaved in a non-standard manner, but this behavior has been changed so that these operations output 'homogeneous' boolean values.
```
```java
  }
  allSame &= argType == currentType;
}
return allSame;
```
If I'm understanding this correctly, a constant expression of a string and a constant expression of null will not be considered the same type. Is this the behavior we want?

I was thinking that a null could in theory be any type, so `null && String` could be the same type.
As I read it, the intent is more like areDefinitelyTheSameType (i.e. it should return false if it cannot prove that the exprs are all the same type).
@clintropolis if that's correct, clarifying javadoc would be useful.
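A minimal sketch of the `areDefinitelyTheSameType` reading described above, using `String` as a stand-in for Druid's expression type (names and shapes here are illustrative, not the PR's actual code): an unknown (null) type means sameness cannot be proven, so the method returns false.

```java
import java.util.List;
import java.util.Objects;

class TypeCheck
{
  // Returns true only if it can PROVE all argument types are the same;
  // a null type (e.g. a null constant whose type is unknown) disproves it.
  static boolean areDefinitelyTheSameType(List<String> argTypes)
  {
    String currentType = null;
    for (String argType : argTypes) {
      if (argType == null) {
        // unknown type: cannot prove the exprs are all the same type
        return false;
      }
      if (currentType == null) {
        currentType = argType;
      } else if (!Objects.equals(argType, currentType)) {
        return false;
      }
    }
    return true;
  }
}
```

Under this reading, `STRING` plus a null-typed constant is correctly rejected, which matches the conservative behavior the reviewers describe.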
```java
Assert.assertEquals(0L, eval("0 && null", bindings).value());
Assert.assertEquals(null, eval("null && null", bindings).value());
// reset
NullHandling.initializeForTests();
```
I think you will want this in a try/finally block, so that if one test fails here it doesn't throw off any other tests that run in the same JVM.
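A sketch of the try/finally pattern being suggested, so a failing assertion cannot leave mutated global state behind for later tests in the same JVM. `GlobalConfig` is a hypothetical stand-in for the static `NullHandling` state, not the actual class:

```java
class GlobalConfig
{
  static boolean replaceNullsWithDefault = true;

  static void initializeForTests()
  {
    replaceNullsWithDefault = true;
  }
}

class LegacyModeTest
{
  void testLegacyMode()
  {
    GlobalConfig.replaceNullsWithDefault = false;
    try {
      // assertions that depend on the modified global state go here
      assert !GlobalConfig.replaceNullsWithDefault;
    }
    finally {
      // always restore, even if an assertion above throws
      GlobalConfig.initializeForTests();
    }
  }
}
```

The finally block guarantees the reset runs whether or not the assertions pass.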
```java
for (int i = 0; i < currentSize; i++) {
  if (nulls != null && nulls[i]) {
```

nit: is the optimizer smart enough to optimize this to

```java
for (int i = 0; nulls != null && i < currentSize; i++) {
  if (nulls[i]) {
```

since if `nulls == null` we don't have to iterate at all, because long arrays default to `0L`, IIRC.

A similar comment applies elsewhere this pattern is used.
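The hoisting being suggested can also be written with the loop-invariant check lifted fully outside the loop. This is an illustrative sketch (the `countNulls` helper is hypothetical, not Druid code):

```java
class NullScan
{
  // Counts set bits in the null vector; skips the scan entirely when there
  // is no null vector, since the nulls != null check is loop-invariant.
  static long countNulls(boolean[] nulls, int currentSize)
  {
    long count = 0;
    if (nulls != null) {
      for (int i = 0; i < currentSize; i++) {
        if (nulls[i]) {
          count++;
        }
      }
    }
    return count;
  }
}
```

Lifting the check out makes the intent explicit rather than relying on the JIT to hoist it.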
```java
// true/null, null/true, null/null -> true
// false/null, null/false -> null
```

It seems like this comment doesn't match the implementation: `null/null -> null`, which is what's described in docs/misc/math-expr.md, but this comment says it should be true.
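For reference, the documented three-valued "or"/"and" behavior can be sketched with boxed `Boolean` values, where `null` stands in for SQL NULL ("unknown"). This is a hypothetical illustration of the truth tables, not Druid's actual implementation:

```java
class ThreeValued
{
  static Boolean and(Boolean l, Boolean r)
  {
    if (Boolean.FALSE.equals(l) || Boolean.FALSE.equals(r)) {
      return false; // false && anything -> false, even null
    }
    if (l == null || r == null) {
      return null;  // true && null, null && null -> null
    }
    return true;
  }

  static Boolean or(Boolean l, Boolean r)
  {
    if (Boolean.TRUE.equals(l) || Boolean.TRUE.equals(r)) {
      return true;  // true || anything -> true, even null
    }
    if (l == null || r == null) {
      return null;  // false || null, null || null -> null
    }
    return false;
  }
}
```

Under these tables `null || null` is indeed `null`, matching math-expr.md rather than the quoted code comment.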
gianm
left a comment
This LGTM except for the default behavior (which is the only reason I didn't ✅). The existing behavior, while not very SQL-y and something that I agree we should move away from, may have people who depend on it. How do you feel about defaulting to legacy behavior, but updating the bundled common.runtime.properties files to set legacy = false? That way, most new users would get the new behavior, but people upgrading would retain existing behavior. In a future release, we could then change the default to legacy = false, maybe at the same time as we swap the null handling default. The idea would be to minimize the number of releases that change default behaviors, by bundling up those changes.
What do you think?
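A sketch of what the bundled `common.runtime.properties` change being proposed might look like, assuming the flag name used at this point in the patch (it is later renamed):

```properties
# New installs get the SQL-like behavior; upgraded clusters that keep their
# existing config retain the legacy default.
druid.expressions.useLegacyLogicalOperators=false
```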
```md
* `true && null`, `null && true`, `null && null` -> `null`
* `false && null`, `null && false` -> `0`

To revert to the behavior of previous Druid versions, `druid.expressions.useLegacyLogicalOperators` can be set to `true` in your Druid configuration.
```

This should also go in configuration/index.md, so all the configuration things are in one place.
```java
/**
 * whether nulls should be replaced with default value.
 * is expression support for nested arrays enabled?
```
```java
{
  // this should only be null in a unit test context, in production this will be injected by the null handling module
  if (INSTANCE == null) {
    throw new IllegalStateException("NullHandling module not initialized, call NullHandling.initializeForTests()");
```
Module name should be ExpressionProcessing.
```java
 * common machinery for processing two-input operators and functions, which should always treat null inputs as null
 * output, and are backed by primitive values instead of object values (and need to use the null vectors instead of
 * checking the vectors themselves for nulls)
 * Basic vector processor that processes 2 inputs and works for both primitive value vectors and object vectors.
```
```java
    : allowNestedArrays;
if (useLegacyLogicalOperators == null) {
  this.useLegacyLogicalOperators = Boolean.parseBoolean(
      System.getProperty(NULL_HANDLING_LEGACY_LOGICAL_OPS_STRING, "false")
```
Can we set the default to "true" for some time? (to minimize sudden disruption)
heh, the current behavior reminds me a lot of https://www.destroyallsoftware.com/talks/wat 😅
I don't love it, but I guess it would be OK to swap the default whenever we swap to SQL compatible null handling (which I also hope isn't so far from now). Vectorization for virtual columns is also not currently on by default, so unless that is also explicitly set, people wouldn't get the benefit from the new behavior other than saner results. The performance increase from these expressions being vectorized would maybe change my stance to be a bit more in favor of turning it on by default, and I do think the current behavior is not good for SQL, but I guess not having disruptions of running clusters in an upgrade is nice. I guess I should write some more docs to encourage people to enable this new mode, and we should call it out in the release notes so that operators who do want SQL compatible behavior know to turn on this setting; the vectorization is a bit of a motivator to make the switch. (I don't think the current behavior should be vectorized, or maybe even could be, because the output type potentially varies row to row depending on the truthy/falsy values of the inputs.)
Thanks for considering the suggestion to be more compatible. I think it's good that the bundled configs set it to false, and I agree that the release notes and docs should push people towards considering setting it to false for existing deployments. IMO, it'd make sense to switch this behavior and a few others (like SQL compatible null handling) at the same time in a future release.
I like the new "strict booleans" property name.
@clintropolis, the SQL standard also applies NULL handling to all functions and operators in general. Does the change also cover the negation operator? Did a prior (or will a future) change handle nulls in math operators?
```java
{
  ExprEval leftVal = left.eval(bindings);
  return leftVal.asBoolean() ? right.eval(bindings) : leftVal;
  if (!ExpressionProcessing.useStrictBooleans()) {
```
This is an inner loop. Can the `!ExpressionProcessing.useStrictBooleans()` be evaluated on setup and reused here, rather than making this call every time? The value can't change. Maybe the JVM will inline the call, but better to just eliminate it in the inner-loop code path.
Yeah, this is how most of the non-vectorized processing currently works. It would be possible, with some refactoring, to make specialized implementations for various cases, but we've mainly been focusing on doing that where possible in the vectorized expression processing, since operating on batches is significantly faster and has less overhead.
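The hoisting the reviewer suggests could look roughly like the following sketch: the mode flag cannot change mid-query, so it is read once at construction time into a final field instead of calling the static method per row. Class and method names here are illustrative, not Druid's actual API:

```java
class AndOperator
{
  private final boolean strictBooleans;

  AndOperator(boolean strictBooleans)
  {
    // evaluated once at setup, e.g. from ExpressionProcessing.useStrictBooleans()
    this.strictBooleans = strictBooleans;
  }

  long eval(long left, long right)
  {
    if (!strictBooleans) {
      // legacy pass-through behavior: return the deciding operand's value
      return left > 0 ? right : left;
    }
    // strict behavior: homogenize output to 1 or 0
    return (left > 0 && right > 0) ? 1L : 0L;
  }
}
```

The per-row path then branches on a final field the JIT can fold, rather than a static call.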
```java
// if left is false, always false
if (leftVal.value() != null && !leftVal.asBoolean()) {
  return ExprEval.ofLongBoolean(false);
```
This looks like a constant. Can it be defined as such, e.g. `static final ExprEval FALSE = ExprEval.ofLongBoolean(false)`, to avoid recomputing the value in an inner loop? Here and below.
Yeah, it seems reasonable to make true and false constants. I may make this change in this PR, or in a follow-up later if nothing else comes up that needs immediate change.
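The constant suggestion can be sketched like this, with `Val` as a hypothetical stand-in for `ExprEval`: the two boolean values are built once and shared, so the hot path never allocates:

```java
class Val
{
  static final Val TRUE = new Val(1L);
  static final Val FALSE = new Val(0L);

  final long value;

  private Val(long value)
  {
    this.value = value;
  }

  static Val ofLongBoolean(boolean b)
  {
    // return the shared instance rather than allocating a new one per row
    return b ? TRUE : FALSE;
  }
}
```

Callers keep using the factory method; only the allocation disappears.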
```java
  return ExprEval.ofLongBoolean(false);
}
ExprEval rightVal;
if (NullHandling.sqlCompatible() || Types.is(leftVal.type(), ExprType.STRING)) {
```
Same comment: can the `NullHandling.sqlCompatible()` call be cached? Better, since the value never changes, can there be separate implementations for the two cases, so we pick the one we want up front and never have to check again? ("Up front" would be the start of this part of the query, if the value can change per query.) Picking the exact right code is not quite code gen, but is pretty close in terms of performance.
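The "pick the implementation up front" idea can be sketched with a strategy selected once at setup time, so the per-row path carries no mode branch. The flag and the two lambdas here are illustrative stand-ins, not Druid's actual code:

```java
import java.util.function.LongBinaryOperator;

class EvalStrategy
{
  // Called once at setup; the returned operator is then used per row.
  static LongBinaryOperator choose(boolean sqlCompatible)
  {
    if (sqlCompatible) {
      // strict boolean output: 1 or 0
      return (l, r) -> (l > 0 && r > 0) ? 1L : 0L;
    }
    // hypothetical legacy path: pass the deciding operand's value through
    return (l, r) -> l > 0 ? r : l;
  }
}
```

This is essentially how the vectorized processors in this PR are structured: the branch is paid once per query, not once per row.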
```java
ExpressionType type = ExpressionTypeConversion.autoDetect(leftVal, rightVal);
boolean result;
switch (type.getType()) {
```
Even better is to have different implementations for each type so the switch is done at setup time, not in the inner loop.
What you describe is the way vector expression processing works, but the non-vectorized implementations of eval are filled with all sorts of branches (in part because the input types are not always knowable at setup time), so this implementation is just being consistent with the other non-vectorized eval implementations. Using vector processors with a vector size of 1 for the non-vectorized engine does seem to offer a slight performance increase, since the majority of branches can be eliminated at setup time, and stronger typing lets numeric primitives avoid boxing/unboxing (but there are still a lot of wrapper objects whose overhead shows when there is one per row instead of one per batch).

This whole area is ripe for code generation of some sort. I've been trying to get the base vectorized implementation in place so we have a baseline to compare different strategies against; maybe in that world we can have better implementations for non-vectorized expression processing as well, but we're not quite there yet, I think.
```java
if (currentType == null) {
  currentType = argType;
}
allSame &= Objects.equals(argType, currentType);
```
Simpler to do a short-circuit AND:

```java
for (...) {
  if (!Objects.equals(argType, currentType)) {
    return false;
  }
}
return true;
```
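Filling out the reviewer's sketch into a complete method might look like this (again with `String` standing in for the expression type): the loop returns false as soon as one argument's type differs from the first, instead of accumulating into an `allSame` flag.

```java
import java.util.List;
import java.util.Objects;

class SameType
{
  static boolean allSameType(List<String> argTypes)
  {
    String currentType = null;
    for (String argType : argTypes) {
      if (currentType == null) {
        currentType = argType;
      } else if (!Objects.equals(argType, currentType)) {
        // short-circuit: one mismatch is enough to decide
        return false;
      }
    }
    return true;
  }
}
```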
This was already behaving correctly in SQL compatible null handling mode, so no changes were needed
As far as I know, these should also already behave correctly prior to this PR
I didn't do a complete survey, but I believe many functions are doing the correct thing with regards to null handling (though it would probably be worth validating and correcting any behavior that isn't consistent).
Description
This PR adds vectorization support for the Druid native expression logical operators `!`, `&&`, `||`, as well as the boolean functions `isnull`, `notnull`, and `nvl`.

The `&&`, `||`, and `nvl` implementations are not quite as optimal as they could be, since they will evaluate all arguments up front, but I will fix this up in another branch I have as a follow-up (split from this one because it was starting to get big). The follow-up will add vectorized conditional expressions, and introduces a filtered vector binding that uses vector matches to allow processing only a subset of input rows. This filtered binding will allow slightly modifying these implementations so that `||` only needs to evaluate the rhs rows for which the lhs is false or null, `&&` where the lhs is true or null, and `nvl` only when the lhs is null.

In the process of doing this PR and writing tests, I came to the conclusion that our logical operators behave very strangely all the time, and I think quite wrong in SQL compatible null handling mode, since null was always treated as "false" instead of "unknown", so I've proposed changing the behavior in this PR.
The first change is around output. Previously the logical operators would pass values through. For Druid numeric values, any value greater than 0 is `true`, so passing through values would result in some strange but technically correct outputs. The new behavior homogenizes output to always be `LONG` boolean values, i.e. `1` or `0`, for boolean operations involving any types.

Previous behavior:
* `100 && 11` -> `11`
* `0.7 || 0.3` -> `0.7`
* `100 && 0` -> `0`
* `'troo' && 'true'` -> `'troo'`
* `'troo' || 'true'` -> `'true'`

New behavior:
* `100 && 11` -> `1`
* `0.7 || 0.3` -> `1`
* `100 && 0` -> `0`
* `'troo' && 'true'` -> `0`
* `'troo' || 'true'` -> `1`

etc.

The implicit conversion of `STRING`, `DOUBLE`, and `LONG` values to booleans remains in effect:
* `LONG` or `DOUBLE` - any value greater than 0 is considered `true`, else `false`
* `STRING` - the value `'true'` (case insensitive) is considered `true`, everything else is `false`

and has been documented.
The second change is that the logical operators in SQL compatible mode will now treat `null` as "unknown" when `druid.generic.useDefaultValueForNull` is set to `false` (SQL compatible null handling mode).

For the "or" operator:
* `true || null`, `null || true` -> `1`
* `false || null`, `null || false`, `null || null` -> `null`

For the "and" operator:
* `true && null`, `null && true`, `null && null` -> `null`
* `false && null`, `null && false` -> `0`

Since this new behavior changes query results (subtly in default mode, since the results will differ but be equivalent in terms of true or false, and fairly significantly in SQL compatible mode, since it will now respect `null` values and treat them differently than always-false), this PR also adds a new configuration, `druid.expressions.useStrictBooleans`, which defaults to false to use the legacy behavior. I encourage everyone to switch to the new behavior to get better performance and SQL compatible semantics. I don't really love this flag existing, but I don't see a way around it unless we are OK with changing query results to use the new behavior only. Note that `&&` and `||` will only be vectorized when `druid.expressions.useStrictBooleans=true`.

benchmarks:
This PR has: