optimize single input column multi-value expressions#8047
optimize single input column multi-value expressions#8047gianm merged 12 commits intoapache:masterfrom
Conversation
|
Tagged for milestone 0.16.0 since this should be included in the same release with the rest of the multi-value expression support, to avoid a future behavior change on single input multi-value expressions. |
| */ | ||
| default boolean hasArrayOutput() | ||
| { | ||
| return false; |
There was a problem hiding this comment.
What if a function might sometimes return an array and might sometimes return a scalar? Is that:
- Not allowed (functions should not differ in their return types in this way)
- Allowed, and
hasArrayOutputshould betrue - Allowed, and
hasArrayOutputshould befalse
Please adjust the javadocs appropriately.
There was a problem hiding this comment.
This should not be allowed imo, but I think we'll have to discuss this further when we get around to adding full type inference to the expression system. I will adjust the javadoc to state it should not be done currently.
| public void testGroupByWithStringVirtualColumn() | ||
| { | ||
| // Cannot vectorize due to virtual columns. | ||
| // all virtual columns are single input column, so it will work for group by v1, even with multi-value inputs |
There was a problem hiding this comment.
🆒 it's quite nice that this test can pass again.
| /** | ||
| * Returns true if the expression has any array inputs. Note that in some cases, this can be true and | ||
| * {@link BindingDetails#getArrayVariables} can be empty, since the latter can only shallowly collect identifiers | ||
| * that are explicitly used as arrays. |
There was a problem hiding this comment.
I turned this comment over a few times in my mind but don't quite understand it. What does "shallowly" mean in this context and what is "deeper" about hasInputArrays?
There was a problem hiding this comment.
The distinction is that getArrayVariables returns the identifiers we were able to explicitly detect as arrays, where as hasInputArrays means that an expression expects some number of array inputs. Expr.analyzeInputs implementations can currently only correctly categorize arguments that are directly identifiers.
Now, the reason we need to be able to distinguish is an issue because I also accept slop expressions. For example:
array_to_string(concat('foo', tags), ', ')
where tags is a multi-value dimension, the expression bindings will list tags as a scalar input, but array_to_string does in fact take an array argument. By distinguishing these I can still accept this expression and translate it correctly to:
array_to_string(map((tags) -> concat('foo', tags), tags), ',')
and not incorrectly use this optimized single input column introduced in this PR. This was not an issue prior to this PR because being multi-valued in the capabilities was reason enough to exclude an expression from the single column optimization, but after the restriction only applies to things that expects array inputs or produce array outputs.
Does that make sense? I will try to clear up the javadocs for this.
| scalarVariables, | ||
| ImmutableSet.copyOf(Sets.union(arrayVariables, arrayIdentifiers)) | ||
| ImmutableSet.copyOf(Sets.union(arrayVariables, arrayIdentifiers)), | ||
| hasInputArrays || arrayArguments.size() > 0, |
There was a problem hiding this comment.
Is it possible for arrayArguments.size() to be greater than zero, but hasInputArrays to be false? If so how? (Related to previous comment, I think)
There was a problem hiding this comment.
I don't think so, this can probably just check hasInputArrays, but I will confirm by removing the size check and seeing what happens to the tests
There was a problem hiding this comment.
Oh i wasn't paying attention to which function this was, it is possible in this specific context and it's unrelated to the other thing.
Since this is a sort of 'builder' method it is hypothetically possible for this method to be called when adding the initial set of array columns on a binding details, without explicitly calling withArrayInputs. This || then ensures that if even if you forgot to call withArrayInputs the binding details being constructed are correct, becase the hasInputArrays in this case is the value of the existing binding before the array arguments are added.
|
|
||
| lambdaBindingDetails = lambdaExpr.analyzeInputs(); | ||
| bindingDetails = accumulator.with(lambdaBindingDetails).withArrayArguments(function.getArrayInputs(argsExpr)); | ||
| boolean isArrayOutput = function.hasArrayOutput() || (function instanceof ApplyFunction.BaseFoldFunction |
There was a problem hiding this comment.
Using instanceof here seems like a code smell -- is there a nice way to add a method to ApplyFunction that gets the necessary info out?
There was a problem hiding this comment.
Maybe? Will have a look
There was a problem hiding this comment.
I've modified ApplyFunction.hasArrayOutput to now take the LambaExpr as an argument so BaseFoldFunction can just override hasArrayOutput to examine the lambda expression to see if it's output is array typed.
| } | ||
|
|
||
| /** | ||
| * Returns true if a function takes an array arguments |
There was a problem hiding this comment.
- "takes array arguments" or "takes an array argument" (grammar, and also, I'm not sure which one you mean; the two phrases mean different things)
- Clarify intent: what does "takes an array arguments" mean? That it might take array arguments? Or that it must take array arguments?
There was a problem hiding this comment.
the typo is that it's supposed to be any array arguments heh, which I think clears up both things if I fix?
| } | ||
|
|
||
| /** | ||
| * Returns true if function produces an array |
There was a problem hiding this comment.
What happens if it sometimes produces an array, and sometimes doesn't, based on its input type? Similar comment to ApplyFunction.hasArrayOutput: is it disallowed, or should it return true/false?
There was a problem hiding this comment.
see above comment, will adjust docs to mention it shouldn't be done, but i don't have a way to really enforce it yet, need type inference i think
| } | ||
| assertFilterMatchesSkipVectorize(edf("dim4 == '1'"), ImmutableList.of("0")); | ||
| assertFilterMatchesSkipVectorize(edf("dim4 == '3'"), ImmutableList.of("3")); | ||
| assertFilterMatchesSkipVectorize(edf("dim4 == '4'"), ImmutableList.of("4", "5")); |
There was a problem hiding this comment.
I .. don't really remember, I'm going to go with the reason was for fun, to add another test that actually has multiple matches in the case where sql compatible null handling is enabled 😜
| && capabilities.isComplete() | ||
| && !capabilities.hasMultipleValues() | ||
| && !exprDetails.getArrayColumns().contains(column)) { | ||
| && exprDetails.getArrayColumns().size() == 0) { |
| && !exprDetails.isOutputArray() | ||
| && (!capabilities.hasMultipleValues() || exprDetails.getFreeVariables().size() == 1) | ||
| ) { | ||
| // Optimization for dimension selectors that wrap a single underlying string column. |
There was a problem hiding this comment.
The flag chain above is getting complicated!
Could you please alter this comment to reflect what it's looking for? If I understand it right, that should be something like:
Optimization for dimension selectors that wrap a single underlying string column. The string column can be multi-valued, but if so, it must be implicitly mappable (i.e. the expression is not treating it as an array, not wanting to output an array, and the multi-value dimension appears exactly once).
By the way, I have to assume logic like this is being used elsewhere to implicitly map expressions that go through the non-optimized selector. Is there any opportunity to extract the logic into a common helper method? It would make it easier for readers and probably less error prone too.
There was a problem hiding this comment.
The logic is only used in these 2 places in ExpressionSelectors I think, and it's slightly different between the uses because SingleStringInputDimensionSelector supports multi-value inputs, but SingleStringInputCachingExpressionColumnValueSelector does not. I'll play around and see what I can do, or at least comment what's going on heavily.
| public void validateArguments(List<Expr> args) | ||
| { | ||
| return ImmutableSet.copyOf(args); | ||
| if (!(args.size() > 0)) { |
There was a problem hiding this comment.
Could simplify to:
if (args.isEmpty()) {
gianm
left a comment
There was a problem hiding this comment.
LGTM - thanks for the clarifications.
Description
This PR follows up on the changes in #8014, which actually opened the door to using the same optimized selector to evaluate single input column expressions for multi-value string columns, using native Druid behavior to do the
mapof the expression across the values. This works provided a few conditions are met:To achieve this, functions and apply functions now also report if they have array inputs or outputs which is collected on
Expr.BindingDetails, which is already used to analyze the expression inExpressionSelectorsin order to determine the correct type of selector to create.This change also allows multi-value expressions that meet the above conditions to work with group by v1, since the cardinality can be known, however wider multi-value expression support for group by v1 is not available (or really my priority).
This PR has also uncovered some inconsistencies in the null value handling for multi-value expressions, specifically the part that coerces
null,[], and[null]all into[null], which creates some minor differences between using theSingleStringInputDimensionSelectorto "map" an expression across the row values compared to using amapexpression. Ideally I will modify the coercion to not convert[]to[null], which seems to be the problematic issue, but it creates mass of failing tests in theExpressionFilterTestdue to differences in behavior in the underlying selectors, so I intend to investigate this deeper in follow-up work The difference in behavior is apparent in the modified calcite test, where[]is grouped as null with default null handling behavior instead of"". This is likely related or at least similar to #5897 and #4915.This PR has: