Multi-value string column support for expressions #7588
Conversation
> For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string type, it's the evaluation result of `Boolean.valueOf(string)`.
> Multi-value string dimensions are supported and may be treated as either scalar or array typed values. When treated as
Suggest mentioning that expression postaggs won't normally see array types because of the unnesting behavior of TopN and GroupBy, and maybe mention `array_to_string` and `string_to_array` as a way to pass arrays to expr postaggs.
```java
import java.util.stream.Collectors;
```

```java
/**
 * Base interface of Druid expression language abstract syntax tree nodes
```
can we add that, "All Expr instances are Immutable." (and make them immutable if any of them are not)
```java
  this(ImmutableSet.of(identifier), Collections.emptySet(), Collections.emptySet());
}

public BindingDetails(Set<String> freeVariables, Set<String> scalarVariables, Set<String> arrayVariables)
```
is `Set.union(scalarVariables, arrayVariables) == freeVariables`?
If that is expected then I think we should add a check for that to be true to catch any potential bugs.
In a perfect world, yes, though at least currently I don't have anything in place to infer whether the accumulator input of the 'fold' functions is used as an array or a scalar, so I don't add it to either collection; this check cannot be done quite yet.
```java
{
  private final Set<String> freeVariables;
  private final Set<String> scalarVariables;
  private final Set<String> arrayVariables;
```
can we make all of them `ImmutableSet` to guarantee immutability?
If we have similar cases, e.g. `List` in other classes in this file, then those should also be made `ImmutableXX`.
```java
    scalars = Collections.emptySet();
  }
  return expr.analyzeInputs().mergeWithScalars(scalars);
}
```
not sure why the impl here can't simply `return expr.analyzeInputs()`?
`expr.getIdentifierIfIdentifier()` returns a non-null value only for `IdentifierExpr`.
Can you give an example of a unary expression where just returning `expr.analyzeInputs()` wouldn't work?
```java
}
return left.analyzeInputs()
    .merge(right.analyzeInputs())
    .mergeWithScalars(scalars);
```
Similar to `UnaryExpr`, could this also just `return left.analyzeInputs().merge(right.analyzeInputs())`?
this is a common theme in multiple places, I must be missing something :)
Ah, so the operators and functions have to add context about how they are using `IdentifierExpr` as part of their work for `analyzeInputs`. This is currently done in a shallow manner; in other words, the operators and functions only add direct `IdentifierExpr` children to the 'scalar' or 'array' collections based on the type of function. `IdentifierExpr` themselves do not know their context, so they only add themselves to the set of 'free' variables.
In case an example makes this make more sense, imagine an expression `x + (y - z)`, which is parsed into an `Expr` tree:

```
          BinPlusExpr
         /           \
IdentifierExpr(x)   BinMinusExpr
                    /           \
          IdentifierExpr(y)   IdentifierExpr(z)
```

or, for short:

```
    +
   / \
  x   -
     / \
    y   z
```
Working this out:

```
analyzeInputs(+) = analyzeInputs(x) + analyzeInputs(-) + "set of direct IdentifierExpr children of +"
analyzeInputs(+) = (analyzeInputs(x)) + (analyzeInputs(y) + analyzeInputs(z) + "set of direct IdentifierExpr children of -") + ("set of direct IdentifierExpr children of +")
analyzeInputs(+) = ([free:[x], scalar:[], array:[]]) + ([free:[y], scalar:[], array:[]] + [free:[z], scalar:[], array:[]] + [scalar:[y, z]]) + ([scalar:[x]])
analyzeInputs(+) = [free:[x, y, z], scalar:[x, y, z], array:[]]
```

Hopefully that didn't make it more confusing!
This approach definitely has some shortcomings, such as when a function works with both array and scalar inputs (for example the accumulator input of the 'fold' functions), but luckily there is very little overlap between array and scalar functions right now. I imagine this `analyzeInputs` machinery will be largely reworked, replaced, or scrapped in the future when I try to add the full type inference necessary to fully optimize expressions to use primitive types where possible. For now it's doing ok, usually giving me enough context to determine when an input column is expected to be an array, when an array input is not being used as an array, and when any contradictory usages of an input column as both an array and a scalar are detected.
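To make the shallow bookkeeping concrete, here is a minimal, hypothetical sketch of the free/scalar/array set propagation described above. The class and method names loosely mirror `BindingDetails`, but this is illustrative code, not the actual Druid implementation:

```java
import java.util.HashSet;
import java.util.Set;

public class BindingDetailsSketch {
  final Set<String> free;
  final Set<String> scalar;
  final Set<String> array;

  BindingDetailsSketch(Set<String> free, Set<String> scalar, Set<String> array) {
    this.free = free;
    this.scalar = scalar;
    this.array = array;
  }

  // an IdentifierExpr does not know its context, so it only marks itself 'free'
  static BindingDetailsSketch identifier(String name) {
    Set<String> f = new HashSet<>();
    f.add(name);
    return new BindingDetailsSketch(f, new HashSet<>(), new HashSet<>());
  }

  // union the analyses of two child subtrees
  BindingDetailsSketch merge(BindingDetailsSketch other) {
    Set<String> f = new HashSet<>(free);
    f.addAll(other.free);
    Set<String> s = new HashSet<>(scalar);
    s.addAll(other.scalar);
    Set<String> a = new HashSet<>(array);
    a.addAll(other.array);
    return new BindingDetailsSketch(f, s, a);
  }

  // a scalar operator marks its direct IdentifierExpr children as scalar usages
  BindingDetailsSketch mergeWithScalars(Set<String> directIdentifiers) {
    Set<String> s = new HashSet<>(scalar);
    s.addAll(directIdentifiers);
    return new BindingDetailsSketch(new HashSet<>(free), s, new HashSet<>(array));
  }

  // analyze x + (y - z) bottom-up, following the worked example above
  static BindingDetailsSketch analyzePlusExample() {
    BindingDetailsSketch minus =
        identifier("y").merge(identifier("z")).mergeWithScalars(Set.of("y", "z"));
    return identifier("x").merge(minus).mergeWithScalars(Set.of("x"));
  }

  public static void main(String[] args) {
    BindingDetailsSketch plus = analyzePlusExample();
    // free and scalar both end up as {x, y, z}; array stays empty
    System.out.println("free=" + plus.free + " scalar=" + plus.scalar + " array=" + plus.array);
  }
}
```

Running the `x + (y - z)` example reproduces the derivation above: all three identifiers end up in both the free and scalar sets, and the array set stays empty.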
```
    | '[]' # emptyArray
    ;

lambda : (IDENTIFIER | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
```
nit: from the grammar alone, it would be more obvious that no-arg lambdas are possible if the rule

```
lambda : (IDENTIFIER | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
```

looked like

```
lambda : (IDENTIFIER | '(' ')' | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
```
```java
@Test
public void testApplyFunctions()
{
```
add a case for no arg ..

```java
validateParser(
    "map(() -> 1, x)",
    "(map ([] -> 1), [x])",
    ImmutableList.of("x"),
    ImmutableSet.of(),
    ImmutableSet.of("x")
);
```
There are currently a few missing pieces for this functionality to work with Druid ingestion transformations, since there are a handful of places that lack the value coercion that is happening in the expression selectors on the query side. I'm trying to hunt down all the areas that need to be fixed up so everything works as expected, and depending on how large a change it is and how far along review of this PR is before I get it finished, I might prefer to do it as a follow-up PR.
I would like to do ingestion side support through transforms as a follow-up PR; I think it could benefit from a slight refactoring, and this PR is already big enough. This PR should have no effect on multi-value dimensions used in a transform spec, because it appears to already behave differently than the expression selectors, in that instead of array values being translated to …

I think my main outstanding question for this PR is whether or not this new expression selector behavior should be optional for now, because there is a potential performance impact for expression selectors with bindings from row based selectors, since they will have to use …
I wouldn't worry too much about this overhead, for two reasons:
So, I have mostly looked at the expression language changes and that LGTM. I haven't followed the expression selector changes, but it appears other reviewers did, so feel free to merge if/when you feel it is ready.
Thanks for review all 👍 |
* array support for expression language for multi-value string columns
* fix tests?
* fixes
* more tests
* fixes
* cleanup
* more better, more test
* ignore inspection
* license
* license fix
* inspection
* remove dumb import
* more better
* some comments
* add expr rewrite for arrayfn args for more magic, tests
* test stuff
* more tests
* fix test
* fix test
* castfunc can deal with arrays
* needs more empty array
* more tests, make cast to long array more forgiving
* refactor
* simplify ExprMacro Expr implementations with base classes in core
* oops
* more test
* use Shuttle for Parser.flatten, javadoc, cleanup
* fixes and more tests
* unused import
* fixes
* javadocs, cleanup, refactors
* fix imports
* more javadoc
* more javadoc
* more
* more javadocs, nonnullbydefault, minor refactor
* markdown fix
* adjustments
* more doc
* move initial filter out
* docs
* map empty arg lambda, apply function argument validation
* check function args at parse time instead of eval time
* more immutable
* more more immutable
* clarify grammar
* fix docs
* empty array is string test, we need a way to make arrays better maybe in the future, or define empty arrays as other types..
This PR implements the first part of #7525, adding support for using multi-value string dimensions in the Druid expression language, as either scalar string expressions or explicitly as arrays and manipulating them with array functions. See the proposal of #7525 for additional details about the motivation and high level design descriptions. It might be easiest to see by example though, so here are a few takes on equivalent expressions which are possible with the additions of this PR.
Examples are framed for a hypothetical 'twitter' datasource that ingests tweet events, collecting among other things a multi-value string dimension, 'hashtags'.
Prior to this PR, Druid only provides facilities for operating on the individual strings contained in a multi-value row. This PR preserves this behavior, so the following is a valid expression to prefix any individual 'hashtags' value in a row with 'lol'.
```json
{ "name": "lol_concat", "type": "expression", "outputType": "STRING", "expression": "concat('lol', hashtags)" }
```

This PR also introduces what I've classed as 'apply' functions, which apply a lambda in some manner to one or more arrays. These are the key to the magic that allows the above expression to function, but they can be used explicitly as well. This is an expression virtual column that produces results equivalent to the above expression:

```json
{ "name": "alt_lol_concat", "type": "expression", "outputType": "STRING", "expression": "map((tag) -> concat('lol', tag), hashtags)" }
```

and is what will be done implicitly behind the scenes if the first version of the expression is used instead.
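The semantics of `map` over a multi-value row can be mimicked outside of Druid with plain Java streams. The helper below is purely illustrative (`mapConcat` is a made-up name, not Druid API); it shows the per-element application that the implicit rewrite produces:

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapApplySketch {
  // hypothetical stand-in for: map((tag) -> concat('lol', tag), hashtags)
  static List<String> mapConcat(String prefix, List<String> hashtags) {
    return hashtags.stream()
        .map(tag -> prefix + tag)       // the lambda applied to each element
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // each value of the multi-value row gets the prefix independently
    System.out.println(mapConcat("lol", List.of("druid", "database")));
  }
}
```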
This PR also introduces a handful of new array expressions, which we can use to operate on 'hashtags' explicitly as an array. One such function is `array_length`, which does what you might expect and evaluates to a number indicating how many elements are in the array.

```json
{ "name": "tag_count", "type": "expression", "outputType": "LONG", "expression": "array_length(hashtags)" }
```

We can also compute the length of an array in a more round-about way with `fold`:

```json
{ "name": "alternate_tag_count", "type": "expression", "outputType": "LONG", "expression": "fold((tag, acc) -> acc + 1, hashtags, 0)" }
```

One new thing that I find particularly showcases how much more expressive this makes multi-value string dimensions: by joining the values together into a single string value with `array_to_string`, we can effectively make aggregation by a multi-value string dimension in a top-n or group-by query behave in a manner that is consistent with SQL array types, where grouping would be done for equivalent arrays, instead of the 'unnest' behavior that native Druid uses for multi-value string dimensions.

```json
{ "name": "all_tags_as_string", "type": "expression", "outputType": "STRING", "expression": "array_to_string(hashtags,';')" }
```

which of course you can also do with apply functions:
```json
{ "name": "alt_all_tags_as_string", "type": "expression", "outputType": "STRING", "expression": "fold((tag, acc) -> concat(concat(acc, case_searched(acc == '', '', ';')), tag), hashtags, '')" }
```

This can even be turned back into arrays in the final result set, of the correct type, with expression post-aggregators:

```json
{ "type":"expression", "name":"as_array", "expression": "string_to_array(all_tags_as_string, ';')" }
```

or

```json
{ "type":"expression", "name":"array_of_numbers", "expression": "cast(string_to_array(some_numbers_as_string, ';'), 'LONG_ARRAY')" }
```

This just scratches the surface; the proposal, docs, and unit tests in this PR have more details. I am rather hyped for all of the unimaginable ways people will torture their Druid clusters with these new expressions, and I find them a lot of fun to play with.
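The `fold` and `array_to_string`/`string_to_array` behavior above can be mimicked in plain Java under the same caveat: the method names below are illustrative only and are not Druid API:

```java
import java.util.Arrays;
import java.util.List;

public class ArrayExprSketch {
  // fold((tag, acc) -> acc + 1, hashtags, 0): count elements via an accumulator
  static long foldCount(List<String> hashtags) {
    long acc = 0;                 // the initial accumulator value
    for (String tag : hashtags) {
      acc = acc + 1;              // the lambda applied per element
    }
    return acc;
  }

  // array_to_string(hashtags, ';') followed by string_to_array(..., ';')
  static List<String> roundTrip(List<String> hashtags) {
    String joined = String.join(";", hashtags);   // join into one scalar string
    return Arrays.asList(joined.split(";"));      // split back into an array
  }

  public static void main(String[] args) {
    List<String> tags = List.of("druid", "olap", "database");
    System.out.println(foldCount(tags));
    System.out.println(roundTrip(tags));
  }
}
```

The round-trip recovers the original values for non-empty arrays whose elements don't contain the delimiter, which is the situation the post-aggregator examples above rely on.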
Expression Language
The majority of changes are in the expression language itself, to support array types, array functions, and apply functions. The proposal issue goes over the grammar additions and new functions, but was light on the details of how magic happens, so I will try to elaborate.
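One core mechanism in this section is tree rewriting via a visitor variant that returns nodes rather than just inspecting them. As a rough, hypothetical illustration of that pattern (simplified types and names, not the actual Druid interfaces):

```java
public class ShuttleSketch {
  interface Expr {
    Expr visit(Shuttle shuttle);
  }

  // unlike a plain Visitor, a Shuttle returns a (possibly new) Expr,
  // which is what enables building a rewritten tree
  interface Shuttle {
    Expr visit(Expr expr);
  }

  // a leaf identifier node
  static class Identifier implements Expr {
    final String name;
    Identifier(String name) { this.name = name; }
    @Override
    public Expr visit(Shuttle shuttle) {
      return shuttle.visit(this);
    }
  }

  public static void main(String[] args) {
    // a shuttle that rewrites identifiers, producing a new tree node
    Shuttle renamer = expr -> expr instanceof Identifier
        ? new Identifier(((Identifier) expr).name + "_renamed")
        : expr;
    Expr rewritten = new Identifier("hashtags").visit(renamer);
    System.out.println(((Identifier) rewritten).name);
  }
}
```

A composite node would implement `visit` by first shuttling its children and then passing a rebuilt copy of itself to the shuttle, so the rewrite propagates bottom-up through the tree.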
The `Expr` interface has been expanded with a new method that is similar to the existing `visit` method, but instead of strictly examining the `Expr` tree, it allows creating a new `Expr` dynamically, where the `Shuttle` interface is very similar to the `Visitor` interface except its `visit` method returns an `Expr`; see https://github.com/apache/incubator-druid/pull/7588/files#diff-cbf2baec8ae5861fc7761e11d2fbf664R116

Currently, all
`Expr` rewriting happens in the method `Parser.applyUnappliedIdentifiers`, which expression selectors use to transform any expressions that are using multi-value input bindings in a scalar manner, adding apply functions and lifting variables into the lambdas of apply functions as appropriate. `Expr` also provides a new `BindingDetails` type that is produced by an `analyzeInputs` method on any `Expr` implementation. This type captures the set of 'free' variables (variables that must have input provided by a binding at eval time), as well as which free variables are used as scalar types and which are used as array types, for the entire tree of that `Expr`. This information is used in conjunction with details about the input columns themselves to determine which input variables need to be 'applied' by `Parser.applyUnappliedIdentifiers`.

Expression Virtual Columns
The majority of changes here are focused in `ExpressionSelectors`, which now includes additional logic that examines the columns which make up the bindings used to evaluate the expression, and calls into `Parser.examineBindings` to get the details of how those inputs are used within the expression, enabling transformation of the expression with `Parser.applyUnappliedIdentifiers` as appropriate.

There is also a deficiency here that really needs to be fixed, namely that the type information for the selectors we are going to feed into the input bindings comes from the `ColumnCapabilities` of a selector factory, which is not always reliable and sometimes does not exist at all (nested group-by, postaggs, etc.). The 'adapter' selectors do not have this problem; they can be trusted to have full `ColumnCapabilities` for all input bindings since there is a segment backing them, but row based selectors cannot currently reliably provide this information.

To deal with this, there is a fallback expression selector, which I'm currently calling `RowBasedExpressionColumnValueSelector`, which examines the object bindings for any 'unknown' variables on a row by row basis to find list inputs, and uses the correct expression for those inputs. This is suboptimal. As a crutch to detect this scenario, for now I have added a method I hate to `ColumnCapabilities`, `isComplete`, which is set by selector factories that can be 'trusted' to indicate which columns support multiple values and provide generally reliable information, to determine when we need to use the fallback `RowBasedExpressionColumnValueSelector` for selectors that cannot be trusted.

I believe this fallback selector could be nearly (or maybe entirely) eliminated by enriching the type system to preserve information all the way through, and I'm putting together a proposal to address this, because not having consistent type information in the processing pipeline is related to other problems as well. But for now, there will probably be some performance casualties where we have to use this fallback selector instead of the more optimal selector we could use if we had complete type information that we could trust.
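The row-by-row detection can be sketched as follows. This is illustrative only; the real `RowBasedExpressionColumnValueSelector` is considerably more involved, and the method and class names here are made up:

```java
import java.util.List;
import java.util.Map;

public class FallbackSelectorSketch {
  // per row, check whether any 'unknown' binding actually holds a
  // multi-value (list) input, forcing the array-aware code path
  static boolean needsArrayPath(Map<String, Object> row, List<String> unknownBindings) {
    for (String binding : unknownBindings) {
      if (row.get(binding) instanceof List) {
        return true; // must transform the expression for this row
      }
    }
    return false; // scalar path is safe for this row
  }

  public static void main(String[] args) {
    Map<String, Object> scalarRow = Map.<String, Object>of("hashtags", "druid");
    Map<String, Object> multiRow = Map.<String, Object>of("hashtags", List.of("druid", "olap"));
    System.out.println(needsArrayPath(scalarRow, List.of("hashtags")));
    System.out.println(needsArrayPath(multiRow, List.of("hashtags")));
  }
}
```

Doing this check on every row is exactly the overhead discussed above, which is why complete, trustworthy `ColumnCapabilities` would let the decision be made once, at selector creation time, instead.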
Beyond all the unit tests added, this PR is also reasonably well tested both in a 'debugging' cluster capacity on my laptop with a wide range of contrived multi-value test data, as well as an actual test cluster that has the previously mentioned 'twitter' datasource. I have been trying to break the 'magic' expression transformations as much as possible, and while I suspect there are still some missing cases, I have reasonable confidence in it doing the right thing much of the time.
The remainder of #7525, including adding SQL operators for the new array function expressions, will be done in one or more follow-up PR(s).