
multi-value string column support for expressions #7588

Merged
fjy merged 55 commits into apache:master from clintropolis:expr-multi-val-string-abomination
Jun 19, 2019

Conversation

@clintropolis
Member

@clintropolis clintropolis commented May 2, 2019

This PR implements the first part of #7525, adding support for using multi-value string dimensions in the Druid expression language, either as scalar string expressions or explicitly as arrays manipulated with array functions. See the proposal in #7525 for additional details about the motivation and the high-level design. It might be easiest to see by example, though, so here are a few takes on equivalent expressions made possible by the additions in this PR.

Examples are framed for a hypothetical 'twitter' datasource that ingests tweet events, collecting among other things a multi-value string dimension, 'hashtags'.

Prior to this PR, Druid only provided facilities for operating on the individual strings contained in a multi-value row. This PR preserves that behavior, so the following remains a valid expression to prefix each individual 'hashtags' value in a row with 'lol'.

{
  "name": "lol_concat",
  "type": "expression",
  "outputType": "STRING",
  "expression": "concat('lol', hashtags)"
}

This PR also introduces what I've classed as 'apply' functions, which apply a lambda in some manner to one or more arrays. These are the key to the magic that allows the above expression to function, but they can be used explicitly as well. Here is an expression virtual column that produces results equivalent to the above expression:

{
  "name": "alt_lol_concat",
  "type": "expression",
  "outputType": "STRING",
  "expression": "map((tag) -> concat('lol', tag), hashtags)"
}

and is what will be done implicitly behind the scenes if the first version of the expression is used instead.
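To make the implicit rewrite concrete, here is a rough Python sketch of the 'map' apply-function semantics (illustrative only, not Druid code; `expr_map` is a hypothetical name):

```python
# Illustrative Python sketch of Druid's 'map' apply function; not Druid code.
def expr_map(fn, arr):
    """Apply a single-argument lambda to each element of a multi-value row."""
    return [fn(x) for x in arr]

hashtags = ["druid", "oss"]
# concat('lol', hashtags) is implicitly rewritten to
# map((tag) -> concat('lol', tag), hashtags), which behaves like:
result = expr_map(lambda tag: "lol" + tag, hashtags)
# result == ["loldruid", "loloss"]
```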

This PR also introduces a handful of new array expressions, which we can use to operate on 'hashtags' explicitly as an array. One such function is array_length, which does what you might expect and evaluates to a number indicating how many elements are in the array.

{
  "name": "tag_count",
  "type": "expression",
  "outputType": "LONG",
  "expression": "array_length(hashtags)"
}

We can also compute the length of an array in a more roundabout way with fold:

{
  "name": "alternate_tag_count",
  "type": "expression",
  "outputType": "LONG",
  "expression": "fold((tag, acc) -> acc + 1, hashtags, 0)"
}
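For readers unfamiliar with fold, here is a Python sketch of the semantics used above (illustrative only, not Druid code; `expr_fold` is a hypothetical name, and the lambda takes the element first and the accumulator second, matching the expression's `(tag, acc)` order):

```python
# Illustrative Python sketch of Druid's 'fold' apply function; not Druid code.
def expr_fold(fn, arr, initial):
    """Left fold: fn takes (element, accumulator) and returns the new accumulator."""
    acc = initial
    for x in arr:
        acc = fn(x, acc)
    return acc

# fold((tag, acc) -> acc + 1, hashtags, 0) counts elements:
count = expr_fold(lambda tag, acc: acc + 1, ["druid", "oss", "lol"], 0)
# count == 3
```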

One new thing that I find particularly showcases how much more expressive this makes multi-value string dimensions: by joining the values together into a single string value with array_to_string, we can effectively make aggregation by a multi-value string dimension in a top-n or group-by query behave in a manner consistent with SQL array types, where grouping is done on equivalent arrays, instead of the 'unnest' behavior that native Druid uses for multi-value string dimensions.

{
  "name": "all_tags_as_string",
  "type": "expression",
  "outputType": "STRING",
  "expression": "array_to_string(hashtags,';')"
}

which of course you can also do with apply functions:

{
  "name": "alt_all_tags_as_string",
  "type": "expression",
  "outputType": "STRING",
  "expression": "fold((tag, acc) -> concat(concat(acc, case_searched(acc == '', '', ';')), tag), hashtags, '')"
}
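A quick Python sketch (illustrative only, not Druid code; both function names are hypothetical) shows why the two virtual columns above agree: joining with a separator and folding with a conditional separator produce the same string:

```python
# Illustrative sketch of why array_to_string and the fold variant are equivalent.
def array_to_string(arr, sep):
    """Mirror of array_to_string(arr, sep): join elements with a separator."""
    return sep.join(arr)

def fold_join(arr, sep):
    # Mirrors fold((tag, acc) -> concat(concat(acc, case_searched(acc == '', '', ';')), tag), hashtags, '')
    acc = ""
    for tag in arr:
        acc = acc + ("" if acc == "" else sep) + tag
    return acc

tags = ["druid", "oss", "lol"]
# Both produce "druid;oss;lol":
assert array_to_string(tags, ";") == fold_join(tags, ";")
```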

This can even be turned back into arrays in the final result set, of the correct type, with expression post-aggregators:

{
  "type":"expression",
  "name":"as_array",
  "expression": "string_to_array(all_tags_as_string, ';')"
}

or

{
  "type":"expression",
  "name":"array_of_numbers",
  "expression": "cast(string_to_array(some_numbers_as_string, ';'), 'LONG_ARRAY')"
}
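As a rough illustration of the round trip (a Python sketch, not Druid code; the helper names and the empty-string edge behavior are my assumptions, not something this PR specifies):

```python
# Illustrative sketch of the string_to_array / cast round trip; not Druid code.
def string_to_array(s, sep):
    """Mirror of string_to_array: split a joined string back into an array."""
    return s.split(sep) if s else []  # empty-string handling is an assumption

def cast_long_array(arr):
    """Mirror of cast(..., 'LONG_ARRAY'): coerce each string element to a long."""
    return [int(x) for x in arr]

assert string_to_array("1;2;3", ";") == ["1", "2", "3"]
assert cast_long_array(string_to_array("1;2;3", ";")) == [1, 2, 3]
```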

This just scratches the surface; the proposal, docs, and unit tests in this PR have more details. I am rather hyped for all of the unimaginable ways people will torture their Druid clusters with these new expressions, and I find them a lot of fun to play with.

Expression Language

The majority of changes are in the expression language itself, to support array types, array functions, and apply functions. The proposal issue goes over the grammar additions and new functions, but was light on the details of how the magic happens, so I will try to elaborate.

The Expr interface has been expanded with a new method that is similar to the existing visit method, but instead of strictly examining the Expr tree, it allows creating a new Expr dynamically:

Expr visit(Shuttle shuttle);

where the Shuttle interface is very similar to the Visitor interface, except that its visit method returns an Expr; see https://github.com/apache/incubator-druid/pull/7588/files#diff-cbf2baec8ae5861fc7761e11d2fbf664R116

Currently, all Expr rewriting happens in the method Parser.applyUnappliedIdentifiers, which expression selectors use to transform any expressions that are using multi-value input bindings in a scalar manner, adding apply functions and lifting variables into the lambdas of apply functions as appropriate.

Expr also provides a new BindingDetails type that is produced by an analyzeInputs method on any Expr implementation. This type captures the set of 'free' variables (variables that must have input provided by a binding at eval time), as well as which free variables are used as scalar types and which are used as array types, for the entire tree of that Expr. This information is used in conjunction with details about the input columns themselves to determine which input variables need to be 'applied' by Parser.applyUnappliedIdentifiers.
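As a loose sketch of the shape of this type (Python, purely illustrative; the real implementation is Java and differs in detail):

```python
# Illustrative sketch of a BindingDetails-like type; the actual Druid class is Java.
from dataclasses import dataclass

@dataclass(frozen=True)
class BindingDetails:
    free: frozenset    # variables needing an input binding at eval time
    scalar: frozenset  # free variables used in a scalar context
    array: frozenset   # free variables used in an array context

    def merge(self, other):
        """Combine details from two subtrees by unioning each set."""
        return BindingDetails(self.free | other.free,
                              self.scalar | other.scalar,
                              self.array | other.array)

a = BindingDetails(frozenset({"x"}), frozenset(), frozenset())
b = BindingDetails(frozenset({"y"}), frozenset({"y"}), frozenset())
merged = a.merge(b)
# merged.free == {"x", "y"}, merged.scalar == {"y"}
```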

Expression Virtual Columns

The majority of changes here are concentrated in ExpressionSelectors, which now includes additional logic that examines the columns making up the bindings used to evaluate the expression and calls into Parser.examineBindings to get the details of how those inputs are used within the expression, enabling transformation of the expression with Parser.applyUnappliedIdentifiers as appropriate.

There is also a deficiency here that really needs to be fixed: type information for the selectors we feed into the input bindings comes from the ColumnCapabilities of a selector factory, which is not always reliable and sometimes does not exist at all (nested group-by, postaggs, etc.). The 'adapter' selectors do not have this problem, since a backing segment lets them provide full ColumnCapabilities for all input bindings, but row-based selectors currently cannot reliably provide this information.

To deal with this, there is a fallback expression selector, which I'm currently calling RowBasedExpressionColumnValueSelector, which examines the object bindings for any 'unknown' variables on a row-by-row basis to find list inputs and use the correct expression for those inputs. This is suboptimal. As a crutch to detect this scenario, for now I have added a method I hate to ColumnCapabilities, isComplete, which is set by selector factories that can be 'trusted' to indicate which columns support multiple values and to provide generally reliable information; it determines when we need to fall back to RowBasedExpressionColumnValueSelector for selectors that cannot be trusted.

I believe this fallback selector could be nearly (or maybe entirely) eliminated by enriching the type system to preserve information all the way through, and I'm putting together a proposal to address this, because not having consistent type information in the processing pipeline is related to other problems as well. But for now, there will probably be some performance casualties where we have to use this fallback selector instead of a more optimal selector if we had complete type information that we could trust.

Beyond all the unit tests added, this PR is also reasonably well tested both in a 'debugging' cluster capacity on my laptop with a wide range of contrived multi-value test data, as well as an actual test cluster that has the previously mentioned 'twitter' datasource. I have been trying to break the 'magic' expression transformations as much as possible, and while I suspect there are still some missing cases, I have reasonable confidence in it doing the right thing much of the time.

The remainder of #7525, including adding SQL operators for the new array function expressions, will be done in one or more follow-up PR(s).

@himanshug
Contributor

subscribe

For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string
type, it's the evaluation result of 'Boolean.valueOf(string)'.

Multi-value string dimensions are supported and may be treated as either scalar or array typed values. When treated as
Contributor


Suggest mentioning that expression postaggs won't normally see array types because of the unnesting behavior of TopN and GroupBy, and maybe mention array_to_string and string_to_array as a way to pass arrays to expr postaggs

import java.util.stream.Collectors;

/**
* Base interface of Druid expression language abstract syntax tree nodes
Contributor


can we add that, "All Expr instances are Immutable." (and make them immutable if any of them are not)

this(ImmutableSet.of(identifier), Collections.emptySet(), Collections.emptySet());
}

public BindingDetails(Set<String> freeVariables, Set<String> scalarVariables, Set<String> arrayVariables)
Contributor


is Set.union(scalarVariables, arrayVariables) = freeVariables ?
If that is expected then I think we should add a check for that to be true to catch any potential bugs.

Member Author


In a perfect world, yes, though at least currently I don't have anything in place to infer whether the accumulator input of the 'fold' functions is used as an array or a scalar, so I don't add it to either collection; this check cannot be done quite yet.

{
private final Set<String> freeVariables;
private final Set<String> scalarVariables;
private final Set<String> arrayVariables;
Contributor


can we make all of them "ImmutableSet" to guarantee immutability?
if we have similar cases e.g. List in other classes in this file then those should also be made ImmutableXX .

scalars = Collections.emptySet();
}
return expr.analyzeInputs().mergeWithScalars(scalars);
}
Contributor


not sure why the impl here can't simply return expr.analyzeInputs()?
expr.getIdentifierIfIdentifier() returns a non-null value only for IdentifierExpr.

can you give an example of a unary expression where just returning expr.analyzeInputs() wouldn't work?

}
return left.analyzeInputs()
.merge(right.analyzeInputs())
.mergeWithScalars(scalars);
Contributor


Similar to UnaryExpr, could this also just return left.analyzeInputs().merge(right.analyzeInputs())?

Contributor


this is a common theme in multiple places, I must be missing something :)

Member Author


Ah, so the operators and functions have to add the context of how they use IdentifierExpr as part of their work for analyzeInputs. This is currently done in a shallow manner; in other words, the operators and functions only add their direct IdentifierExpr children to the 'scalar' or 'array' collections, based on the type of function. IdentifierExpr themselves do not know their context, so they only add themselves to the set of 'free' variables.

In case an example makes this clearer, imagine an expression x + (y - z), which is parsed into the Expr tree:

               BinPlusExpr
               /         \
   IdentifierExpr(x)  BinMinusExpr
                     /            \
          IdentifierExpr(y)   IdentifierExpr(z)

or

   +
 /   \
x     -
    /   \
   y     z

for short.

working this out:

analyzeInputs(+) = analyzeInputs(x) + analyzeInputs(-) + "set of direct IdentifierExpr children of +"
analyzeInputs(+) = (analyzeInputs(x)) + (analyzeInputs(y) + analyzeInputs(z) + "set of direct IdentifierExpr children of -") + ("set of direct IdentifierExpr children of +")
analyzeInputs(+) = ([free:[x], scalar:[], array:[]]) + ([free:[y], scalar:[], array:[]] + [free:[z], scalar:[], array:[]] + [scalar:[y, z]]) + ([scalar:[x]])
analyzeInputs(+) = [free:[x, y, z], scalar:[x, y, z], array:[]]

Hopefully that didn't make it more confusing!

This approach definitely has some shortcomings, such as when a function works with both array and scalar inputs, like the accumulator input of the 'fold' functions, but luckily there is very little overlap between array and scalar functions right now. I imagine this analyzeInputs machinery will be largely reworked, replaced, or scrapped in the future when I try to add the full type inference necessary to fully optimize expressions to use primitive types where possible. For now, though, it usually gives me enough context to determine when an input column is expected to be an array, when an array input is not being used as an array, and when contradictory usages of an input column as both an array and a scalar are detected.
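The shallow analysis worked out above can be sketched in a few lines of Python (purely illustrative; the real implementation is Java and this collapses Expr nodes to strings and tuples):

```python
# Illustrative sketch of the shallow analyzeInputs walk described above; not Druid code.
def analyze(node):
    if isinstance(node, str):                      # IdentifierExpr: knows only that it is free
        return {"free": {node}, "scalar": set(), "array": set()}
    op, left, right = node                         # binary operator node, e.g. ("+", ..., ...)
    l, r = analyze(left), analyze(right)
    # Only *direct* identifier children are classified; here, binary math uses them as scalars.
    direct = {c for c in (left, right) if isinstance(c, str)}
    return {"free": l["free"] | r["free"],
            "scalar": l["scalar"] | r["scalar"] | direct,
            "array": l["array"] | r["array"]}

# x + (y - z):
details = analyze(("+", "x", ("-", "y", "z")))
# details == {"free": {"x","y","z"}, "scalar": {"x","y","z"}, "array": set()}
```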

Contributor


thanks for explanation

| '[]' # emptyArray
;

lambda : (IDENTIFIER | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
Contributor


nit: from the grammar alone, it would be more obvious that no-arg lambdas are possible if the rule looked like

Suggested change
lambda : (IDENTIFIER | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
lambda : (IDENTIFIER | '(' ')' | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr


@Test
public void testApplyFunctions()
{
Contributor


add a case for no-arg ..

Suggested change
{
{
validateParser(
"map(() -> 1, x)",
"(map ([] -> 1), [x])",
ImmutableList.of("x"),
ImmutableSet.of(),
ImmutableSet.of("x")
);

@clintropolis
Member Author

clintropolis commented Jun 10, 2019

There are currently a few missing pieces for this functionality to work with Druid ingestion transformations, since there are a handful of places that lack the value coercion that is happening in the expression selectors on the query side. I'm trying to hunt down all the areas that need to be fixed up so everything works as expected, and depending on how large a change it is and how far along review of this PR is before I finish, I might prefer to do it as a follow-up PR.

@clintropolis
Member Author

I would like to do ingestion side support through transforms as a follow-up PR, I think maybe it could benefit from a slight refactoring and this PR is already big enough.

This PR should have no effect on multi-value dimensions used in a transform spec, because that path already behaves differently than the expression selectors: instead of array values being translated to null for the expression, they are fed into String.valueOf, so they appear as a scalar string. As expected, using the array functions from this PR at ingestion time can have unexpected results, because of this scalar-string behavior and the lack of coercion from array results to lists of strings on the expression outputs; but given the string behavior, this appears to also be true of any expression on multi-value ingestion-time inputs.

I think my main outstanding question for this PR is whether this new expression selector behavior should be optional for now, because there is a potential performance impact for expression selectors with bindings from row-based selectors, since they will have to use RowBasedExpressionColumnValueSelector with its per-row check of input binding types. I do hope to eliminate nearly all of these cases in future work by enhancing the type information throughout the query pipeline, but I'm not certain how far in the future that actually is, and I haven't had the chance to conduct thorough measurements of the performance difference, which is why I brought this up. I could also see a use for turning off this behavior if you are certain there are no multi-value dimensions. On the other hand, we already have a lot of options, and I'm not sure we need more. Maybe some of the improvements done in this PR, like argument validation for functions being done at parse time instead of for each eval, make up for the RowBasedExpressionColumnValueSelector cost enough that it about evens out. Thoughts?

@gianm
Contributor

gianm commented Jun 18, 2019

I think my main outstanding question for this PR is whether this new expression selector behavior should be optional for now, because there is a potential performance impact for expression selectors with bindings from row-based selectors, since they will have to use RowBasedExpressionColumnValueSelector with its per-row check of input binding types. I do hope to eliminate nearly all of these cases in future work by enhancing the type information throughout the query pipeline, but I'm not certain how far in the future that actually is, and I haven't had the chance to conduct thorough measurements of the performance difference, which is why I brought this up. I could also see a use for turning off this behavior if you are certain there are no multi-value dimensions. On the other hand, we already have a lot of options, and I'm not sure we need more. Maybe some of the improvements done in this PR, like argument validation for functions being done at parse time instead of for each eval, make up for the RowBasedExpressionColumnValueSelector cost enough that it about evens out. Thoughts?

I wouldn't worry too much about this overhead, for two reasons:

  • Type check overhead should be very minimal in the grand scheme of everything that needs to happen per row during the input binding phase.
  • Right now the expression layer is quite inefficient due to not having types known in advance, and this is really just one more manifestation of that. We should be addressing this in a holistic way in the future, so I wouldn't worry too much about any particular instance right now. At that point, we should be able to eliminate the overhead completely.

Contributor

@jon-wei jon-wei left a comment


LGTM

@himanshug
Contributor

so, I have mostly looked at the expression language changes and that LGTM.

I haven't followed the expression selector changes, but it appears other reviewers did, so feel free to merge if/when you feel it is ready.

@fjy fjy merged commit 494b8eb into apache:master Jun 19, 2019
@clintropolis
Member Author

Thanks for review all 👍

@clintropolis clintropolis deleted the expr-multi-val-string-abomination branch June 21, 2019 00:48
gianm pushed a commit to implydata/druid-public that referenced this pull request Jul 4, 2019
* array support for expression language for multi-value string columns

* fix tests?

* fixes

* more tests

* fixes

* cleanup

* more better, more test

* ignore inspection

* license

* license fix

* inspection

* remove dumb import

* more better

* some comments

* add expr rewrite for arrayfn args for more magic, tests

* test stuff

* more tests

* fix test

* fix test

* castfunc can deal with arrays

* needs more empty array

* more tests, make cast to long array more forgiving

* refactor

* simplify ExprMacro Expr implementations with base classes in core

* oops

* more test

* use Shuttle for Parser.flatten, javadoc, cleanup

* fixes and more tests

* unused import

* fixes

* javadocs, cleanup, refactors

* fix imports

* more javadoc

* more javadoc

* more

* more javadocs, nonnullbydefault, minor refactor

* markdown fix

* adjustments

* more doc

* move initial filter out

* docs

* map empty arg lambda, apply function argument validation

* check function args at parse time instead of eval time

* more immutable

* more more immutable

* clarify grammar

* fix docs

* empty array is string test, we need a way to make arrays better maybe in the future, or define empty arrays as other types..
@clintropolis clintropolis added this to the 0.16.0 milestone Aug 8, 2019
@clintropolis clintropolis mentioned this pull request May 14, 2021