
multi-value string column support for expressions #7588

Merged
fjy merged 55 commits into apache:master from clintropolis:expr-multi-val-string-abomination
Jun 19, 2019

Conversation

@clintropolis
Member

@clintropolis clintropolis commented May 2, 2019

This PR implements the first part of #7525, adding support for using multi-value string dimensions in the Druid expression language, either as scalar string expressions or explicitly as arrays manipulated with array functions. See the proposal in #7525 for additional details about the motivation and the high-level design. It might be easiest to see by example, though, so here are a few takes on equivalent expressions made possible by the additions in this PR.

Examples are framed for a hypothetical 'twitter' datasource that ingests tweet events, collecting among other things a multi-value string dimension, 'hashtags'.

Prior to this PR, Druid only provided facilities for operating on the individual strings contained in a multi-value row. This PR preserves that behavior, so the following remains a valid expression to prefix each individual 'hashtags' value in a row with 'lol'.

{
  "name": "lol_concat",
  "type": "expression",
  "outputType": "STRING",
  "expression": "concat('lol', hashtags)"
}

This PR also introduces what I've classed as 'apply' functions, which apply a lambda in some manner to one or more arrays. These are the key to the magic that allows the above expression to function, but they can be used explicitly as well. Here is an expression virtual column that produces results equivalent to the above expression:

{
  "name": "alt_lol_concat",
  "type": "expression",
  "outputType": "STRING",
  "expression": "map((tag) -> concat('lol', tag), hashtags)"
}

and is what will be done implicitly behind the scenes if the first version of the expression is used instead.
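To make the implicit rewrite concrete, here is a rough Python sketch of the 'map' apply-function semantics (illustrative only, not Druid code; `expr_map` is a hypothetical name):

```python
# Illustrative Python sketch of Druid's 'map' apply function; not Druid code.
def expr_map(fn, arr):
    """Apply a single-argument lambda to each element of a multi-value row."""
    return [fn(x) for x in arr]

hashtags = ["druid", "oss"]
# concat('lol', hashtags) is implicitly rewritten to
# map((tag) -> concat('lol', tag), hashtags), which behaves like:
result = expr_map(lambda tag: "lol" + tag, hashtags)
# result == ["loldruid", "loloss"]
```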

This PR also introduces a handful of new array expressions, which we can use to operate on 'hashtags' explicitly as an array. One such function is array_length, which does what you might expect and evaluates to a number indicating how many elements are in the array.

{
  "name": "tag_count",
  "type": "expression",
  "outputType": "LONG",
  "expression": "array_length(hashtags)"
}

We can also compute the length of an array in a more roundabout way with fold:

{
  "name": "alternate_tag_count",
  "type": "expression",
  "outputType": "LONG",
  "expression": "fold((tag, acc) -> acc + 1, hashtags, 0)"
}
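For readers unfamiliar with fold, here is a Python sketch of the semantics used above (illustrative only, not Druid code; `expr_fold` is a hypothetical name, and the lambda takes the element first and the accumulator second, matching the expression's `(tag, acc)` order):

```python
# Illustrative Python sketch of Druid's 'fold' apply function; not Druid code.
def expr_fold(fn, arr, initial):
    """Left fold: fn takes (element, accumulator) and returns the new accumulator."""
    acc = initial
    for x in arr:
        acc = fn(x, acc)
    return acc

# fold((tag, acc) -> acc + 1, hashtags, 0) counts elements:
count = expr_fold(lambda tag, acc: acc + 1, ["druid", "oss", "lol"], 0)
# count == 3
```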

One new thing that I find particularly showcases how much more expressive this makes multi-value string dimensions: by joining the values together into a single string value with array_to_string, we can effectively make aggregation by a multi-value string dimension in a top-n or group-by query behave in a manner consistent with SQL array types, where grouping is done on equivalent arrays, instead of the 'unnest' behavior that native Druid uses for multi-value string dimensions.

{
  "name": "all_tags_as_string",
  "type": "expression",
  "outputType": "STRING",
  "expression": "array_to_string(hashtags,';')"
}

which of course you can also do with apply functions:

{
  "name": "alt_all_tags_as_string",
  "type": "expression",
  "outputType": "STRING",
  "expression": "fold((tag, acc) -> concat(concat(acc, case_searched(acc == '', '', ';')), tag), hashtags, '')"
}
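A quick Python sketch (illustrative only, not Druid code; both function names are hypothetical) shows why the two virtual columns above agree: joining with a separator and folding with a conditional separator produce the same string:

```python
# Illustrative sketch of why array_to_string and the fold variant are equivalent.
def array_to_string(arr, sep):
    """Mirror of array_to_string(arr, sep): join elements with a separator."""
    return sep.join(arr)

def fold_join(arr, sep):
    # Mirrors fold((tag, acc) -> concat(concat(acc, case_searched(acc == '', '', ';')), tag), hashtags, '')
    acc = ""
    for tag in arr:
        acc = acc + ("" if acc == "" else sep) + tag
    return acc

tags = ["druid", "oss", "lol"]
# Both produce "druid;oss;lol":
assert array_to_string(tags, ";") == fold_join(tags, ";")
```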

This can even be turned back into arrays in the final result set, of the correct type, with expression post-aggregators:

{
  "type":"expression",
  "name":"as_array",
  "expression": "string_to_array(all_tags_as_string, ';')"
}

or

{
  "type":"expression",
  "name":"array_of_numbers",
  "expression": "cast(string_to_array(some_numbers_as_string, ';'), 'LONG_ARRAY')"
}
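As a rough illustration of the round trip (a Python sketch, not Druid code; the helper names and the empty-string edge behavior are my assumptions, not something this PR specifies):

```python
# Illustrative sketch of the string_to_array / cast round trip; not Druid code.
def string_to_array(s, sep):
    """Mirror of string_to_array: split a joined string back into an array."""
    return s.split(sep) if s else []  # empty-string handling is an assumption

def cast_long_array(arr):
    """Mirror of cast(..., 'LONG_ARRAY'): coerce each string element to a long."""
    return [int(x) for x in arr]

assert string_to_array("1;2;3", ";") == ["1", "2", "3"]
assert cast_long_array(string_to_array("1;2;3", ";")) == [1, 2, 3]
```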

This just scratches the surface; the proposal, docs, and unit tests in this PR have more details. I am rather hyped for all of the unimaginable ways people will torture their Druid clusters with these new expressions, and I find them a lot of fun to play with.

Expression Language

The majority of changes are in the expression language itself, to support array types, array functions, and apply functions. The proposal issue goes over the grammar additions and new functions, but was light on the details of how the magic happens, so I will try to elaborate.

The Expr interface has been expanded with a new method that is similar to the existing visit method, but instead of strictly examining the Expr tree, it allows creating a new Expr dynamically:

Expr visit(Shuttle shuttle);

where the Shuttle interface is very similar to the Visitor interface, except that its visit method returns an Expr; see https://github.com/apache/incubator-druid/pull/7588/files#diff-cbf2baec8ae5861fc7761e11d2fbf664R116

Currently, all Expr rewriting happens in the method Parser.applyUnappliedIdentifiers, which expression selectors use to transform any expressions that are using multi-value input bindings in a scalar manner, adding apply functions and lifting variables into the lambdas of apply functions as appropriate.

Expr also provides a new BindingDetails type that is produced by an analyzeInputs method on any Expr implementation. This type captures the set of 'free' variables (variables that must have input provided by a binding at eval time), as well as which free variables are used as scalar types and which are used as array types, for the entire tree of that Expr. This information is used in conjunction with details about the input columns themselves to determine which input variables need to be 'applied' by Parser.applyUnappliedIdentifiers.
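As a loose sketch of the shape of this type (Python, purely illustrative; the real implementation is Java and differs in detail):

```python
# Illustrative sketch of a BindingDetails-like type; the actual Druid class is Java.
from dataclasses import dataclass

@dataclass(frozen=True)
class BindingDetails:
    free: frozenset    # variables needing an input binding at eval time
    scalar: frozenset  # free variables used in a scalar context
    array: frozenset   # free variables used in an array context

    def merge(self, other):
        """Combine details from two subtrees by unioning each set."""
        return BindingDetails(self.free | other.free,
                              self.scalar | other.scalar,
                              self.array | other.array)

a = BindingDetails(frozenset({"x"}), frozenset(), frozenset())
b = BindingDetails(frozenset({"y"}), frozenset({"y"}), frozenset())
merged = a.merge(b)
# merged.free == {"x", "y"}, merged.scalar == {"y"}
```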

Expression Virtual Columns

The majority of changes here are concentrated in ExpressionSelectors, which now includes additional logic that examines the columns making up the bindings used to evaluate the expression and calls into Parser.examineBindings to get the details of how those inputs are used within the expression, enabling transformation of the expression with Parser.applyUnappliedIdentifiers as appropriate.

There is also a deficiency here that really needs to be fixed: type information for the selectors we feed into the input bindings comes from the ColumnCapabilities of a selector factory, which is not always reliable and sometimes does not exist at all (nested group-by, postaggs, etc.). The 'adapter' selectors do not have this problem, since a backing segment lets them provide full ColumnCapabilities for all input bindings, but row-based selectors currently cannot reliably provide this information.

To deal with this, there is a fallback expression selector, which I'm currently calling RowBasedExpressionColumnValueSelector, which examines the object bindings for any 'unknown' variables on a row-by-row basis to find list inputs and use the correct expression for those inputs. This is suboptimal. As a crutch to detect this scenario, for now I have added a method I hate to ColumnCapabilities, isComplete, which is set by selector factories that can be 'trusted' to indicate which columns support multiple values and to provide generally reliable information; it determines when we need to fall back to RowBasedExpressionColumnValueSelector for selectors that cannot be trusted.

I believe this fallback selector could be nearly (or maybe entirely) eliminated by enriching the type system to preserve information all the way through, and I'm putting together a proposal to address this, because not having consistent type information in the processing pipeline is related to other problems as well. But for now, there will probably be some performance casualties where we have to use this fallback selector instead of a more optimal selector if we had complete type information that we could trust.

Beyond all the unit tests added, this PR is also reasonably well tested both in a 'debugging' cluster capacity on my laptop with a wide range of contrived multi-value test data, as well as an actual test cluster that has the previously mentioned 'twitter' datasource. I have been trying to break the 'magic' expression transformations as much as possible, and while I suspect there are still some missing cases, I have reasonable confidence in it doing the right thing much of the time.

The remainder of #7525, including adding SQL operators for the new array function expressions, will be done in one or more follow-up PR(s).

@himanshug
Contributor

subscribe

For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string
type, it's the evaluation result of 'Boolean.valueOf(string)'.

Multi-value string dimensions are supported and may be treated as either scalar or array typed values. When treated as
Contributor


Suggest mentioning that expression postaggs won't normally see array types because of the unnesting behavior of TopN and GroupBy, and maybe mention array_to_string and string_to_array as a way to pass arrays to expr postaggs

import java.util.stream.Collectors;

/**
* Base interface of Druid expression language abstract syntax tree nodes
Contributor


can we add that, "All Expr instances are Immutable." (and make them immutable if any of them are not)

this(ImmutableSet.of(identifier), Collections.emptySet(), Collections.emptySet());
}

public BindingDetails(Set<String> freeVariables, Set<String> scalarVariables, Set<String> arrayVariables)
Contributor


is Set.union(scalarVariables, arrayVariables) = freeVariables ?
If that is expected then I think we should add a check for that to be true to catch any potential bugs.

Member Author


In a perfect world, yes, though at least currently I don't have anything in place to infer whether the accumulator input of the 'fold' functions is used as an array or a scalar, so I don't add it to either collection; this check cannot be done quite yet.

{
private final Set<String> freeVariables;
private final Set<String> scalarVariables;
private final Set<String> arrayVariables;
Contributor


can we make all of them "ImmutableSet" to guarantee immutability?
if we have similar cases e.g. List in other classes in this file then those should also be made ImmutableXX .

scalars = Collections.emptySet();
}
return expr.analyzeInputs().mergeWithScalars(scalars);
}
Contributor


not sure why the impl here can't simply return expr.analyzeInputs()?
expr.getIdentifierIfIdentifier() returns a non-null value only for IdentifierExpr.

can you give an example of a unary expression where just returning expr.analyzeInputs() wouldn't work?

}
return left.analyzeInputs()
.merge(right.analyzeInputs())
.mergeWithScalars(scalars);
Contributor


Similar to UnaryExpr, could this also just return left.analyzeInputs().merge(right.analyzeInputs())?

Contributor


this is a common theme in multiple places, I must be missing something :)

Member Author


Ah, so the operators and functions have to add the context of how they use IdentifierExpr as part of their work for analyzeInputs. This is currently done in a shallow manner; in other words, the operators and functions only add their direct IdentifierExpr children to the 'scalar' or 'array' collections, based on the type of function. IdentifierExpr themselves do not know their context, so they only add themselves to the set of 'free' variables.

In case an example makes this clearer, imagine an expression x + (y - z), which is parsed into the Expr tree:

               BinPlusExpr
               /         \
   IdentifierExpr(x)  BinMinusExpr
                     /            \
          IdentifierExpr(y)   IdentifierExpr(z)

or

   +
 /   \
x     -
    /   \
   y     z

for short.

working this out:

analyzeInputs(+) = analyzeInputs(x) + analyzeInputs(-) + "set of direct IdentifierExpr children of +"
analyzeInputs(+) = (analyzeInputs(x)) + (analyzeInputs(y) + analyzeInputs(z) + "set of direct IdentifierExpr children of -") + ("set of direct IdentifierExpr children of +")
analyzeInputs(+) = ([free:[x], scalar:[], array:[]]) + ([free:[y], scalar:[], array:[]] + [free:[z], scalar:[], array:[]] + [scalar:[y, z]]) + ([scalar:[x]])
analyzeInputs(+) = [free:[x, y, z], scalar:[x, y, z], array:[]]

Hopefully that didn't make it more confusing!

This approach definitely has some shortcomings, such as when a function works with both array and scalar inputs, like the accumulator input of the 'fold' functions, but luckily there is very little overlap between array and scalar functions right now. I imagine this analyzeInputs machinery will be largely reworked, replaced, or scrapped in the future when I try to add the full type inference necessary to fully optimize expressions to use primitive types where possible. For now, though, it usually gives me enough context to determine when an input column is expected to be an array, when an array input is not being used as an array, and when contradictory usages of an input column as both an array and a scalar are detected.
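The shallow analysis worked out above can be sketched in a few lines of Python (purely illustrative; the real implementation is Java and this collapses Expr nodes to strings and tuples):

```python
# Illustrative sketch of the shallow analyzeInputs walk described above; not Druid code.
def analyze(node):
    if isinstance(node, str):                      # IdentifierExpr: knows only that it is free
        return {"free": {node}, "scalar": set(), "array": set()}
    op, left, right = node                         # binary operator node, e.g. ("+", ..., ...)
    l, r = analyze(left), analyze(right)
    # Only *direct* identifier children are classified; here, binary math uses them as scalars.
    direct = {c for c in (left, right) if isinstance(c, str)}
    return {"free": l["free"] | r["free"],
            "scalar": l["scalar"] | r["scalar"] | direct,
            "array": l["array"] | r["array"]}

# x + (y - z):
details = analyze(("+", "x", ("-", "y", "z")))
# details == {"free": {"x","y","z"}, "scalar": {"x","y","z"}, "array": set()}
```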

Contributor


thanks for explanation

| '[]' # emptyArray
;

lambda : (IDENTIFIER | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
Contributor


nit: from the grammar alone, it would be more obvious that no-arg lambdas are possible if the rule looked like

Suggested change
lambda : (IDENTIFIER | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr
lambda : (IDENTIFIER | '(' ')' | '(' (IDENTIFIER (',' IDENTIFIER)*)? ')') '->' expr


@Test
public void testApplyFunctions()
{
Contributor


add a case for no-arg ..

Suggested change
{
{
validateParser(
"map(() -> 1, x)",
"(map ([] -> 1), [x])",
ImmutableList.of("x"),
ImmutableSet.of(),
ImmutableSet.of("x")
);

@clintropolis
Member Author

clintropolis commented Jun 10, 2019

There are currently a few missing pieces for this functionality to work with Druid ingestion transformations, since there are a handful of places that lack the value coercion that is happening in the expression selectors on the query side. I'm trying to hunt down all the areas that need to be fixed up so everything works as expected, and depending on how large a change it is and how far along review of this PR is before I finish, I might prefer to do it as a follow-up PR.

@clintropolis
Member Author

I would like to do ingestion side support through transforms as a follow-up PR, I think maybe it could benefit from a slight refactoring and this PR is already big enough.

This PR should have no effect on multi-value dimensions used in a transform spec, because that path already behaves differently than the expression selectors: instead of array values being translated to null for the expression, they are fed into String.valueOf, so they appear as a scalar string. As expected, using the array functions from this PR at ingestion time can have unexpected results, because of this scalar-string behavior and the lack of coercion from array results to lists of strings on the expression outputs; but given the string behavior, this appears to also be true of any expression on multi-value ingestion-time inputs.

I think my main outstanding question for this PR is whether this new expression selector behavior should be optional for now, because there is a potential performance impact for expression selectors with bindings from row-based selectors, since they will have to use RowBasedExpressionColumnValueSelector with its per-row check of input binding types. I do hope to eliminate nearly all of these cases in future work by enhancing the type information throughout the query pipeline, but I'm not certain how far in the future that actually is, and I haven't had the chance to conduct thorough measurements of the performance difference, which is why I brought this up. I could also see a use for turning off this behavior if you are certain there are no multi-value dimensions. On the other hand, we already have a lot of options, and I'm not sure we need more. Maybe some of the improvements done in this PR, like argument validation for functions being done at parse time instead of for each eval, make up for the RowBasedExpressionColumnValueSelector cost enough that it about evens out. Thoughts?

@gianm
Contributor

gianm commented Jun 18, 2019

I think my main outstanding question for this PR is whether this new expression selector behavior should be optional for now, because there is a potential performance impact for expression selectors with bindings from row-based selectors, since they will have to use RowBasedExpressionColumnValueSelector with its per-row check of input binding types. I do hope to eliminate nearly all of these cases in future work by enhancing the type information throughout the query pipeline, but I'm not certain how far in the future that actually is, and I haven't had the chance to conduct thorough measurements of the performance difference, which is why I brought this up. I could also see a use for turning off this behavior if you are certain there are no multi-value dimensions. On the other hand, we already have a lot of options, and I'm not sure we need more. Maybe some of the improvements done in this PR, like argument validation for functions being done at parse time instead of for each eval, make up for the RowBasedExpressionColumnValueSelector cost enough that it about evens out. Thoughts?

I wouldn't worry too much about this overhead, for two reasons:

  • Type check overhead should be very minimal in the grand scheme of everything that needs to happen per row during the input binding phase.
  • Right now the expression layer is quite inefficient due to not having types known in advance, and this is really just one more manifestation of that. We should be addressing this in a holistic way in the future, so I wouldn't worry too much about any particular instance right now. At that point, we should be able to eliminate the overhead completely.

Contributor

@jon-wei jon-wei left a comment


LGTM

@himanshug
Contributor

so, I have mostly looked at the expression language changes and that LGTM.

I haven't followed the expression selector changes, but it appears other reviewers did, so feel free to merge if/when you feel it is ready.

@fjy fjy merged commit 494b8eb into apache:master Jun 19, 2019
@clintropolis
Member Author

Thanks for review all 👍

@clintropolis clintropolis deleted the expr-multi-val-string-abomination branch June 21, 2019 00:48
gianm pushed a commit to implydata/druid-public that referenced this pull request Jul 4, 2019
* array support for expression language for multi-value string columns

* fix tests?

* fixes

* more tests

* fixes

* cleanup

* more better, more test

* ignore inspection

* license

* license fix

* inspection

* remove dumb import

* more better

* some comments

* add expr rewrite for arrayfn args for more magic, tests

* test stuff

* more tests

* fix test

* fix test

* castfunc can deal with arrays

* needs more empty array

* more tests, make cast to long array more forgiving

* refactor

* simplify ExprMacro Expr implementations with base classes in core

* oops

* more test

* use Shuttle for Parser.flatten, javadoc, cleanup

* fixes and more tests

* unused import

* fixes

* javadocs, cleanup, refactors

* fix imports

* more javadoc

* more javadoc

* more

* more javadocs, nonnullbydefault, minor refactor

* markdown fix

* adjustments

* more doc

* move initial filter out

* docs

* map empty arg lambda, apply function argument validation

* check function args at parse time instead of eval time

* more immutable

* more more immutable

* clarify grammar

* fix docs

* empty array is string test, we need a way to make arrays better maybe in the future, or define empty arrays as other types..
@clintropolis clintropolis added this to the 0.16.0 milestone Aug 8, 2019
@clintropolis clintropolis mentioned this pull request May 14, 2021