Skip to content

optimize single input column multi-value expressions#8047

Merged
gianm merged 12 commits intoapache:masterfrom
clintropolis:expr-array-output
Aug 2, 2019
Merged

optimize single input column multi-value expressions#8047
gianm merged 12 commits intoapache:masterfrom
clintropolis:expr-array-output

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Jul 9, 2019

Description

This PR follows up on the changes in #8014, which actually opened the door to using the same optimized selector to evaluate single input column expressions for multi-value string columns, using native Druid behavior to do the map of the expression across the values. This works provided a few conditions are met:

  • the expression does not operate on the multi-value input as an array
  • the expression does not produce an array
  • the single multi-value column is used at most once in the expression

To achieve this, functions and apply functions now also report if they have array inputs or outputs which is collected on Expr.BindingDetails, which is already used to analyze the expression in ExpressionSelectors in order to determine the correct type of selector to create.

This change also allows multi-value expressions that meet the above conditions to work with group by v1, since the cardinality can be known, however wider multi-value expression support for group by v1 is not available (or really my priority).

This PR has also uncovered some inconsistencies in the null value handling for multi-value expressions, specifically the part that coerces null, [], and [null] all into [null], which creates some minor differences between using the SingleStringInputDimensionSelector to "map" an expression across the row values compared to using a map expression. Ideally I will modify the coercion to not convert [] to [null], which seems to be the problematic issue, but it creates mass of failing tests in the ExpressionFilterTest due to differences in behavior in the underlying selectors, so I intend to investigate this deeper in follow-up work The difference in behavior is apparent in the modified calcite test, where [] is grouped as null with default null handling behavior instead of "". This is likely related or at least similar to #5897 and #4915.


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths.

@clintropolis clintropolis added this to the 0.16.0 milestone Jul 30, 2019
@clintropolis
Copy link
Copy Markdown
Member Author

Tagged for milestone 0.16.0 since this should be included in the same release with the rest of the multi-value expression support, to avoid a future behavior change on single input multi-value expressions.

*/
default boolean hasArrayOutput()
{
return false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a function might sometimes return an array and might sometimes return a scalar? Is that:

  • Not allowed (functions should not differ in their return types in this way)
  • Allowed, and hasArrayOutput should be true
  • Allowed, and hasArrayOutput should be false

Please adjust the javadocs appropriately.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be allowed imo, but I think we'll have to discuss this further when we get around to adding full type inference to the expression system. I will adjust the javadoc to state it should not be done currently.

public void testGroupByWithStringVirtualColumn()
{
// Cannot vectorize due to virtual columns.
// all virtual columns are single input column, so it will work for group by v1, even with multi-value inputs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🆒 it's quite nice that this test can pass again.

/**
* Returns true if the expression has any array inputs. Note that in some cases, this can be true and
* {@link BindingDetails#getArrayVariables} can be empty, since the latter can only shallowly collect identifiers
* that are explicitly used as arrays.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I turned this comment over a few times in my mind but don't quite understand it. What does "shallowly" mean in this context and what is "deeper" about hasInputArrays?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distinction is that getArrayVariables returns the identifiers we were able to explicitly detect as arrays, where as hasInputArrays means that an expression expects some number of array inputs. Expr.analyzeInputs implementations can currently only correctly categorize arguments that are directly identifiers.

Now, the reason we need to be able to distinguish is an issue because I also accept slop expressions. For example:

array_to_string(concat('foo', tags), ', ')

where tags is a multi-value dimension, the expression bindings will list tags as a scalar input, but array_to_string does in fact take an array argument. By distinguishing these I can still accept this expression and translate it correctly to:

array_to_string(map((tags) -> concat('foo', tags), tags), ',')

and not incorrectly use this optimized single input column introduced in this PR. This was not an issue prior to this PR because being multi-valued in the capabilities was reason enough to exclude an expression from the single column optimization, but after the restriction only applies to things that expects array inputs or produce array outputs.

Does that make sense? I will try to clear up the javadocs for this.

scalarVariables,
ImmutableSet.copyOf(Sets.union(arrayVariables, arrayIdentifiers))
ImmutableSet.copyOf(Sets.union(arrayVariables, arrayIdentifiers)),
hasInputArrays || arrayArguments.size() > 0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for arrayArguments.size() to be greater than zero, but hasInputArrays to be false? If so how? (Related to previous comment, I think)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, this can probably just check hasInputArrays, but I will confirm by removing the size check and seeing what happens to the tests

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh i wasn't paying attention to which function this was, it is possible in this specific context and it's unrelated to the other thing.

Since this is a sort of 'builder' method it is hypothetically possible for this method to be called when adding the initial set of array columns on a binding details, without explicitly calling withArrayInputs. This || then ensures that if even if you forgot to call withArrayInputs the binding details being constructed are correct, becase the hasInputArrays in this case is the value of the existing binding before the array arguments are added.


lambdaBindingDetails = lambdaExpr.analyzeInputs();
bindingDetails = accumulator.with(lambdaBindingDetails).withArrayArguments(function.getArrayInputs(argsExpr));
boolean isArrayOutput = function.hasArrayOutput() || (function instanceof ApplyFunction.BaseFoldFunction
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using instanceof here seems like a code smell -- is there a nice way to add a method to ApplyFunction that gets the necessary info out?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? Will have a look

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified ApplyFunction.hasArrayOutput to now take the LambaExpr as an argument so BaseFoldFunction can just override hasArrayOutput to examine the lambda expression to see if it's output is array typed.

}

/**
* Returns true if a function takes an array arguments
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. "takes array arguments" or "takes an array argument" (grammar, and also, I'm not sure which one you mean; the two phrases mean different things)
  2. Clarify intent: what does "takes an array arguments" mean? That it might take array arguments? Or that it must take array arguments?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the typo is that it's supposed to be any array arguments heh, which I think clears up both things if I fix?

}

/**
* Returns true if function produces an array
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it sometimes produces an array, and sometimes doesn't, based on its input type? Similar comment to ApplyFunction.hasArrayOutput: is it disallowed, or should it return true/false?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment, will adjust docs to mention it shouldn't be done, but i don't have a way to really enforce it yet, need type inference i think

}
assertFilterMatchesSkipVectorize(edf("dim4 == '1'"), ImmutableList.of("0"));
assertFilterMatchesSkipVectorize(edf("dim4 == '3'"), ImmutableList.of("3"));
assertFilterMatchesSkipVectorize(edf("dim4 == '4'"), ImmutableList.of("4", "5"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why'd this change?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I .. don't really remember, I'm going to go with the reason was for fun, to add another test that actually has multiple matches in the case where sql compatible null handling is enabled 😜

&& capabilities.isComplete()
&& !capabilities.hasMultipleValues()
&& !exprDetails.getArrayColumns().contains(column)) {
&& exprDetails.getArrayColumns().size() == 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isEmpty()?

&& !exprDetails.isOutputArray()
&& (!capabilities.hasMultipleValues() || exprDetails.getFreeVariables().size() == 1)
) {
// Optimization for dimension selectors that wrap a single underlying string column.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag chain above is getting complicated!

Could you please alter this comment to reflect what it's looking for? If I understand it right, that should be something like:

Optimization for dimension selectors that wrap a single underlying string column. The string column can be multi-valued, but if so, it must be implicitly mappable (i.e. the expression is not treating it as an array, not wanting to output an array, and the multi-value dimension appears exactly once).

By the way, I have to assume logic like this is being used elsewhere to implicitly map expressions that go through the non-optimized selector. Is there any opportunity to extract the logic into a common helper method? It would make it easier for readers and probably less error prone too.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is only used in these 2 places in ExpressionSelectors I think, and it's slightly different between the uses because SingleStringInputDimensionSelector supports multi-value inputs, but SingleStringInputCachingExpressionColumnValueSelector does not. I'll play around and see what I can do, or at least comment what's going on heavily.

public void validateArguments(List<Expr> args)
{
return ImmutableSet.copyOf(args);
if (!(args.size() > 0)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could simplify to:

if (args.isEmpty()) {

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for the clarifications.

@gianm gianm merged commit e7c6dea into apache:master Aug 2, 2019
@clintropolis clintropolis deleted the expr-array-output branch August 2, 2019 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants