Expressions work better with strings. #4394

Merged
gianm merged 5 commits into apache:master from gianm:se-expr-strings on Jun 14, 2017

Contributor

gianm commented Jun 12, 2017

  • ExpressionObjectSelector able to read from string columns, and able to
    return strings.
  • ExpressionVirtualColumn able to offer string (and long for that matter)
    as its native type.
  • ExpressionPostAggregator able to return strings.
  • groupBy, topN: Allow post-aggregators to accept dimensions as inputs,
    making ExpressionPostAggregator more useful.
  • topN: Use DimExtractionTopNAlgorithm for STRING columns that do not
    have dictionaries, allowing it to work with STRING-type expression
    virtual columns.
  • Adjusts null handling to better match the rest of Druid: null and
    empty string treated the same; nulls implicitly treated as zeroes in
    numeric context.

}

final Integer theInt = Ints.tryParse(value);
return theInt == null ? 0 : theInt;
Member

Why consume parse errors and turn them into nulls?

Contributor Author

I thought it would be more useful to treat unparseable strings as zeroes rather than to throw errors. I don't want one bad string in a column to make the entire query fail. It's similar to the behavior of Druid's "numeric" comparator on string columns, where unparseable numbers are treated as less than all other numbers, not as errors.
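To illustrate the lenient-parsing behavior described in this reply, here is a minimal sketch using only the JDK. `LenientParse` and `parseIntOrZero` are hypothetical names, and `Integer.parseInt` is a stand-in for Guava's `Ints.tryParse` from the diff; the two may differ on edge-case inputs.

```java
// Sketch only: a JDK-only stand-in for the Ints.tryParse pattern in the diff.
// LenientParse and parseIntOrZero are hypothetical names, not Druid code.
final class LenientParse
{
  // Returns 0 for null, empty, or unparseable input instead of throwing.
  static int parseIntOrZero(String value)
  {
    if (value == null || value.isEmpty()) {
      return 0;
    }
    try {
      return Integer.parseInt(value);
    }
    catch (NumberFormatException e) {
      // One bad string should not fail the whole computation.
      return 0;
    }
  }
}
```

With this helper, `parseIntOrZero("42")` gives 42, while `parseIntOrZero("abc")` and `parseIntOrZero(null)` both give 0.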

return number;
}
};
return bindings::get;
Member

Why ignore the situation when there is no binding?

Contributor Author (gianm, Jun 12, 2017)

Because rather than treating expressions referencing missing columns as errors, I'd like to treat them as if they were reading nulls, since that's consistent with what Druid generally does when you ask it to read columns that don't exist.

Just doing bindings::get achieves that, since bindings.get(nonexistentColumnName) == null.
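A small sketch of why a bare method reference is enough here, assuming the bindings are backed by a `java.util.Map` (the diff does not show the actual type). `MissingBindingSketch` and `bindingsOf` are hypothetical names.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch only: assumes Map-backed bindings; names are hypothetical.
final class MissingBindingSketch
{
  // Map.get returns null for absent keys, so a missing column
  // reads as null without any explicit "no binding" branch.
  static Function<String, Object> bindingsOf(Map<String, Object> row)
  {
    return row::get;
  }
}
```

Looking up a name that is not in the map returns null rather than throwing, which is the behavior the reply relies on.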

Member

There is a tradeoff between "ease of use" and the problems from consuming client errors. Maybe it should be configurable?

Contributor Author (gianm, Jun 12, 2017)

I don't think this needs to be configurable for expressions, since the behavior of treating nonexistent columns as having some "default" value is very ingrained in Druid -- not just in the expressions but throughout the system in general.

I guess you could argue that you want it to be configurable throughout Druid, but that is out of scope for this PR, imo. In this one I'm just trying to make Druid expression behavior better match what the rest of Druid already does.

interface ObjectBinding
{
- Number get(String name);
+ Object get(String name);
Member

Contributor Author

Annotated with @Nullable.

};
private static final Comparator<Comparable> DEFAULT_COMPARATOR = Comparator.nullsFirst(
(Comparable o1, Comparable o2) -> {
if (o1 instanceof Long && o2 instanceof Long) {
Member

Assuming null is treated as 0 for Numbers, it shouldn't be just Comparator.nullsFirst(), because some numbers (the negative ones) are less than null = 0.

Contributor Author

I guess Druid isn't really consistent here. Sometimes nulls are treated as zeros (like when reading from nonexistent columns) and sometimes they are treated as less than any other number (like in most comparators).

My sense is that most comparators in Druid treat nulls as less than any other number (and actually return them to the user as null not as 0) so I am inclined to keep this the way it is.
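The disagreement in this thread can be made concrete with a short sketch. It uses the JDK's `Comparator.nullsFirst` over `Long` values; `NullsFirstSketch` and `sorted` are hypothetical names and this is not the comparator from the PR.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: shows where nulls land under a nulls-first ordering.
final class NullsFirstSketch
{
  static final Comparator<Long> NULLS_FIRST =
      Comparator.nullsFirst(Comparator.<Long>naturalOrder());

  // Returns a sorted copy; the input array is not modified.
  static List<Long> sorted(Long... values)
  {
    Long[] copy = values.clone();
    Arrays.sort(copy, NULLS_FIRST);
    return Arrays.asList(copy);
  }
}
```

`sorted(3L, null, -5L, 0L)` yields `[null, -5, 0, 3]`: the null sorts before -5, whereas a null-as-zero reading would place it next to 0. That gap is the inconsistency the reviewer points at and the author chooses to keep.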

}

- public static enum Ordering implements Comparator<Number>
+ public enum Ordering implements Comparator<Comparable>
Member

Why enum?

Contributor Author

I'm not sure. It was already an enum and I didn't change it.

public enum Ordering implements Comparator<Comparable>
{
// ensures the following order: numeric > NaN > Infinite
numericFirst {
Member

Seems to be unused.

Contributor Author

It can be used via an Ordering.valueOf(ordering) in the constructor.
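For readers unfamiliar with the pattern, here is a minimal sketch of an enum whose constants carry their own `compare` implementation and are looked up by name at runtime. `OrderingSketch` and its constants are hypothetical; the real enum in the PR compares `Comparable` values and has different constants.

```java
import java.util.Comparator;

// Sketch only: constant-specific bodies selected by name via valueOf.
enum OrderingSketch implements Comparator<Long>
{
  ascending {
    @Override
    public int compare(Long a, Long b)
    {
      return Long.compare(a, b);
    }
  },
  descending {
    @Override
    public int compare(Long a, Long b)
    {
      return Long.compare(b, a);
    }
  };
}
```

A call such as `OrderingSketch.valueOf("descending")` resolves the constant from a string, so the constant never appears by name in the calling code. That is why static analysis can report it as unused. An unknown name makes `valueOf` throw `IllegalArgumentException`.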

Member

It's too clever. I suggest changing it to a NUMERIC_FIRST constant with an anonymous implementation, and setting it in the constructor if the given ordering is "numericFirst".

Also, this ordering is not null-proof, unlike the default comparator. It will throw NPE.

Contributor Author

The ordering is null-proof. If either lhs or rhs is null then the final else branch will be taken, which uses Comparators.naturalNullsFirst().

I don't really have strong feelings about how the constructor should be implemented, but I don't want to change it in this PR, since it's not relevant to the main purpose. NB: If you (or anyone else) end up reading this and doing another patch to change it, a very similar construction is used in ArithmeticPostAggregator.

I added a comment though explaining how the constructor works.

};
} else {
// No numbers or strings.
return null;
Member

Should it be a Supplier of null, not just null?

Contributor Author (gianm, Jun 12, 2017)

It doesn't really matter, since a null supplier is treated equivalently to a supplier that always returns null.

Member

leventov left a comment

Also, in classes like BitLtExpr, evalDouble() uses the < operator. Double.compare() < 0 is better because it treats NaNs consistently.
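The difference between the two forms shows up only for NaN (and signed zero). A minimal sketch, with hypothetical names `NanCompareSketch`, `ltOperator`, and `ltCompare`:

```java
// Sketch only: contrasts the primitive < operator with Double.compare.
final class NanCompareSketch
{
  // Primitive comparison: any comparison involving NaN is false.
  static boolean ltOperator(double left, double right)
  {
    return left < right;
  }

  // Double.compare imposes a total order in which NaN is greater than
  // every other value, including positive infinity.
  static boolean ltCompare(double left, double right)
  {
    return Double.compare(left, right) < 0;
  }
}
```

`ltOperator(1.0, Double.NaN)` and `ltOperator(Double.NaN, 1.0)` are both false, so NaN is neither less than nor greater than 1.0. `ltCompare(1.0, Double.NaN)` is true and `ltCompare(Double.NaN, 1.0)` is false, which gives NaN one fixed position in the order.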

Contributor

fjy commented Jun 12, 2017

👍

}

@Override
public int hashCode()
Member

Shouldn't getCacheKey() also be updated?

Contributor Author

Yes, good catch. Updated it.

protected final double evalDouble(double left, double right)
{
- return Evals.asDouble(left < right);
+ return Evals.asDouble(Double.compare(left, right) < 0);
Member

Could you please add a comment here and in similar places, so that later readers are not puzzled about why the "long" form was chosen over the "simple" left < right?

Contributor Author

Yes, I'll add comments.

Contributor

jihoonson left a comment

Looks good to me overall!

private DoubleExprEval(Number value)
{
- super(value);
+ super(Preconditions.checkNotNull(value, "value"));
Contributor

I'm not sure about this. DoubleExpr.getLiteralValue() is nullable, which means its value can be null. Also, why is null checking needed here instead of treating nulls as 0s?

Contributor Author (gianm, Jun 13, 2017)

The value can't actually be null. To make the behavior more consistent with Druid null handling in general, a null value is always going to be a StringExprEval. (See that the constructors for DoubleExprEval and LongExprEval have a non-null check for value.) I'll remove the null checks in toExpr, since they are pointless given that value cannot be null.

Contributor

I see. Then, could you remove the @Nullable annotations on DoubleExpr.getLiteralValue() and LongExpr.getLiteralValue()?

Contributor Author (gianm, Jun 14, 2017)

Yes, ok. I removed those and added non-null checks to the constructors to be defensive.
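A defensive constructor check of the kind described can be sketched with the JDK alone. `NonNullNumberHolder` is a hypothetical class, and `Objects.requireNonNull` stands in for Guava's `Preconditions.checkNotNull` used in the diff.

```java
import java.util.Objects;

// Sketch only: fail fast at construction so later reads need no null checks.
final class NonNullNumberHolder
{
  private final Number value;

  NonNullNumberHolder(Number value)
  {
    // Throws NullPointerException with the message "value" if value is null.
    this.value = Objects.requireNonNull(value, "value");
  }

  double asDouble()
  {
    // Safe without a null check because the constructor rejected nulls.
    return value.doubleValue();
  }
}
```

Rejecting null in the constructor is what allows the accessor to drop its @Nullable annotation: callers can rely on a non-null value from then on.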

private LongExprEval(Number value)
{
- super(value);
+ super(Preconditions.checkNotNull(value, "value"));
Contributor

I'm not sure about this. LongExpr.getLiteralValue() is nullable, which means its value can be null. Also, why is null checking needed here instead of treating nulls as 0s?

Contributor Author

Same response as #4394 (comment).

}

public static List<PostAggregator> prepareAggregations(
List<String> otherOutputNames,
Contributor

otherOutputNames looks quite broad. Docs will be useful to understand and use this method.

Contributor Author

I'll add javadocs.

public final boolean isNull()
{
- return Strings.isNullOrEmpty(value);
+ return value == null;
Contributor

This method is the same as the overridden method and thus can be removed.

Contributor Author

Thanks, removed.

Contributor Author

gianm commented Jun 13, 2017

@leventov, @jihoonson, thanks for reviewing. I pushed a new update.

Contributor

b-slim commented Jun 13, 2017

@gianm I think there are some tests that need fixes:

GroupByQueryRunnerTest.testGroupByWithOutputNameCollisions[config=v2SmallDictionary, runner=noRollupRtIndex] 

Expected: (an instance of java.lang.IllegalArgumentException and exception with message a string containing "Duplicate output name[alias]")

     but: exception with message a string containing "Duplicate output name[alias]" message was "[alias] already defined"

Stacktrace was: java.lang.IllegalArgumentException: [alias] already defined

Contributor Author

gianm commented Jun 13, 2017

@b-slim ah, yeah, one of the error messages changed. I just pushed a fix for that, thanks.

Contributor

fjy commented Jun 13, 2017

@b-slim any more comments?

Contributor Author

gianm commented Jun 14, 2017

@fjy are you ok with the design here?

Contributor

fjy commented Jun 14, 2017

yeah +1

Contributor Author

gianm commented Jun 14, 2017

Ok, I think it's mergeable then, since we have 3 +1s and one of the reviewers looked at the code. Thanks everyone.

@gianm gianm merged commit 6edee7f into apache:master Jun 14, 2017
@gianm gianm deleted the se-expr-strings branch June 14, 2017 21:50
* @throws IllegalArgumentException if there are any output name collisions or missing post-aggregator inputs
*/
public static List<PostAggregator> prepareAggregations(
List<String> otherOutputNames,
Member

Is Queries considered public API? I ask because this is a breaking change. E.g., there is a usage of this method in our extensions.

Contributor Author

I'm not sure, good question. I guess we might as well consider it public, since it's likely that people will have copied it from builtin query impls.

Contributor Author

@gianm gianm added this to the 0.10.2 milestone Jun 20, 2017
gianm added a commit to gianm/druid that referenced this pull request Jun 21, 2017
@leventov leventov added this to the 0.10.1 milestone Jun 21, 2017
@leventov leventov removed this from the 0.10.2 milestone Jun 21, 2017
b-slim pushed a commit that referenced this pull request Jun 21, 2017
gianm added a commit to gianm/druid that referenced this pull request Jun 21, 2017
gianm added a commit that referenced this pull request Jun 21, 2017