Expressions work better with strings.#4394
Conversation
- ExpressionObjectSelector able to read from string columns, and able to return strings. - ExpressionVirtualColumn able to offer string (and long for that matter) as its native type. - ExpressionPostAggregator able to return strings. - groupBy, topN: Allow post-aggregators to accept dimensions as inputs, making ExpressionPostAggregator more useful. - topN: Use DimExtractionTopNAlgorithm for STRING columns that do not have dictionaries, allowing it to work with STRING-type expression virtual columns. - Adjusts null handling to better match the rest of Druid: null and empty string treated the same; nulls implicitly treated as zeroes in numeric context.
| } | ||
|
|
||
| final Integer theInt = Ints.tryParse(value); | ||
| return theInt == null ? 0 : theInt; |
There was a problem hiding this comment.
Why consume parse errors and turn them into nulls?
There was a problem hiding this comment.
I thought it would be more useful to treat unparseable strings as zeroes rather than to throw errors. I don't want one bad string in a column to make the entire query fail. It's similar to the behavior of Druid's "numeric" comparator on string columns, where unparseable numbers as less than all other numbers, not as errors.
| return number; | ||
| } | ||
| }; | ||
| return bindings::get; |
There was a problem hiding this comment.
Why ignore situation when there is no binding?
There was a problem hiding this comment.
Because rather than treating expressions referencing missing columns as errors, I'd like to treat them as if they were reading nulls, since that's consistent with what Druid generally does when you ask it to read columns that don't exist.
Just doing bindings::get achieves that since bindings.get(nonexistentColumnName) == null
There was a problem hiding this comment.
There is a tradeoff between "ease of use" and the problems from consuming client errors. Maybe it should be a configurable?
There was a problem hiding this comment.
I don't think this needs to be configurable for expressions, since the behavior of treating nonexistent columns as having some "default" value is very ingrained in Druid -- not just in the expressions but throughout the system in general.
I guess you could argue that you want it to be configurable throughout Druid, but that is out of scope for this PR, imo. In this one I'm just trying to make Druid expression behavior better match what the rest of Druid already does.
| interface ObjectBinding | ||
| { | ||
| Number get(String name); | ||
| Object get(String name); |
There was a problem hiding this comment.
@Nullable? Assuming motivation here: https://github.com/druid-io/druid/pull/4394/files#diff-3eafb27b343481e2d3d14991760b6b61R151 is similar to #4365 (comment)
There was a problem hiding this comment.
Annotated with @Nullable.
| }; | ||
| private static final Comparator<Comparable> DEFAULT_COMPARATOR = Comparator.nullsFirst( | ||
| (Comparable o1, Comparable o2) -> { | ||
| if (o1 instanceof Long && o2 instanceof Long) { |
There was a problem hiding this comment.
Assuming null is treated as 0 for Numbers, it shouldn't be just Comparator.nullsFirst(), because some numbers may be less than null = 0
There was a problem hiding this comment.
I guess Druid isn't really consistent here. Sometimes nulls are treated as zeros (like when reading from nonexistent columns) and sometimes they are treated as less than any other number (like in most comparators).
My sense is that most comparators in Druid treat nulls as less than any other number (and actually return them to the user as null not as 0) so I am inclined to keep this the way it is.
| } | ||
|
|
||
| public static enum Ordering implements Comparator<Number> | ||
| public enum Ordering implements Comparator<Comparable> |
There was a problem hiding this comment.
I'm not sure. It was already an enum and I didn't change it.
| public enum Ordering implements Comparator<Comparable> | ||
| { | ||
| // ensures the following order: numeric > NaN > Infinite | ||
| numericFirst { |
There was a problem hiding this comment.
It can be used via an Ordering.valueOf(ordering) in the constructor.
There was a problem hiding this comment.
It's too smart, suggested to change it to NUMERIC_FIRST constant with anonymous impl, and set it in constructor if the given ordering is "numericFirst".
Also, this ordering is not null-proof, unlike the default comparator. It will throw NPE.
There was a problem hiding this comment.
The ordering is null-proof. If either lhs or rhs is null then the final else branch will be taken, which uses Comparators.naturalNullsFirst().
I don't really have strong feelings about how the constructor should be implemented, but I don't want to change it in this PR, since it's not relevant to the main purpose. NB: If you (or anyone else) ends up reading this and doing another patch to change it, a very similar construction is used in ArithmeticPostAggregator.
I added a comment though explaining how the constructor works.
| }; | ||
| } else { | ||
| // No numbers or strings. | ||
| return null; |
There was a problem hiding this comment.
Should it be Supplier of null, not just null?
There was a problem hiding this comment.
It doesn't really matter, since a null supplier is treated equivalently to a supplier that always returns null.
leventov
left a comment
There was a problem hiding this comment.
Also in classes like BitLtExpr, in evalDouble() < operator is used, Double.compare() < 0 is better because it treats NaNs consistently.
|
👍 |
| } | ||
|
|
||
| @Override | ||
| public int hashCode() |
There was a problem hiding this comment.
Shouldn't getCacheKey() also be updated?
There was a problem hiding this comment.
Yes, good catch. Updated it.
| protected final double evalDouble(double left, double right) | ||
| { | ||
| return Evals.asDouble(left < right); | ||
| return Evals.asDouble(Double.compare(left, right) < 0); |
There was a problem hiding this comment.
Could you please comment on this and similar places, not to amuse later readers why such "long" form is chosen over "simple" left < right.
There was a problem hiding this comment.
Yes, I'll add comments.
jihoonson
left a comment
There was a problem hiding this comment.
Looks good to me overall!
| private DoubleExprEval(Number value) | ||
| { | ||
| super(value); | ||
| super(Preconditions.checkNotNull(value, "value")); |
There was a problem hiding this comment.
I'm not sure about this. DoubleExpr.getLiteralValue() is nullable, which means its value can be null. Also, why is null checking needed here instead of treating nulls as 0s?
There was a problem hiding this comment.
The value can't actually be null. For reasons of trying to make the behavior more consistent with Druid null handling in general, null value is always going to be a StringExprEval. (See that the constructors for DoubleExprEval and LongExprEval have a non-null check for value). I'll remove the null checks in toExpr, since they are pointless, given value cannot be null.
There was a problem hiding this comment.
I see. Then, would you remove @Nullable annotations at DoubleExpr.getLiteralValue() and LongExpr.getLiteralValue()?
There was a problem hiding this comment.
Yes, ok. I removed those and added non-null checks to the constructors to be defensive.
| private LongExprEval(Number value) | ||
| { | ||
| super(value); | ||
| super(Preconditions.checkNotNull(value, "value")); |
There was a problem hiding this comment.
I'm not sure about this. LongExpr.getLiteralValue() is nullable, which means its value can be null. Also, why is null checking needed here instead of treating nulls as 0s?
| } | ||
|
|
||
| public static List<PostAggregator> prepareAggregations( | ||
| List<String> otherOutputNames, |
There was a problem hiding this comment.
otherOutputNames looks quite broad. Docs will be useful to understand and use this method.
| public final boolean isNull() | ||
| { | ||
| return Strings.isNullOrEmpty(value); | ||
| return value == null; |
There was a problem hiding this comment.
This method is same with the overridden method and thus can be removed.
|
@leventov, @jihoonson, thanks for reviewing. I pushed a new update. |
|
@gianm i think there is some tests that needs some fixes GroupByQueryRunnerTest.testGroupByWithOutputNameCollisions[config=v2SmallDictionary, runner=noRollupRtIndex]
Expected: (an instance of java.lang.IllegalArgumentException and exception with message a string containing "Duplicate output name[alias]")
but: exception with message a string containing "Duplicate output name[alias]" message was "[alias] already defined"
Stacktrace was: java.lang.IllegalArgumentException: [alias] already defined |
|
@b-slim ah, yeah, one of the error messages changed. I just pushed a fix for that, thanks. |
|
@b-slim any more comments? |
|
@fjy are you ok with the design here? |
|
yeah +1 |
|
Ok, I think it's mergeable then, since we have 3 +1s and one of the reviewers looked at the code. Thanks everyone. |
| * @throws IllegalArgumentException if there are any output name collisions or missing post-aggregator inputs | ||
| */ | ||
| public static List<PostAggregator> prepareAggregations( | ||
| List<String> otherOutputNames, |
There was a problem hiding this comment.
Is Queries considered public API? Because this is a breaking change. E. g. there is a usage of this method in our extensions.
There was a problem hiding this comment.
I'm not sure, good question. I guess we might as well consider it public, since it's likely that people will have copied it from builtin query impls.
For backwards compatibility, post-apache#4394.
For backwards compatibility, post-#4394.
For backwards compatibility, post-apache#4394.
return strings.
as its native type.
making ExpressionPostAggregator more useful.
have dictionaries, allowing it to work with STRING-type expression
virtual columns.
empty string treated the same; nulls implicitly treated as zeroes in
numeric context.