Expressions work better with strings. by gianm · Pull Request #4394 · apache/druid

gianm · 2017-06-12T03:55:32Z

- ExpressionObjectSelector able to read from string columns, and able to return strings.
- ExpressionVirtualColumn able to offer string (and long for that matter) as its native type.
- ExpressionPostAggregator able to return strings.
- groupBy, topN: Allow post-aggregators to accept dimensions as inputs, making ExpressionPostAggregator more useful.
- topN: Use DimExtractionTopNAlgorithm for STRING columns that do not have dictionaries, allowing it to work with STRING-type expression virtual columns.
- Adjusts null handling to better match the rest of Druid: null and empty string treated the same; nulls implicitly treated as zeroes in numeric context.

leventov · 2017-06-12T20:52:55Z

+      }
+
+      final Integer theInt = Ints.tryParse(value);
+      return theInt == null ? 0 : theInt;


Why consume parse errors and turn them into nulls?

I thought it would be more useful to treat unparseable strings as zeroes rather than to throw errors. I don't want one bad string in a column to make the entire query fail. It's similar to the behavior of Druid's "numeric" comparator on string columns, where unparseable numbers are treated as less than all other numbers, not as errors.
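To make the trade-off concrete, here is a minimal sketch of the "unparseable means zero" behavior. It is an illustration only: the class and method names are hypothetical, and it uses the standard library's `Integer.parseInt` in place of the Guava `Ints.tryParse` call shown in the diff, so edge cases may differ from the PR's code.

```java
// Hypothetical helper, not part of the PR: lenient string-to-int conversion.
final class LenientInts
{
  private LenientInts() {}

  // Returns the parsed int, or 0 when the input is null, empty, or not a valid int.
  static int parseOrZero(String value)
  {
    if (value == null || value.isEmpty()) {
      return 0;
    }
    try {
      return Integer.parseInt(value);
    }
    catch (NumberFormatException e) {
      // One bad string should not fail the whole query, so it reads as zero.
      return 0;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(parseOrZero("42"));           // prints 42
    System.out.println(parseOrZero("not a number")); // prints 0
    System.out.println(parseOrZero(null));           // prints 0
  }
}
```

The cost of this choice is the one the reviewer raises: a genuine zero and a malformed value become indistinguishable downstream.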

leventov · 2017-06-12T20:55:34Z

-        return number;
-      }
-    };
+    return bindings::get;


Why ignore the situation when there is no binding?

Because rather than treating expressions referencing missing columns as errors, I'd like to treat them as if they were reading nulls, since that's consistent with what Druid generally does when you ask it to read columns that don't exist.

Just doing bindings::get achieves that since bindings.get(nonexistentColumnName) == null
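A small sketch of why `bindings::get` is enough, assuming the bindings are held in a `java.util.Map`. The `ObjectBinding` interface below is a stand-in copied from the shape in the diff, and the wrapper class and `withMap` name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class BindingsSketch
{
  // Stand-in for the ObjectBinding interface shown in the diff.
  interface ObjectBinding
  {
    Object get(String name);
  }

  // Map.get returns null for absent keys, so a missing column reads as null
  // instead of raising an error.
  static ObjectBinding withMap(Map<String, Object> bindings)
  {
    return bindings::get;
  }

  public static void main(String[] args)
  {
    Map<String, Object> columns = new HashMap<>();
    columns.put("x", 3L);
    ObjectBinding binding = withMap(columns);
    System.out.println(binding.get("x"));       // prints 3
    System.out.println(binding.get("missing")); // prints null
  }
}
```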

There is a tradeoff between "ease of use" and the problems from consuming client errors. Maybe it should be configurable?

I don't think this needs to be configurable for expressions, since the behavior of treating nonexistent columns as having some "default" value is very ingrained in Druid -- not just in the expressions but throughout the system in general.

I guess you could argue that you want it to be configurable throughout Druid, but that is out of scope for this PR, imo. In this one I'm just trying to make Druid expression behavior better match what the rest of Druid already does.

leventov · 2017-06-12T20:57:36Z

  interface ObjectBinding
  {
-    Number get(String name);
+    Object get(String name);


@Nullable? Assuming motivation here: https://github.com/druid-io/druid/pull/4394/files#diff-3eafb27b343481e2d3d14991760b6b61R151 is similar to #4365 (comment)

Annotated with @Nullable.

leventov · 2017-06-12T20:59:51Z

-  };
+  private static final Comparator<Comparable> DEFAULT_COMPARATOR = Comparator.nullsFirst(
+      (Comparable o1, Comparable o2) -> {
+        if (o1 instanceof Long && o2 instanceof Long) {


Assuming null is treated as 0 for Numbers, it shouldn't be just Comparator.nullsFirst(), because some numbers may be less than null = 0

I guess Druid isn't really consistent here. Sometimes nulls are treated as zeros (like when reading from nonexistent columns) and sometimes they are treated as less than any other number (like in most comparators).

My sense is that most comparators in Druid treat nulls as less than any other number (and actually return them to the user as null not as 0) so I am inclined to keep this the way it is.
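The disagreement is easy to see with a tiny example. This sketch uses only `java.util.Comparator` and a hypothetical class name; it is not the PR's `DEFAULT_COMPARATOR`, just the same `nullsFirst` idea applied to `Long` values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NullsFirstSketch
{
  // Sorts a copy of the input with nulls ordered before every non-null value.
  static List<Long> sortNullsFirst(List<Long> values)
  {
    List<Long> sorted = new ArrayList<>(values);
    sorted.sort(Comparator.nullsFirst(Comparator.<Long>naturalOrder()));
    return sorted;
  }

  public static void main(String[] args)
  {
    // null lands before -3, even though "null as 0" would place it after -3.
    System.out.println(sortNullsFirst(Arrays.asList(5L, null, -3L, 0L)));
    // prints [null, -3, 0, 5]
  }
}
```

Under a "null equals 0" reading the null would sit between -3 and 0, which is the inconsistency the reviewer points out.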

leventov · 2017-06-12T21:00:51Z

  }

-  public static enum Ordering implements Comparator<Number>
+  public enum Ordering implements Comparator<Comparable>


I'm not sure. It was already an enum and I didn't change it.

leventov · 2017-06-12T21:03:11Z

+  public enum Ordering implements Comparator<Comparable>
  {
    // ensures the following order: numeric > NaN > Infinite
    numericFirst {


Seems to be unused.

It can be used via an Ordering.valueOf(ordering) in the constructor.

It's too clever. I suggest changing it to a NUMERIC_FIRST constant with an anonymous implementation, and setting it in the constructor if the given ordering is "numericFirst".

Also, this ordering is not null-proof, unlike the default comparator. It will throw NPE.

The ordering is null-proof. If either lhs or rhs is null then the final else branch will be taken, which uses Comparators.naturalNullsFirst().

I don't really have strong feelings about how the constructor should be implemented, but I don't want to change it in this PR, since it's not relevant to the main purpose. NB: If you (or anyone else) ends up reading this and doing another patch to change it, a very similar construction is used in ArithmeticPostAggregator.

I added a comment though explaining how the constructor works.
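For readers unfamiliar with the pattern under discussion, here is a rough sketch of an enum constant that looks unused but is reached through `valueOf`. Everything here is hypothetical: the class name, the `resolve` method, the use of `Double` as the element type, and the exact comparison rules, which only approximate the "numeric > NaN > Infinite" comment in the diff.

```java
import java.util.Comparator;

final class OrderingSketch
{
  // Fallback comparator: nulls sort before any non-null value.
  static final Comparator<Double> NULLS_FIRST =
      Comparator.nullsFirst(Comparator.<Double>naturalOrder());

  // Each enum constant is itself a Comparator.
  enum Ordering implements Comparator<Double>
  {
    numericFirst {
      @Override
      public int compare(Double lhs, Double rhs)
      {
        if (lhs != null && rhs != null) {
          boolean lhsFinite = Double.isFinite(lhs);
          boolean rhsFinite = Double.isFinite(rhs);
          if (lhsFinite != rhsFinite) {
            // Finite values rank above NaN and infinities.
            return lhsFinite ? 1 : -1;
          }
          return Double.compare(lhs, rhs);
        }
        // Final branch: any null input is handled by the null-safe comparator.
        return NULLS_FIRST.compare(lhs, rhs);
      }
    }
  }

  // The constant is never named directly; it is looked up from a string.
  static Comparator<Double> resolve(String ordering)
  {
    return ordering == null ? NULLS_FIRST : Ordering.valueOf(ordering);
  }
}
```

A call such as `OrderingSketch.resolve("numericFirst")` returns the enum constant, while an unknown name makes `valueOf` throw `IllegalArgumentException`. That implicit lookup is what makes static "unused" analysis misleading here.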

leventov · 2017-06-12T21:08:56Z

+      };
+    } else {
+      // No numbers or strings.
      return null;


Should it be Supplier of null, not just null?

It doesn't really matter, since a null supplier is treated equivalently to a supplier that always returns null.

leventov

Also, classes like BitLtExpr use the < operator in evalDouble(). Double.compare() < 0 is better because it treats NaNs consistently.

fjy · 2017-06-12T22:18:10Z

👍

leventov · 2017-06-12T22:20:28Z

  }

  @Override
  public int hashCode()


Shouldn't getCacheKey() also be updated?

Yes, good catch. Updated it.

leventov · 2017-06-12T23:42:19Z

  protected final double evalDouble(double left, double right)
  {
-    return Evals.asDouble(left < right);
+    return Evals.asDouble(Double.compare(left, right) < 0);


Could you please add a comment here and in similar places, so that later readers aren't puzzled about why the "long" form was chosen over the "simple" left < right?

Yes, I'll add comments.
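The difference between the two forms shows up only for NaN (and signed zero). The following sketch, with hypothetical class and method names, contrasts them using standard `double` semantics.

```java
final class NanCompareSketch
{
  // Primitive operator: every comparison involving NaN is false.
  static boolean ltOperator(double left, double right)
  {
    return left < right;
  }

  // Double.compare defines a total order in which NaN is greater than
  // every other value, including positive infinity.
  static boolean ltCompare(double left, double right)
  {
    return Double.compare(left, right) < 0;
  }

  public static void main(String[] args)
  {
    System.out.println(ltOperator(1.0, Double.NaN)); // prints false
    System.out.println(ltOperator(Double.NaN, 1.0)); // prints false
    System.out.println(ltCompare(1.0, Double.NaN));  // prints true
    System.out.println(ltCompare(Double.NaN, 1.0));  // prints false
  }
}
```

With the operator form, neither `a < b` nor `b < a` holds when one side is NaN, so results depend on argument order in surprising ways. `Double.compare` also orders `-0.0` before `0.0`, which the operator does not.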

jihoonson

Looks good to me overall!

jihoonson · 2017-06-13T00:44:56Z

    private DoubleExprEval(Number value)
    {
-      super(value);
+      super(Preconditions.checkNotNull(value, "value"));


I'm not sure about this. DoubleExpr.getLiteralValue() is nullable, which means its value can be null. Also, why is null checking needed here instead of treating nulls as 0s?

The value can't actually be null. For reasons of trying to make the behavior more consistent with Druid null handling in general, null value is always going to be a StringExprEval. (See that the constructors for DoubleExprEval and LongExprEval have a non-null check for value). I'll remove the null checks in toExpr, since they are pointless, given value cannot be null.

I see. Then, would you remove @Nullable annotations at DoubleExpr.getLiteralValue() and LongExpr.getLiteralValue()?

Yes, ok. I removed those and added non-null checks to the constructors to be defensive.
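As a rough illustration of the invariant described in this thread (numeric evals never hold null; null is represented by the string eval), consider the sketch below. The class names and the `ofDouble` factory are hypothetical simplifications, and `Objects.requireNonNull` stands in for the Guava `Preconditions.checkNotNull` used in the diff.

```java
import java.util.Objects;

final class ExprEvalSketch
{
  abstract static class Eval
  {
    final Object value;

    Eval(Object value)
    {
      this.value = value;
    }

    boolean isNull()
    {
      return value == null;
    }
  }

  static final class DoubleEval extends Eval
  {
    DoubleEval(Number value)
    {
      // Defensive check: a numeric eval must never wrap null.
      super(Objects.requireNonNull(value, "value"));
    }
  }

  static final class StringEval extends Eval
  {
    StringEval(String value)
    {
      super(value);
    }
  }

  // Hypothetical factory: null input is routed to the string eval.
  static Eval ofDouble(Number value)
  {
    return value == null ? new StringEval(null) : new DoubleEval(value);
  }
}
```

With this shape, `isNull()` needs no override in the numeric class, since its value can never be null.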

jihoonson · 2017-06-13T00:45:15Z

    private LongExprEval(Number value)
    {
-      super(value);
+      super(Preconditions.checkNotNull(value, "value"));


I'm not sure about this. LongExpr.getLiteralValue() is nullable, which means its value can be null. Also, why is null checking needed here instead of treating nulls as 0s?

Same response as #4394 (comment).

jihoonson · 2017-06-13T00:53:06Z

  }

  public static List<PostAggregator> prepareAggregations(
+      List<String> otherOutputNames,


otherOutputNames looks quite broad. Docs will be useful to understand and use this method.

I'll add javadocs.

jihoonson · 2017-06-13T01:28:10Z

    public final boolean isNull()
    {
-      return Strings.isNullOrEmpty(value);
+      return value == null;


This method is the same as the overridden method and thus can be removed.

Thanks, removed.

gianm · 2017-06-13T15:23:38Z

@leventov, @jihoonson, thanks for reviewing. I pushed a new update.

b-slim · 2017-06-13T20:15:09Z

@gianm I think there are some tests that need fixes:

GroupByQueryRunnerTest.testGroupByWithOutputNameCollisions[config=v2SmallDictionary, runner=noRollupRtIndex] 

Expected: (an instance of java.lang.IllegalArgumentException and exception with message a string containing "Duplicate output name[alias]")

     but: exception with message a string containing "Duplicate output name[alias]" message was "[alias] already defined"

Stacktrace was: java.lang.IllegalArgumentException: [alias] already defined

gianm · 2017-06-13T21:16:26Z

@b-slim ah, yeah, one of the error messages changed. I just pushed a fix for that, thanks.

fjy · 2017-06-13T21:51:57Z

@b-slim any more comments?

gianm · 2017-06-14T21:01:19Z

@fjy are you ok with the design here?

fjy · 2017-06-14T21:07:51Z

yeah +1

gianm · 2017-06-14T21:50:12Z

Ok, I think it's mergeable then, since we have 3 +1s and one of the reviewers looked at the code. Thanks everyone.

leventov · 2017-06-20T23:54:29Z

+   * @throws IllegalArgumentException if there are any output name collisions or missing post-aggregator inputs
+   */
  public static List<PostAggregator> prepareAggregations(
+      List<String> otherOutputNames,


Is Queries considered public API? Because this is a breaking change. E.g., there is a usage of this method in our extensions.

I'm not sure, good question. I guess we might as well consider it public, since it's likely that people will have copied it from builtin query impls.
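One common way to soften this kind of break is to keep the old signature as an overload that delegates to the new one. The sketch below shows that idea only. The class, method name, and parameter types are hypothetical (plain name lists instead of Druid's aggregator types), and the collision message is borrowed from the test failure quoted earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class QueriesSketch
{
  private QueriesSketch() {}

  // New signature: takes an extra leading list of other output names.
  static List<String> prepareOutputNames(
      List<String> otherOutputNames,
      List<String> aggNames,
      List<String> postAggNames
  )
  {
    Set<String> seen = new HashSet<>();
    List<String> all = new ArrayList<>();
    addChecked(otherOutputNames, seen, all);
    addChecked(aggNames, seen, all);
    addChecked(postAggNames, seen, all);
    return all;
  }

  // Old signature kept for callers compiled against it; delegates with an empty list.
  @Deprecated
  static List<String> prepareOutputNames(List<String> aggNames, List<String> postAggNames)
  {
    return prepareOutputNames(Collections.<String>emptyList(), aggNames, postAggNames);
  }

  // Rejects any name that has already been seen.
  private static void addChecked(List<String> names, Set<String> seen, List<String> all)
  {
    for (String name : names) {
      if (!seen.add(name)) {
        throw new IllegalArgumentException("[" + name + "] already defined");
      }
      all.add(name);
    }
  }
}
```

Existing two-argument callers keep compiling, while new callers can pass dimension output names through the three-argument form.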

gianm added Improvement Design Review labels Jun 12, 2017

leventov reviewed Jun 12, 2017

Code review comments.

6553f3b

leventov reviewed Jun 12, 2017

jihoonson reviewed Jun 13, 2017

More code review.

5ba06db

leventov approved these changes Jun 13, 2017

Fix test.

3ff9516

Adjust annotations.

910ea69

gianm merged commit 6edee7f into apache:master Jun 14, 2017

gianm deleted the se-expr-strings branch June 14, 2017 21:50

gianm mentioned this pull request Jun 14, 2017

Math expression null handling #3645

Closed

leventov mentioned this pull request Jun 15, 2017

Some tests take too long time #4402

Closed

leventov reviewed Jun 20, 2017

gianm added this to the 0.10.2 milestone Jun 20, 2017

gianm added a commit to gianm/druid that referenced this pull request Jun 21, 2017

Queries: Restore old prepareAggregations method.

c495174

For backwards compatibility, post-apache#4394.

gianm mentioned this pull request Jun 21, 2017

Queries: Restore old prepareAggregations method. #4432

Merged

leventov added this to the 0.10.1 milestone Jun 21, 2017

leventov removed this from the 0.10.2 milestone Jun 21, 2017

gianm mentioned this pull request Jun 21, 2017

Add @ExtensionPoint and @PublicApi annotations. #4433

Merged

b-slim pushed a commit that referenced this pull request Jun 21, 2017

Queries: Restore old prepareAggregations method. (#4432)

34d2f9e

For backwards compatibility, post-#4394.

gianm added a commit to gianm/druid that referenced this pull request Jun 21, 2017

Queries: Restore old prepareAggregations method. (apache#4432)

9e177d7

For backwards compatibility, post-apache#4394.

gianm added a commit that referenced this pull request Jun 21, 2017

Queries: Restore old prepareAggregations method. (#4432) (#4436)

729402e

For backwards compatibility, post-#4394.
