ANY Aggregator should not skip null values implementation by maytasm · Pull Request #9317 · apache/druid

maytasm · 2020-02-05T23:24:23Z

Modify ANY aggregator to not skip null values

Description

Modify ANY aggregator to not skip null values and hence return null values if that is the first value the aggregator encountered.

DoubleAnyAggregatorFactory, FloatAnyAggregatorFactory, LongAnyAggregatorFactory instead of extending from SimpleDoubleAggregatorFactory (which skip null values) will instead extend AggregatorFactory and will handles null values by in it's AnyAggregator.
AnyAggregator will have additional flag to mark if null value is found. If null value is found, all subsequent values will be skipped.
AnyAggregator can return null value if null value is he first value the aggregator encountered.

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths.
added integration tests.
been tested in a test Druid cluster.

ccaominh · 2020-02-08T22:25:04Z

Static analysis is flagging unused imports

suneet-s · 2020-02-11T17:07:59Z

 (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

-If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value.
+If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.


I think we should re-write this to explain why someone would use this aggregator similar to how it's explained in the snowflake docs - https://docs.snowflake.net/manuals/sql-reference/functions/any_value.html
I'm not sure where the correct place in the docs is to explain this - since technically this is the spec for the native query and we have another page with a spec for sql.

Here's my suggestion for the why:

ANY aggregator can be used to simplify and optimize the performance of GROUP BY statements. A common problem for many queries is that the result of a query with a GROUP BY clause can only contain expressions used in the GROUP BY clause itself, or results of aggregate functions select customer.id , customer.name , sum(orders.value) from customer join orders on customer.id = orders.customer_id group by customer.id , customer.name; Since we know that each customer.id can have only one name, this can be optimized as select customer.id , ANY(customer.name) , sum(orders.value) from customer join orders on customer.id = orders.customer_id group by customer.id ;

I should also point out, with the current implementation of aggregators, there is no advantage to using an ANY aggregator vs a min aggregator, but maybe that will change in the future 🤷‍♂

Suggested re-wording to explain how null handling works

This aggregator will return any value for the provided expression. It does not prefer non-null values over null values. If `druid.generic.useDefaultValueForNull` is set to true, this will not return null, but instead return the default value. Otherwise this aggregator may return null

EDIT: @maytasm3 pointed out this benchmark - https://static.imply.io/gianm/vb.html which shows the performance gain of not having to read a value. Since ANY only reads the first value, this will be faster than min, even thought the aggregator has to loop over all values.

I should have known that... 🤦‍♂

+1 to Suneet's rewording. Agreed as well that we should add the "why".

Added the why (short version of what Suneet wrote)

sthetland · 2020-02-11T18:22:42Z

 (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

-If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value.
+If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.


+1 to Suneet's rewording. Agreed as well that we should add the "why".

sthetland · 2020-02-11T18:23:59Z

 |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
 |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
-|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any non-null value of `expr`|
+|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null|


nit, adding a comma: "If druid.generic.useDefaultValueForNull=true, this can return..."

I sort of think this should just be simplified so it is less confusing, maybe something like:

Returns any value of expr, including null.

Also, expr does not need to be numeric since you implemented a string any aggregator in the previous PR.

simplified as suggested.
expr needs to be null for ANY_VALUE(expr)
String column have to be ANY_VALUE(expr, maxBytesPerString)

clintropolis · 2020-02-07T23:06:56Z

-    } else {
-      return rhs;
-    }
+    return lhs;


I think from like a .. user satisfaction perspective, it might still be nice to prefer non-null values since it is still legitimate.

to be consistent with the new policy of not discriminating null values #equality
Decided not to make assumption in preferring non-null value.

clintropolis · 2020-02-07T23:23:18Z

+  @Override
+  public String getTypeName()
+  {
+    // if we don't pretend to be a primitive, group by v1 gets sad and doesn't work because no complex type serde


this comment seems not applicable since it is a primitive

clintropolis · 2020-02-11T22:10:11Z

 |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
 |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
-|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any non-null value of `expr`|
+|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null|


I sort of think this should just be simplified so it is less confusing, maybe something like:

Returns any value of expr, including null.

Also, expr does not need to be numeric since you implemented a string any aggregator in the previous PR.

clintropolis · 2020-02-11T22:13:55Z

- * {@link NullableNumericAggregatorFactory}. If null needs to be handle, then {@link NullableNumericAggregatorFactory}
- * will wrap this aggregator in {@link NullableNumericAggregator} and can handle all null in that class.
- * Hence, no null will ever be pass into this aggregator from the valueSelector.
+ * This Aggregator is created by the {@link DoubleAnyAggregatorFactory} which has no special null handling logic.


I'm not sure this is really worth mentioning. This is the standard case for aggregator implementations, since it is less common for an agg implementation to have magic null handling wrapped around it than just handling the nulls itself.

clintropolis · 2020-02-11T22:19:55Z

+{
+  private static final byte BYTE_FLAG_IS_NOT_SET = 0;
+  private static final byte BYTE_FLAG_IS_SET = 1;
+  private static final int IS_FOUND_FLAG_OFFSET_POSITION = 0;


nit: I think this 0 offset variable makes the put and get operations more complicated than they need to be, suggest just dropping this and not adding or removing anything.

Further, since is null is only going to be either 0 or 1, you save a byte per agg result and could use any of the high bits of this null byte to set whether or not you have found a value. Instead of an offset for null flag, you define a bit mask and your found check becomes something like buf.get(position) & FOUND_MASK == FOUND_MASK or whatever.

clintropolis · 2020-02-11T22:22:00Z

+public abstract class NumericAnyBufferAggregator<TSelector extends BaseNullableColumnValueSelector>
+    implements BufferAggregator
+{
+  private static final byte BYTE_FLAG_IS_NOT_SET = 0;


Please use NullHandling.IS_NULL_BYTE and NullHandling.IS_NOT_NULL_BYTE to be consistent with other aggregators, at least for the 'is null' byte.

clintropolis · 2020-02-11T22:29:20Z

+  private static final byte BYTE_FLAG_IS_SET = 1;
+  private static final int IS_FOUND_FLAG_OFFSET_POSITION = 0;
+  private static final int IS_NULL_FLAG_OFFSET_POSITION = IS_FOUND_FLAG_OFFSET_POSITION + Byte.BYTES;
+  private static final int FOUND_VALUE_OFFSET_POSITION = IS_NULL_FLAG_OFFSET_POSITION + Byte.BYTES;


nit: suggest just making this package private and having subclasses use this directly in their putValue implementations instead of getFoundValueStoredPosition function call, and maybe just calling it VALUE_OFFSET or FOUND_VALUE_OFFSET since the position seems redundant.

clintropolis · 2020-02-11T22:38:31Z

-          statsMap.put("interrupted", true);
-          statsMap.put("reason", e.toString());
-        }
+        // Mimic behavior from QueryResource, where this code was originally taken from.


Why this change? I don't think it is correct, since this stuff should only be set if it was a QueryInterruptedException

Changed to just moving the logging out of the if check

The change is so that we can see the exception (and possibly stacktrace) for exceptions other than QueryInterruptedException too

clintropolis

👍 after CI

clintropolis · 2020-02-12T10:05:02Z

 (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

-If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.
+Returns any value including null. This aggregator simplify and optimize the performance by returning the first encoutnered value (including null)


typo 'encoutnered' is causing CI failure

clintropolis · 2020-02-12T10:06:28Z

 |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
 |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
-|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null|
+|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator simplify and optimize the performance by returning the first encoutnered value (including null)|


nit: should be 'This aggregator can simplify..'

Should this note also be added to the description of string any value?

Added 'can'. The string any indicates "Like ANY_VALUE(expr)" which implies that whatever is mentioned for the ANY_VALUE(expr) also applies to the string any.

clintropolis · 2020-02-12T10:09:57Z

-  private static final int FOUND_VALUE_OFFSET_POSITION = IS_NULL_FLAG_OFFSET_POSITION + Byte.BYTES;
+  // Rightmost bit for is null check (0 for is null and 1 for not null)
+  // Second rightmost bit for is found check (0 for not found and 1 for found)
+  private static final byte BYTE_FLAG_FOUND_MASK = 0b0010;


super super nit, but can you use hex? (0x02 and 0x01 for BYTE_FLAG_NULL_MASK)

* ANY Aggregator should not skip null values implementation * add tests * add more tests * Update documentation * add more tests * address review comments * optimize StringAnyBufferAggregator * fix failing tests * address pr comments

… (#56) * ANY Aggregator should not skip null values implementation * add tests * add more tests * Update documentation * add more tests * address review comments * optimize StringAnyBufferAggregator * fix failing tests * address pr comments

ANY Aggregator should not skip null values implementation

59488e0

clintropolis added Area - Null Handling Area - Querying labels Feb 6, 2020

maytasm added 4 commits February 6, 2020 13:23

add tests

4344b4a

add more tests

68ac8de

Update documentation

e68e9ee

add more tests

e224eed

suneet-s reviewed Feb 11, 2020

View reviewed changes

sthetland reviewed Feb 11, 2020

View reviewed changes

clintropolis reviewed Feb 11, 2020

View reviewed changes

maytasm added 4 commits February 11, 2020 15:27

Merge remote-tracking branch 'upstream/master' into IMPLY-1889

511d33d

address review comments

094425a

optimize StringAnyBufferAggregator

00eb20e

fix failing tests

b9ef751

clintropolis approved these changes Feb 12, 2020

View reviewed changes

address pr comments

c3e645c

clintropolis approved these changes Feb 12, 2020

View reviewed changes

clintropolis merged commit c30579e into apache:master Feb 12, 2020

jihoonson added this to the 0.18.0 milestone Mar 26, 2020

Conversation

maytasm commented Feb 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

ccaominh commented Feb 8, 2020

Uh oh!

suneet-s Feb 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sthetland Feb 11, 2020 • edited by gianm Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sthetland Feb 11, 2020 • edited by gianm Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clintropolis Feb 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clintropolis left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maytasm commented Feb 5, 2020 •

edited

Loading

suneet-s Feb 11, 2020 •

edited

Loading

sthetland Feb 11, 2020 •

edited by gianm

Loading

sthetland Feb 11, 2020 •

edited by gianm

Loading

clintropolis Feb 11, 2020 •

edited

Loading