ANY Aggregator should not skip null values implementation#9317
ANY Aggregator should not skip null values implementation#9317clintropolis merged 10 commits intoapache:masterfrom
Conversation
|
Static analysis is flagging unused imports |
| (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries. | ||
|
|
||
| If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value. | ||
| If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null. |
There was a problem hiding this comment.
I think we should re-write this to explain why someone would use this aggregator similar to how it's explained in the snowflake docs - https://docs.snowflake.net/manuals/sql-reference/functions/any_value.html
I'm not sure where the correct place in the docs is to explain this - since technically this is the spec for the native query and we have another page with a spec for sql.
Here's my suggestion for the why:
ANY aggregator can be used to simplify and optimize the performance of GROUP BY statements. A common problem for many queries is that the result of a query with a GROUP BY clause can only contain expressions used in the GROUP BY clause itself, or results of aggregate functions
select customer.id , customer.name , sum(orders.value)
from customer
join orders on customer.id = orders.customer_id
group by customer.id , customer.name;
Since we know that each customer.id can have only one name, this can be optimized as
select customer.id , ANY(customer.name) , sum(orders.value)
from customer
join orders on customer.id = orders.customer_id
group by customer.id ;
I should also point out, with the current implementation of aggregators, there is no advantage to using an ANY aggregator vs a min aggregator, but maybe that will change in the future 🤷♂
There was a problem hiding this comment.
Suggested re-wording to explain how null handling works
This aggregator will return any value for the provided expression. It does not prefer non-null values over null values. If `druid.generic.useDefaultValueForNull` is set to true, this will not return null, but instead return the default value. Otherwise this aggregator may return null
There was a problem hiding this comment.
EDIT: @maytasm3 pointed out this benchmark - https://static.imply.io/gianm/vb.html which shows the performance gain of not having to read a value. Since ANY only reads the first value, this will be faster than min, even thought the aggregator has to loop over all values.
I should have known that... 🤦♂
There was a problem hiding this comment.
+1 to Suneet's rewording. Agreed as well that we should add the "why".
There was a problem hiding this comment.
Added the why (short version of what Suneet wrote)
| (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries. | ||
|
|
||
| If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value. | ||
| If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null. |
There was a problem hiding this comment.
+1 to Suneet's rewording. Agreed as well that we should add the "why".
| |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| | ||
| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | ||
| |`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any non-null value of `expr`| | ||
| |`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null| |
There was a problem hiding this comment.
nit, adding a comma: "If druid.generic.useDefaultValueForNull=true, this can return..."
There was a problem hiding this comment.
I sort of think this should just be simplified so it is less confusing, maybe something like:
Returns any value of
expr, including null.
Also, expr does not need to be numeric since you implemented a string any aggregator in the previous PR.
There was a problem hiding this comment.
simplified as suggested.
expr needs to be null for ANY_VALUE(expr)
String column have to be ANY_VALUE(expr, maxBytesPerString)
| } else { | ||
| return rhs; | ||
| } | ||
| return lhs; |
There was a problem hiding this comment.
I think from like a .. user satisfaction perspective, it might still be nice to prefer non-null values since it is still legitimate.
There was a problem hiding this comment.
to be consistent with the new policy of not discriminating null values #equality
Decided not to make assumption in preferring non-null value.
| @Override | ||
| public String getTypeName() | ||
| { | ||
| // if we don't pretend to be a primitive, group by v1 gets sad and doesn't work because no complex type serde |
There was a problem hiding this comment.
this comment seems not applicable since it is a primitive
| |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| | ||
| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | ||
| |`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any non-null value of `expr`| | ||
| |`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null| |
There was a problem hiding this comment.
I sort of think this should just be simplified so it is less confusing, maybe something like:
Returns any value of
expr, including null.
Also, expr does not need to be numeric since you implemented a string any aggregator in the previous PR.
| * {@link NullableNumericAggregatorFactory}. If null needs to be handle, then {@link NullableNumericAggregatorFactory} | ||
| * will wrap this aggregator in {@link NullableNumericAggregator} and can handle all null in that class. | ||
| * Hence, no null will ever be pass into this aggregator from the valueSelector. | ||
| * This Aggregator is created by the {@link DoubleAnyAggregatorFactory} which has no special null handling logic. |
There was a problem hiding this comment.
I'm not sure this is really worth mentioning. This is the standard case for aggregator implementations, since it is less common for an agg implementation to have magic null handling wrapped around it than just handling the nulls itself.
| { | ||
| private static final byte BYTE_FLAG_IS_NOT_SET = 0; | ||
| private static final byte BYTE_FLAG_IS_SET = 1; | ||
| private static final int IS_FOUND_FLAG_OFFSET_POSITION = 0; |
There was a problem hiding this comment.
nit: I think this 0 offset variable makes the put and get operations more complicated than they need to be, suggest just dropping this and not adding or removing anything.
There was a problem hiding this comment.
Further, since is null is only going to be either 0 or 1, you save a byte per agg result and could use any of the high bits of this null byte to set whether or not you have found a value. Instead of an offset for null flag, you define a bit mask and your found check becomes something like buf.get(position) & FOUND_MASK == FOUND_MASK or whatever.
| public abstract class NumericAnyBufferAggregator<TSelector extends BaseNullableColumnValueSelector> | ||
| implements BufferAggregator | ||
| { | ||
| private static final byte BYTE_FLAG_IS_NOT_SET = 0; |
There was a problem hiding this comment.
Please use NullHandling.IS_NULL_BYTE and NullHandling.IS_NOT_NULL_BYTE to be consistent with other aggregators, at least for the 'is null' byte.
| private static final byte BYTE_FLAG_IS_SET = 1; | ||
| private static final int IS_FOUND_FLAG_OFFSET_POSITION = 0; | ||
| private static final int IS_NULL_FLAG_OFFSET_POSITION = IS_FOUND_FLAG_OFFSET_POSITION + Byte.BYTES; | ||
| private static final int FOUND_VALUE_OFFSET_POSITION = IS_NULL_FLAG_OFFSET_POSITION + Byte.BYTES; |
There was a problem hiding this comment.
nit: suggest just making this package private and having subclasses use this directly in their putValue implementations instead of getFoundValueStoredPosition function call, and maybe just calling it VALUE_OFFSET or FOUND_VALUE_OFFSET since the position seems redundant.
| statsMap.put("interrupted", true); | ||
| statsMap.put("reason", e.toString()); | ||
| } | ||
| // Mimic behavior from QueryResource, where this code was originally taken from. |
There was a problem hiding this comment.
Why this change? I don't think it is correct, since this stuff should only be set if it was a QueryInterruptedException
There was a problem hiding this comment.
Changed to just moving the logging out of the if check
There was a problem hiding this comment.
The change is so that we can see the exception (and possibly stacktrace) for exceptions other than QueryInterruptedException too
| (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries. | ||
|
|
||
| If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null. | ||
| Returns any value including null. This aggregator simplify and optimize the performance by returning the first encoutnered value (including null) |
There was a problem hiding this comment.
typo 'encoutnered' is causing CI failure
| |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| | ||
| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | ||
| |`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null| | ||
| |`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator simplify and optimize the performance by returning the first encoutnered value (including null)| |
There was a problem hiding this comment.
nit: should be 'This aggregator can simplify..'
Should this note also be added to the description of string any value?
There was a problem hiding this comment.
Added 'can'. The string any indicates "Like ANY_VALUE(expr)" which implies that whatever is mentioned for the ANY_VALUE(expr) also applies to the string any.
| private static final int FOUND_VALUE_OFFSET_POSITION = IS_NULL_FLAG_OFFSET_POSITION + Byte.BYTES; | ||
| // Rightmost bit for is null check (0 for is null and 1 for not null) | ||
| // Second rightmost bit for is found check (0 for not found and 1 for found) | ||
| private static final byte BYTE_FLAG_FOUND_MASK = 0b0010; |
There was a problem hiding this comment.
super super nit, but can you use hex? (0x02 and 0x01 for BYTE_FLAG_NULL_MASK)
* ANY Aggregator should not skip null values implementation * add tests * add more tests * Update documentation * add more tests * address review comments * optimize StringAnyBufferAggregator * fix failing tests * address pr comments
… (#56) * ANY Aggregator should not skip null values implementation * add tests * add more tests * Update documentation * add more tests * address review comments * optimize StringAnyBufferAggregator * fix failing tests * address pr comments
Modify ANY aggregator to not skip null values
Description
Modify ANY aggregator to not skip null values and hence return null values if that is the first value the aggregator encountered.
This PR has: