Skip to content

ANY Aggregator should not skip null values implementation#9317

Merged
clintropolis merged 10 commits intoapache:masterfrom
maytasm:IMPLY-1889
Feb 12, 2020
Merged

ANY Aggregator should not skip null values implementation#9317
clintropolis merged 10 commits intoapache:masterfrom
maytasm:IMPLY-1889

Conversation

@maytasm
Copy link
Copy Markdown
Contributor

@maytasm maytasm commented Feb 5, 2020

Modify ANY aggregator to not skip null values

Description

Modify ANY aggregator to not skip null values and hence return null values if that is the first value the aggregator encountered.

  • DoubleAnyAggregatorFactory, FloatAnyAggregatorFactory, LongAnyAggregatorFactory instead of extending from SimpleDoubleAggregatorFactory (which skip null values) will instead extend AggregatorFactory and will handles null values by in it's AnyAggregator.
  • AnyAggregator will have additional flag to mark if null value is found. If null value is found, all subsequent values will be skipped.
  • AnyAggregator can return null value if null value is he first value the aggregator encountered.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

@ccaominh
Copy link
Copy Markdown
Contributor

ccaominh commented Feb 8, 2020

Static analysis is flagging unused imports

Comment thread docs/querying/aggregations.md Outdated
(Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value.
If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.
Copy link
Copy Markdown
Contributor

@suneet-s suneet-s Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should re-write this to explain why someone would use this aggregator similar to how it's explained in the snowflake docs - https://docs.snowflake.net/manuals/sql-reference/functions/any_value.html
I'm not sure where the correct place in the docs is to explain this - since technically this is the spec for the native query and we have another page with a spec for sql.

Here's my suggestion for the why:

ANY aggregator can be used to simplify and optimize the performance of GROUP BY statements. A common problem for many queries is that the result of a query with a GROUP BY clause can only contain expressions used in the GROUP BY clause itself, or results of aggregate functions

select customer.id , customer.name , sum(orders.value)
    from customer
    join orders on customer.id = orders.customer_id
    group by customer.id , customer.name;

Since we know that each customer.id can have only one name, this can be optimized as

select customer.id , ANY(customer.name) , sum(orders.value)
    from customer
    join orders on customer.id = orders.customer_id
    group by customer.id ;

I should also point out, with the current implementation of aggregators, there is no advantage to using an ANY aggregator vs a min aggregator, but maybe that will change in the future 🤷‍♂

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested re-wording to explain how null handling works

This aggregator will return any value for the provided expression. It does not prefer non-null values over null values. If `druid.generic.useDefaultValueForNull` is set to true, this will not return null, but instead return the default value. Otherwise this aggregator may return null

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: @maytasm3 pointed out this benchmark - https://static.imply.io/gianm/vb.html which shows the performance gain of not having to read a value. Since ANY only reads the first value, this will be faster than min, even thought the aggregator has to loop over all values.

I should have known that... 🤦‍♂

Copy link
Copy Markdown

@sthetland sthetland Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Suneet's rewording. Agreed as well that we should add the "why".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the why (short version of what Suneet wrote)

Comment thread docs/querying/aggregations.md Outdated
(Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value.
If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.
Copy link
Copy Markdown

@sthetland sthetland Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Suneet's rewording. Agreed as well that we should add the "why".

Comment thread docs/querying/sql.md Outdated
|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any non-null value of `expr`|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null|
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, adding a comma: "If druid.generic.useDefaultValueForNull=true, this can return..."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of think this should just be simplified so it is less confusing, maybe something like:

Returns any value of expr, including null.

Also, expr does not need to be numeric since you implemented a string any aggregator in the previous PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplified as suggested.
expr needs to be null for ANY_VALUE(expr)
String column have to be ANY_VALUE(expr, maxBytesPerString)

} else {
return rhs;
}
return lhs;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think from like a .. user satisfaction perspective, it might still be nice to prefer non-null values since it is still legitimate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be consistent with the new policy of not discriminating null values #equality
Decided not to make assumption in preferring non-null value.

@Override
public String getTypeName()
{
// if we don't pretend to be a primitive, group by v1 gets sad and doesn't work because no complex type serde
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment seems not applicable since it is a primitive

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

Comment thread docs/querying/sql.md Outdated
|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any non-null value of `expr`|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null|
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of think this should just be simplified so it is less confusing, maybe something like:

Returns any value of expr, including null.

Also, expr does not need to be numeric since you implemented a string any aggregator in the previous PR.

* {@link NullableNumericAggregatorFactory}. If null needs to be handle, then {@link NullableNumericAggregatorFactory}
* will wrap this aggregator in {@link NullableNumericAggregator} and can handle all null in that class.
* Hence, no null will ever be pass into this aggregator from the valueSelector.
* This Aggregator is created by the {@link DoubleAnyAggregatorFactory} which has no special null handling logic.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is really worth mentioning. This is the standard case for aggregator implementations, since it is less common for an agg implementation to have magic null handling wrapped around it than just handling the nulls itself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

{
private static final byte BYTE_FLAG_IS_NOT_SET = 0;
private static final byte BYTE_FLAG_IS_SET = 1;
private static final int IS_FOUND_FLAG_OFFSET_POSITION = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this 0 offset variable makes the put and get operations more complicated than they need to be, suggest just dropping this and not adding or removing anything.

Copy link
Copy Markdown
Member

@clintropolis clintropolis Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, since is null is only going to be either 0 or 1, you save a byte per agg result and could use any of the high bits of this null byte to set whether or not you have found a value. Instead of an offset for null flag, you define a bit mask and your found check becomes something like buf.get(position) & FOUND_MASK == FOUND_MASK or whatever.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

public abstract class NumericAnyBufferAggregator<TSelector extends BaseNullableColumnValueSelector>
implements BufferAggregator
{
private static final byte BYTE_FLAG_IS_NOT_SET = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use NullHandling.IS_NULL_BYTE and NullHandling.IS_NOT_NULL_BYTE to be consistent with other aggregators, at least for the 'is null' byte.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private static final byte BYTE_FLAG_IS_SET = 1;
private static final int IS_FOUND_FLAG_OFFSET_POSITION = 0;
private static final int IS_NULL_FLAG_OFFSET_POSITION = IS_FOUND_FLAG_OFFSET_POSITION + Byte.BYTES;
private static final int FOUND_VALUE_OFFSET_POSITION = IS_NULL_FLAG_OFFSET_POSITION + Byte.BYTES;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: suggest just making this package private and having subclasses use this directly in their putValue implementations instead of getFoundValueStoredPosition function call, and maybe just calling it VALUE_OFFSET or FOUND_VALUE_OFFSET since the position seems redundant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

statsMap.put("interrupted", true);
statsMap.put("reason", e.toString());
}
// Mimic behavior from QueryResource, where this code was originally taken from.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? I don't think it is correct, since this stuff should only be set if it was a QueryInterruptedException

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to just moving the logging out of the if check

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is so that we can see the exception (and possibly stacktrace) for exceptions other than QueryInterruptedException too

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 after CI

Comment thread docs/querying/aggregations.md Outdated
(Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.
Returns any value including null. This aggregator simplify and optimize the performance by returning the first encoutnered value (including null)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo 'encoutnered' is causing CI failure

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops done.

Comment thread docs/querying/sql.md Outdated
|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can return the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will return any value of `expr` including null|
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator simplify and optimize the performance by returning the first encoutnered value (including null)|
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be 'This aggregator can simplify..'

Should this note also be added to the description of string any value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 'can'. The string any indicates "Like ANY_VALUE(expr)" which implies that whatever is mentioned for the ANY_VALUE(expr) also applies to the string any.

private static final int FOUND_VALUE_OFFSET_POSITION = IS_NULL_FLAG_OFFSET_POSITION + Byte.BYTES;
// Rightmost bit for is null check (0 for is null and 1 for not null)
// Second rightmost bit for is found check (0 for not found and 1 for found)
private static final byte BYTE_FLAG_FOUND_MASK = 0b0010;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super super nit, but can you use hex? (0x02 and 0x01 for BYTE_FLAG_NULL_MASK)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@clintropolis clintropolis merged commit c30579e into apache:master Feb 12, 2020
maytasm added a commit to maytasm/druid that referenced this pull request Feb 12, 2020
* ANY Aggregator should not skip null values implementation

* add tests

* add more tests

* Update documentation

* add more tests

* address review comments

* optimize StringAnyBufferAggregator

* fix failing tests

* address pr comments
maytasm added a commit to implydata/druid-public that referenced this pull request Feb 13, 2020
… (#56)

* ANY Aggregator should not skip null values implementation

* add tests

* add more tests

* Update documentation

* add more tests

* address review comments

* optimize StringAnyBufferAggregator

* fix failing tests

* address pr comments
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants