add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg by himanshug · Pull Request #1965 · apache/druid

himanshug · 2015-11-13T05:44:20Z

This PR adds a "function" field to the long/double sum aggregators, which allows storing
sum(function(x)) and not just sum(x).

Now, say you want to report the standard deviation of a metric x. You would need to store 3 columns:
sum of x, sum of squares of x, and count.

At query time, it should be possible to compute the standard deviation from these three columns.

It would be the variance if you didn't take the square root.
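For illustration, the query-time computation described above could look roughly like the sketch below. It assumes the population form of the formula, sqrt(sum(x^2)/count - (sum(x)/count)^2); the thread does not say whether the population or the sample form was intended. The class and method names are hypothetical and are not part of Druid.

```java
// Minimal sketch, not Druid code: standard deviation from three
// pre-aggregated columns (sum of x, sum of x^2, count).
// Assumption: population variance, i.e. E[x^2] - (E[x])^2.
final class StdDevFromSums {
    static double variance(double sumX, double sumX2, long count) {
        double mean = sumX / count;
        return sumX2 / count - mean * mean;
    }

    static double stdDev(double sumX, double sumX2, long count) {
        // Clamp at zero: rounding can make the difference slightly negative.
        return Math.sqrt(Math.max(0.0, variance(sumX, sumX2, count)));
    }

    public static void main(String[] args) {
        double[] xs = {2, 4, 4, 4, 5, 5, 7, 9};
        double sumX = 0;
        double sumX2 = 0;
        for (double x : xs) {
            sumX += x;       // what a plain sum(x) column would hold
            sumX2 += x * x;  // what a sum(x^2) column would hold
        }
        System.out.println(stdDev(sumX, sumX2, xs.length)); // prints 2.0
    }
}
```

Subtracting two large, nearly equal quantities this way can lose precision, which is one reason a numerically different method comes up later in the discussion.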

In general, using just the long/double sum aggregators, users can compute a variety of statistical functions and other polynomial functions.

Also, support for computing square roots is added to the arithmetic post aggregator.
TODO: wondering whether to add a "variance" post agg.

xvrl · 2015-11-13T07:20:43Z

can we quantify the performance impact of the additional branch check?

himanshug · 2015-11-13

@xvrl will do that

himanshug · 2015-11-14

I ran some tests using the following code:

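    import java.util.concurrent.TimeUnit;

    // Druid packages as of late 2015 (assumed).
    import io.druid.query.aggregation.LongSumAggregator;
    import io.druid.segment.LongColumnSelector;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    @State(Scope.Benchmark)
    public class LongSumAggregatorBenchmark {
      // Selector that always returns the same value, so only aggregate() is measured.
      private LongColumnSelector selector = new LongColumnSelector() {
        @Override
        public long get() {
          return 100L;
        }
      };

      @Benchmark
      @BenchmarkMode(Mode.AverageTime)
      @OutputTimeUnit(TimeUnit.MICROSECONDS)
      public void benchmarkAggregate(Blackhole blackhole) {
        LongSumAggregator aggregator = new LongSumAggregator("name", selector, 1);
        for (int i = 0; i < 10000000; i++) {
          aggregator.aggregate();
        }
        blackhole.consume(aggregator.get());
      }

      public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(".*" + LongSumAggregatorBenchmark.class.getSimpleName() + ".*")
            .warmupIterations(5)
            .forks(1)
            .build();
        new Runner(opt).run();
      }
    }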

With branching:

    Result "benchmarkAggregate":
      210.999 ±(99.9%) 12.971 us/op [Average]
      (min, avg, max) = (192.895, 210.999, 241.701), stdev = 14.937
      CI (99.9%): [198.028, 223.970] (assumes normal distribution)

Without branching:

    Result "benchmarkAggregate":
      209.661 ±(99.9%) 10.036 us/op [Average]
      (min, avg, max) = (189.288, 209.661, 234.579), stdev = 11.557
      CI (99.9%): [199.625, 219.697] (assumes normal distribution)

Note that across multiple runs the numbers fluctuate a bit in both cases, and I can't see any major difference in performance between the two. So branching is OK, and creating another aggregator ( #1965 (comment) ) to handle the exponent case is not really needed.

drcrallen · 2015-11-13T17:28:22Z

Can this be a new aggregator?

himanshug · 2015-11-13T17:33:28Z

@drcrallen yeah, if branching introduces a major performance issue then we can have another aggregator like GenericLongSumAggregator (and other similar ones). LongSumAggregator (without change) would be used when exponent=1 and GenericLongSumAggregator otherwise.

himanshug · 2015-11-13T17:34:31Z

@xvrl @drcrallen the main intention of this PR is to find out whether supporting the exponent makes sense as a way to get basic statistics support. I believe it does, but wanted to get second opinions.

drcrallen · 2015-11-13T18:08:34Z

@himanshug I'm curious if other methods like http://www.johndcook.com/blog/standard_deviation/ are more fitting
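For context, the linked article describes Welford's method, which updates a running mean and a running sum of squared deviations instead of accumulating sum(x) and sum(x^2) separately. A minimal sketch of that technique follows. The class name is hypothetical, and a real Druid aggregator would also need a way to combine partial states from different segments, which is not shown here.

```java
// Minimal sketch of Welford's running variance, not Druid code.
final class RunningVariance {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        // Uses the old mean (via delta) and the updated mean together.
        m2 += delta * (x - mean);
    }

    double mean() {
        return mean;
    }

    // Sample variance (divides by n - 1); returns 0 for fewer than 2 values.
    double sampleVariance() {
        return n > 1 ? m2 / (n - 1) : 0.0;
    }

    double sampleStdDev() {
        return Math.sqrt(sampleVariance());
    }
}
```

Because it never forms a large sum of squares, this approach is less exposed to the overflow and cancellation problems that the sum-based formula can run into.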

xvrl · 2015-11-13T19:04:02Z

@himanshug Overall I'm on board with supporting either:

- some kind of function composition to allow composing aggregators with other functions.
- some specialized methods for computing statistics such as variance.

If the objective in this particular case is to compute variance, I don't think it makes sense to have a general sum(x^y) aggregator, since we will almost always use x^2, and x*x will be orders of magnitude faster than using the generic power operator.

I also agree that if we only care about variance, it may be worthwhile to look into different numerical techniques to avoid overflow problems, given the scale that we intend to have Druid run at.

Even if we do go the route of having specialized aggregators, I would prefer if we exposed some form of function composition at the query API level, and then translate internally into special-purpose aggregations for the cases that we have optimized.
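One way to read the suggestion above is that a general "sum of x^y" request can be routed internally to a specialized code path for the common exponents. The sketch below shows that idea under stated assumptions: the class and method names are hypothetical and this is not how Druid implements it. Whether x * x actually outperforms Math.pow on a given JVM would need to be measured.

```java
// Minimal sketch, not Druid code: dispatch a generic power-sum request
// to specialized loops for the exponents that are expected to be common.
final class PowerSum {
    static double sum(double[] xs, int exponent) {
        double total = 0;
        switch (exponent) {
            case 1:
                // Plain sum, no per-value function at all.
                for (double x : xs) {
                    total += x;
                }
                return total;
            case 2:
                // Specialized square, avoids the generic power routine.
                for (double x : xs) {
                    total += x * x;
                }
                return total;
            default:
                // Generic fallback for any other exponent.
                for (double x : xs) {
                    total += Math.pow(x, exponent);
                }
                return total;
        }
    }
}
```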

nishantmonu51 · 2015-11-13T19:10:01Z

+1 on having composable aggregators at the query level to do this; the implementation can optimize things based on the compositions.

himanshug · 2015-11-13T19:41:53Z

@drcrallen I agree that a sketch approach specific to variance would give better results, but complex types are a bit slower.
The composable approach here is more general and allows computing any polynomial function. E.g. this can be used to do things like simple linear/logistic regression, given that the model parameters are known upfront (say from training in a batch pipeline).
Also, we can still implement a variance-specific complex aggregator at some point. However, for sketches going forward, I would put the sketching code in the datasketches library (so that the same thing becomes available for pig/hive if users wish to pre-aggregate data in the batch pipeline) and implement the druid aggregator inside the datasketches-aggregators extension.

Anyway, it seems like people are generally in favor of supporting this functionality, so I will do the necessary things and update this PR.

drcrallen · 2015-11-16T17:26:33Z

I still think this should be a separate aggregator because it muddles user expectation.

We already have a count aggregator which returns different results at ingestion time vs query time, and that causes a lot of confusion.

A longSum aggregator in this PR cannot be used the same way at ingestion time vs query time. For example, a longSum aggregator with an exponent of 2 specified at both ingestion and query time will produce a power 4 result... ASSUMING no rollup.

If the exponential is applied at QUERY time, then for cases where the data was rolled up the results are very likely NOT going to be what the user is looking for. A simple scenario would be ingesting some double value at ingestion time (with doubleSum), and then issuing a doubleSum with an exponential in a query. In such a scenario the result would be some absurd polynomial of the original data that doesn't at all resemble what the user was actually looking for.

As such this functionality would be an advanced feature and, if possible, should be distinct from the default longSum/doubleSum at least in documentation, and (my personal preference) in nomenclature.
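The ingestion-time versus query-time concern above can be made concrete with a small numeric sketch. It is only an illustration with hypothetical names, not Druid code: it compares the intended sum of squares with what comes out when the exponent is applied after rollup, and when it is applied at both stages without rollup.

```java
// Minimal sketch, not Druid code: why the stage at which the exponent
// is applied changes the result.
final class RollupPitfall {
    // Intended result: exponent applied to each raw value exactly once.
    static double sumOfSquares(double[] rows) {
        double s = 0;
        for (double x : rows) {
            s += x * x;
        }
        return s;
    }

    // Rows rolled up with a plain sum at ingestion, exponent applied at query time.
    static double squareOfRolledUpSum(double[] rows) {
        double rolledUp = 0;
        for (double x : rows) {
            rolledUp += x;
        }
        return rolledUp * rolledUp;
    }

    // Exponent 2 at ingestion and again at query time, with no rollup.
    static double squaredTwice(double[] rows) {
        double s = 0;
        for (double x : rows) {
            double ingested = x * x;
            s += ingested * ingested;
        }
        return s;
    }

    public static void main(String[] args) {
        double[] rows = {1, 2, 3};
        System.out.println(sumOfSquares(rows));        // prints 14.0
        System.out.println(squareOfRolledUpSum(rows)); // prints 36.0
        System.out.println(squaredTwice(rows));        // prints 98.0
    }
}
```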

himanshug · 2015-11-17T03:06:09Z

@drcrallen Even if we create a separate aggregator to handle this, the user will have to provide a different aggregator [configuration] at ingestion time vs query time, because you cannot detect the query/ingestion context inside the aggregator implementation. Please let me know if I am missing something here.
I still think it is OK to have a new attribute exponent in the same longSum and doubleSum aggregators to support this use case. It is totally backward compatible and we can document it separately if needed.

xvrl · 2015-11-17T05:28:18Z

this seems unrelated to this PR

himanshug · 2015-11-17

Generated java files were failing to compile because of the constructor change. I decided to remove it instead of fixing it, as it is not used.

xvrl · 2015-11-17T06:27:23Z

My two primary concerns with adding exponent to the existing aggregator factories are the following:

- It creates a precedent for tacking on additional functionality to the existing aggregator primitives. Let's say I want to compute a geometric mean: I would probably do a sum of logs. Should we then add an ln flag to doubleSum?
- It does not provide a clean path to support function composition at the API level, nor a clean implementation amenable to further optimizations.

How about instead of adding an exponent field on the aggregator-factory, we add a function field that allows passing some kind of function of the column value. For now we can support only exponent, and pass this exponent to a specialized aggregator, but at least it leaves the path open to support more things down the road.
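A rough sketch of what such a "function" field could mean is given below, using a small set of predefined functions applied to each value before summing. Everything here is hypothetical (the enum, its constants, and the helper methods) and is not the API proposed in the PR. It also shows the geometric-mean case mentioned above, computed as exp(sum(ln x) / n).

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical set of predefined functions a "function" field could name.
enum SumFunction {
    IDENTITY(x -> x),
    SQUARE(x -> x * x),
    LN(Math::log);

    private final DoubleUnaryOperator op;

    SumFunction(DoubleUnaryOperator op) {
        this.op = op;
    }

    double apply(double x) {
        return op.applyAsDouble(x);
    }
}

// Minimal sketch, not Druid code: sum(function(x)) over a column of values.
final class FunctionSum {
    static double sum(double[] xs, SumFunction f) {
        double total = 0;
        for (double x : xs) {
            total += f.apply(x);
        }
        return total;
    }

    // Geometric mean from a sum of natural logs; assumes all values are positive.
    static double geometricMean(double[] xs) {
        return Math.exp(sum(xs, SumFunction.LN) / xs.length);
    }
}
```

Keeping the named function separate from the summing loop leaves room to swap in a specialized implementation for a given function later without changing what the query specifies.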

xvrl · 2015-11-17T06:29:03Z

needs an updated cache key for all factories.

himanshug · 2015-11-17T07:09:02Z

@xvrl function instead of exponent sounds good. Do you mean adding support for predefined functions, or do you mean accepting javascript? I guess it is the former, but just confirming.

himanshug · 2015-11-18T17:02:03Z

@xvrl pls see 2ef5582 for generic "function" support instead of "exponent". Does it look OK? (It is incomplete but demonstrates the idea.)

xvrl · 2015-11-25T00:52:04Z

@himanshug that seems reasonable, although we might want to separate out the type used for serde from the actual implementation of the function. This will make it easier to abstract the query api from the function implementation. What do you think? Anyone else have thoughts on this, maybe @cheddar ?

himanshug · 2015-12-02T19:55:49Z

@xvrl I confirmed with @cheddar and the "function" attribute is OK with him.
However (maybe in the future), it would be nice to accept a math expression as the function so that users can apply arbitrary functions without us having to implement them all.
I would like to know if there is a suitable math expression language implementation already available that we could use, or else I can define and write a custom one. (I understand that we can do it with the javascript evaluator.)

fjy · 2016-01-08T19:19:55Z

@himanshug can we rebase this from master again?

himanshug · 2016-01-08T19:26:10Z

@fjy I'm waiting for #2090 to be merged first, then I'll get back to this one.

drcrallen · 2016-02-25T17:18:57Z

@himanshug can you reconcile this and #2525 regarding functionality and how it relates to #2090 ?

himanshug · 2016-02-27T04:20:09Z

@drcrallen I haven't gone through #2525 completely, but that is a dedicated aggregator for indexing and computing variance, while the change proposed here is a general feature to compute any polynomial function on the input. #2090 provides the expression language for computing those functions and is therefore a prerequisite.

fjy · 2016-04-12T20:13:39Z

@himanshug #2090 has been merged so I think this can be finished up

himanshug · 2016-06-29T18:10:42Z

I think similar work is done in some other PRs, will take a look at those and will reopen if needed.

himanshug added the Discuss label Nov 13, 2015

himanshug force-pushed the more_aggs branch from 2730ecb to 79a8e54 Compare November 13, 2015 05:47

xvrl reviewed Nov 13, 2015

himanshug force-pushed the more_aggs branch from 79a8e54 to 9b4bf4e Compare November 15, 2015 06:27

himanshug removed the Discuss label Nov 15, 2015

himanshug changed the title ~~discuss - support variance and standard deviation~~ add "exponent" field to long/double Sum aggs and "sqrt" to arithmetic post agg Nov 15, 2015

himanshug force-pushed the more_aggs branch 2 times, most recently from 1ee3f02 to 3b261b2 Compare November 15, 2015 06:50

himanshug added 3 commits November 15, 2015 00:52

removing unused antlr grammer etc

bfe595c

adding attribute "exponent" to [long/double]Sum aggregators

ee326a8

adding support for square root to arithmetic post aggregator

526ed2b

himanshug force-pushed the more_aggs branch from 3b261b2 to 526ed2b Compare November 15, 2015 06:52

xvrl reviewed Nov 17, 2015

support "function" attribute instead of "exponent"

2ef5582

himanshug changed the title ~~add "exponent" field to long/double Sum aggs and "sqrt" to arithmetic post agg~~ add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg Dec 2, 2015

himanshug mentioned this pull request Dec 14, 2015

math expression support #2090

Merged

himanshug mentioned this pull request Mar 5, 2016

Support variance and standard deviation #2525

Merged

navis mentioned this pull request Apr 20, 2016

Math expressional parameters for aggregator #2783

Merged

fjy added the Feature label Jun 28, 2016

himanshug closed this Jun 29, 2016

himanshug deleted the more_aggs branch January 3, 2017 16:24
