add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg#1965
himanshug wants to merge 4 commits into apache:master
can we quantify the performance impact of the additional branch check?
I ran some tests using the following code:
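import java.util.concurrent.TimeUnit;

// Druid package names from the io.druid era; adjust for your version.
import io.druid.query.aggregation.LongSumAggregator;
import io.druid.segment.LongColumnSelector;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
public class LongSumAggregatorBenchmark
{
  // Constant selector, so the benchmark measures only the aggregation path.
  private LongColumnSelector selector = new LongColumnSelector()
  {
    @Override
    public long get()
    {
      return 100L;
    }
  };

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  public void benchmarkAggregate(Blackhole blackhole)
  {
    // Third argument is the exponent added by this PR; 1 leaves values unchanged.
    LongSumAggregator aggregator = new LongSumAggregator("name", selector, 1);
    for (int i = 0; i < 10000000; i++) {
      aggregator.aggregate();
    }
    // Consume the result so the JIT cannot eliminate the loop as dead code.
    blackhole.consume(aggregator.get());
  }

  public static void main(String[] args) throws Exception
  {
    Options opt = new OptionsBuilder()
        .include(".*" + LongSumAggregatorBenchmark.class.getSimpleName() + ".*")
        .warmupIterations(5)
        .forks(1)
        .build();
    new Runner(opt).run();
  }
}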
with branching..
Result "benchmarkAggregate":
210.999 ±(99.9%) 12.971 us/op [Average]
(min, avg, max) = (192.895, 210.999, 241.701), stdev = 14.937
CI (99.9%): [198.028, 223.970] (assumes normal distribution)
without branching..
Result "benchmarkAggregate":
209.661 ±(99.9%) 10.036 us/op [Average]
(min, avg, max) = (189.288, 209.661, 234.579), stdev = 11.557
CI (99.9%): [199.625, 219.697] (assumes normal distribution)
Note that across multiple runs the numbers fluctuate a bit in both cases, and the two confidence intervals above overlap, so I can't see any major difference in performance between the two. Branching is OK, and creating another aggregator ( #1965 (comment) ) to handle the exponent case is not really needed.
|
Can this be a new aggregator? |
|
@drcrallen yeah, if branching introduces a major performance issue then we can have another aggregator like GenericLongSumAggregator (and other similar ones). LongSumAggregator (unchanged) would be used when exponent=1 and GenericLongSumAggregator would be used otherwise. |
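The dispatch idea in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `SumAggregatorDispatch`, `LongAgg`, `PlainLongSum`, and `PoweredLongSum` are hypothetical names, not Druid classes, and a `LongSupplier` stands in for Druid's column selector.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch, not Druid code: pick the aggregator once, at creation
// time, so the per-row aggregate() call carries no exponent branch.
final class SumAggregatorDispatch
{
  private SumAggregatorDispatch() {}

  interface LongAgg
  {
    void aggregate();

    long get();
  }

  // Plain sum, equivalent to the unchanged aggregator (exponent = 1).
  static final class PlainLongSum implements LongAgg
  {
    private final LongSupplier selector;
    private long sum;

    PlainLongSum(LongSupplier selector)
    {
      this.selector = selector;
    }

    @Override
    public void aggregate()
    {
      sum += selector.getAsLong();
    }

    @Override
    public long get()
    {
      return sum;
    }
  }

  // Raises each value to a fixed non-negative exponent before adding it.
  static final class PoweredLongSum implements LongAgg
  {
    private final LongSupplier selector;
    private final int exponent;
    private long sum;

    PoweredLongSum(LongSupplier selector, int exponent)
    {
      this.selector = selector;
      this.exponent = exponent;
    }

    @Override
    public void aggregate()
    {
      long value = selector.getAsLong();
      long powered = 1;
      // Repeated multiplication; overflow is not handled in this sketch.
      for (int i = 0; i < exponent; i++) {
        powered *= value;
      }
      sum += powered;
    }

    @Override
    public long get()
    {
      return sum;
    }
  }

  // The only branch on the exponent happens here, once per aggregator.
  static LongAgg create(LongSupplier selector, int exponent)
  {
    return exponent == 1 ? new PlainLongSum(selector) : new PoweredLongSum(selector, exponent);
  }
}
```

With a selector that always returns 3, two calls to `aggregate()` give 6 for exponent 1 and 18 for exponent 2.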
|
@xvrl @drcrallen the main intention of this PR is to find out whether supporting the exponent makes sense as a way to provide basic statistics support. I believe it does, but wanted to get second opinions. |
|
@himanshug I'm curious if other methods like http://www.johndcook.com/blog/standard_deviation/ are more fitting |
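For context, the linked post describes a one-pass running-variance update usually attributed to Welford. A minimal sketch of that technique follows; the class name `RunningVariance` and its methods are made up for illustration and are not part of Druid or this PR.

```java
// Illustrative sketch of Welford's running-variance update (hypothetical class).
final class RunningVariance
{
  private long count;
  private double mean;
  private double m2; // running sum of squared deviations from the current mean

  void add(double x)
  {
    count++;
    double delta = x - mean;
    mean += delta / count;
    // Uses the updated mean, which keeps the accumulation numerically stable.
    m2 += delta * (x - mean);
  }

  long count()
  {
    return count;
  }

  double mean()
  {
    return mean;
  }

  // Divides by n; returns 0 when no values have been added.
  double populationVariance()
  {
    return count > 0 ? m2 / count : 0.0;
  }

  // Divides by n - 1; returns 0 when fewer than two values have been added.
  double sampleVariance()
  {
    return count > 1 ? m2 / (count - 1) : 0.0;
  }

  double sampleStandardDeviation()
  {
    return Math.sqrt(sampleVariance());
  }
}
```

Unlike the sum-of-squares approach, this never forms a large sum of x squared, which is why it is less prone to overflow and cancellation. The trade-off is that the state is three coupled values rather than independent sums.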
|
@himanshug Overall I'm on board with supporting either:
If the objective in this particular case is to compute variance, I don't think it makes sense to have a general I also agree that, if we only care about variance, it may be worthwhile to look into different numerical techniques to avoid overflow problems, given the scale that we intend to have Druid run at. Even if we do go the route of having specialized aggregators, I would prefer if we exposed some form of function composition at the query API level, and then translated internally into special-purpose aggregations for the cases that we have optimized. |
|
+1 on composing aggregators at the query level to do this; the implementation can then optimize things based on the compositions. |
|
@drcrallen I agree that a sketch approach specific to variance would give better results, but complex types are a bit slower. Anyway, it seems like people are generally in favor of supporting this functionality, so I will do the necessary things and update this PR.
1ee3f02 to 3b261b2
|
I still think this should be a separate aggregator because it muddles user expectation. We already have a A longSum aggregator in this PR cannot be used the same way at ingestion time vs query time. For example, a longSum aggregator with a power-2 exponent specified at both ingestion and query time will produce a power-4 result... ASSUMING no rollup. If the exponent is applied at QUERY time, then for cases where the data was rolled up the results are very likely NOT going to be what the user is looking for. A simple scenario would be ingesting some double value (with doubleSum), and then issuing a doubleSum with an exponent in a query. In such a scenario the result would be some absurd polynomial of the original data that doesn't at all resemble what the user was actually looking for. As such, this functionality would be an advanced feature and, if possible, should be distinct from the default longSum/doubleSum at least in documentation, and (my personal preference) in nomenclature. |
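The two failure modes described in this comment can be reproduced with a small self-contained model. This is a sketch only: `RollupExponentDemo` and its methods are hypothetical and simulate rollup and exponentiation on plain arrays rather than using any Druid API.

```java
// Hypothetical model of the ingestion-time vs query-time exponent problem.
final class RollupExponentDemo
{
  private RollupExponentDemo() {}

  // Integer power by repeated multiplication (no overflow handling).
  static long pow(long base, int exponent)
  {
    long result = 1;
    for (int i = 0; i < exponent; i++) {
      result *= base;
    }
    return result;
  }

  // Models a sum aggregator that raises each input to `exponent` before adding.
  static long sumOfPowers(long[] values, int exponent)
  {
    long sum = 0;
    for (long v : values) {
      sum += pow(v, exponent);
    }
    return sum;
  }

  // Models ingestion-time rollup with a plain longSum: each run of
  // `groupSize` consecutive rows collapses into one stored value.
  static long[] rollUp(long[] rows, int groupSize)
  {
    long[] rolled = new long[(rows.length + groupSize - 1) / groupSize];
    for (int i = 0; i < rows.length; i++) {
      rolled[i / groupSize] += rows[i];
    }
    return rolled;
  }

  // Models ingestion with the exponent applied and no rollup:
  // one stored value per input row.
  static long[] ingestWithExponent(long[] rows, int exponent)
  {
    long[] stored = new long[rows.length];
    for (int i = 0; i < rows.length; i++) {
      stored[i] = pow(rows[i], exponent);
    }
    return stored;
  }
}
```

For rows 1, 2, 3, 4 the intended sum of squares is 30. Rolling up in pairs stores 3 and 7, so squaring at query time yields 9 + 49 = 58. Applying exponent 2 at both ingestion and query time without rollup yields 1 + 16 + 81 + 256 = 354, which is the sum of fourth powers.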
|
@drcrallen Even if we create a separate aggregator to handle this, the user will have to provide a different aggregator [configuration] at ingestion time vs query time, because you cannot detect the query/ingestion context inside the aggregator implementation. Please let me know if I am missing something here. |
generated java files were failing to compile because of the constructor change. I decided to remove it instead of fixing as this is not used.
|
My two primary concerns with adding exponent to the existing aggregator factories are the following
How about instead of adding an |
needs an updated cache key for all factories.
|
@xvrl |
|
@himanshug that seems reasonable, although we might want to separate out the type used for serde from the actual implementation of the function. This will make it easier to abstract the query api from the function implementation. What do you think? Anyone else have thoughts on this, maybe @cheddar ? |
|
@xvrl I confirmed with @cheddar and the "function" attribute is OK with him. |
|
@himanshug can we rebase this from master again? |
|
@himanshug can you reconcile this and #2525 regarding functionality and how it relates to #2090 ? |
|
@drcrallen I haven't gone through #2525 completely, but that is a dedicated aggregator for indexing and computing variance, while the change proposed here is a general feature to compute any polynomial function of the input. #2090 provides the expression language for computing those functions and is therefore a prerequisite. |
|
@himanshug #2090 has been merged so I think this can be finished up |
|
I think similar work is done in some other PRs, will take a look at those and will reopen if needed. |
which allows storing
sum(function(x)) and not just sum(x)
Now, say you want to report the standard deviation of a metric x. You would need to store 3 columns:
sum of x, sum of squares of x, and count
at query time, it should be possible to compute standard deviation by using the formula..
it would be variance if you didn't do square root
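As a sketch of the query-time calculation described here, the standard identity Var(x) = E[x^2] - (E[x])^2 can be applied to the three stored values. The original formula is not shown in this description, so the identity is an assumption about what was meant, and `MomentStats` is a hypothetical helper rather than part of this PR.

```java
// Hypothetical helper, not part of Druid: derives population variance and
// standard deviation from the stored sum, sum of squares, and count.
final class MomentStats
{
  private MomentStats() {}

  // Population variance: sumXSquared / n - (sumX / n)^2.
  static double variance(double sumX, double sumXSquared, long count)
  {
    if (count <= 0) {
      throw new IllegalArgumentException("count must be positive");
    }
    double mean = sumX / count;
    double variance = sumXSquared / count - mean * mean;
    // Rounding can push a tiny variance slightly below zero; clamp it.
    return Math.max(variance, 0.0);
  }

  // Standard deviation is the square root of the variance.
  static double standardDeviation(double sumX, double sumXSquared, long count)
  {
    return Math.sqrt(variance(sumX, sumXSquared, count));
  }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the stored columns would be sum = 40, sum of squares = 232, and count = 8, which gives a variance of 4 and a standard deviation of 2. This form subtracts two potentially large numbers, so it can lose precision when the mean is large relative to the spread.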
In general, by just using long/double sum, users can compute a variety of statistical functions and other polynomial functions.
also, support for computing square root is added to arithmetic post aggregator.
TODO: wondering whether to add a "variance" post agg.