Skip to content

add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg#1965

Closed
himanshug wants to merge 4 commits intoapache:masterfrom
himanshug:more_aggs
Closed

add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg#1965
himanshug wants to merge 4 commits intoapache:masterfrom
himanshug:more_aggs

Conversation

@himanshug
Copy link
Copy Markdown
Contributor

which allows storing
sum(function(x)) and not just sum(x)

now, say, you want to report standard deviation on a metric x. you would need to store 3 columns
sum of x, sum of square of x and count

at query time, it should be possible to compute standard deviation by using the formula..

image

it would be variance if you didn't do square root

in general, by just using long/double sum , users can possibly compute a variety of statistics functions and other polynomial functions.

also, support for computing square root is added to arithmetic post aggregator.
TODO: wondering whether to add a "variance" post agg.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we quantify the performance impact of the additional branch check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xvrl will do that

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ran some tests using following code

@State(Scope.Benchmark)
public class LongSumAggregatorBenchmark
{
  private LongColumnSelector selector = new LongColumnSelector()
  {
    @Override
    public long get()
    {
      return 100l;
    }
  };

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  public void benchmarkAggregate(Blackhole blackhole)
  {
    LongSumAggregator aggregator = new LongSumAggregator("name", selector, 1);
    for (int i = 0; i < 10000000; i++) {
      aggregator.aggregate();
    }
    blackhole.consume(aggregator.get());
  }

  public static void main(String[] args) throws Exception
  {
    Options opt = new OptionsBuilder()
        .include(".*" + LongSumAggregatorBenchmark.class.getSimpleName() + ".*")
        .warmupIterations(5)
        .forks(1)
        .build();
    new Runner(opt).run();
  }
}

with branching..

Result "benchmarkAggregate":
  210.999 ±(99.9%) 12.971 us/op [Average]
  (min, avg, max) = (192.895, 210.999, 241.701), stdev = 14.937
  CI (99.9%): [198.028, 223.970] (assumes normal distribution)

without branching..

Result "benchmarkAggregate":
  209.661 ±(99.9%) 10.036 us/op [Average]
  (min, avg, max) = (189.288, 209.661, 234.579), stdev = 11.557
  CI (99.9%): [199.625, 219.697] (assumes normal distribution)

note that, on multiple runs, numbers for both fluctuate a bit for both cases and I can't see any major difference in performance between two... branching is OK and creating another aggregator ( #1965 (comment) ) to handle exponent case is not really needed.

@drcrallen
Copy link
Copy Markdown
Contributor

Can this be a new aggregator?

@himanshug
Copy link
Copy Markdown
Contributor Author

@drcrallen yeah, if branching introduces a major performance issue then we can have another aggregator like GenericLongSumAggregator (and other similar ones). LongSumAggregator (without change) gets used when exponent=1 and GenericLongSumAggregator used otherwise.

@himanshug
Copy link
Copy Markdown
Contributor Author

@xvrl @drcrallen main intention of this PR is to showcase whether the idea of supporting the exponent makes sense to have basic statistics support . I believe, it does but wanted to get second opinions.

@drcrallen
Copy link
Copy Markdown
Contributor

@himanshug I'm curious if other methods like http://www.johndcook.com/blog/standard_deviation/ are more fitting

@xvrl
Copy link
Copy Markdown
Member

xvrl commented Nov 13, 2015

@himanshug Overall I'm on board with supporting either:

  • some kind of function composition to allow composing aggregators with other functions.
  • some specialized methods for computing statistics such as variance.

If the objective in this particular case is to compute variance, I don't think it makes sense to have a general sum(x^y) aggregator, since we will almost always use x^2, and x*x will be orders of magnitude faster than using the generic power operator.

I also agree, that if only care about variance, it may be worthwhile to look into different numerical techniques to avoid overflow problems, given the scale that we intend to have Druid run at.

Even if we do go the route of having specialized aggregators, I would prefer if we exposed some form of function composition at the query API level, and then translate internally into special-purpose aggregations for the cases that we have optimized.

@nishantmonu51
Copy link
Copy Markdown
Member

+1 on having composing aggregators at the query level to do this and the implementation can can optimize things based on the compositions.

@himanshug
Copy link
Copy Markdown
Contributor Author

@drcrallen while I agree that a sketch approach specific to variance would give better results but complex types are a bit slower.
composable approach here is more general and allows computing any polynomial functions. e.g. this can be used to do things like simple linear/logistic regression given that model parameters are known upfront (say from training in a batch pipeline).
also , we can still implement variance specific complex aggregator at some point. However, for the sketches, going forward, I would put the sketching code in the datasketches library (so that same thing becomes available for pig/hive if users wish to pre-aggregate data in the batch pipeline) and implementing the druid aggregator inside datasketches-aggregators extension.

anyways, it seems like ppl are in general favor of supporting this functionality , so will do the necessary things and update this PR.

@himanshug himanshug removed the Discuss label Nov 15, 2015
@himanshug himanshug changed the title discuss - support variance and standard deviation add "exponent" field to long/double Sum aggs and "sqrt" to arithmetic post agg Nov 15, 2015
@himanshug himanshug force-pushed the more_aggs branch 2 times, most recently from 1ee3f02 to 3b261b2 Compare November 15, 2015 06:50
@drcrallen
Copy link
Copy Markdown
Contributor

I still think this should be a separate aggregator because it muddles user expectation.

We already have a count aggregator which returns different results at ingestion time vs query time, and that causes a lot of confusion.

A longSum aggregator in this PR cannot be used the same at ingestion time vs query time. For example, a longSum aggregator with a power 2 exponential specified at both ingestion and query time will result in a result of a power 4 result... ASSUMING no rollup.

If the exponential is applied at QUERY time, then for cases where the data was rolled up the results are very likely NOT going to be what the user is looking for. A simple scenario would be ingesting some double value at ingestion time (with doubleSum), and then issuing a doubleSum with an exponential in a query. In such a scenario the result would be some absurd polynomial of the original data that doesn't at all resemble what the user was actually looking for.

As such this functionality would be an advanced feature and, if possible, should be distinct from the default longSum/doubleSum at least in documentation, and (my personal preference) in nomenclature.

@himanshug
Copy link
Copy Markdown
Contributor Author

@drcrallen Even if we create a separate aggregator to handle this, user will have to provide different aggregator [configuration] at ingestion time vs query time because you can not detect query/ingestion context inside aggregator implementation. Please let me know if I am missing something here.
I still think it is OK to have new attribute exponent in same longSum and doubleSum aggregators to support this use case. It is totally backward compatible and we can document it separately if needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems unrelated to this PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generated java files were failing to compile because of the constructor change. I decided to remove it instead of fixing as this is not used.

@xvrl
Copy link
Copy Markdown
Member

xvrl commented Nov 17, 2015

My two primary concerns with adding exponent to the existing aggregator factories are the following

  • it creates a precent for tacking on additional functionality to the existing aggregator primitives. Let's say I want to compute a geometric mean, I would probably do a sum of log, should we then add an ln flag to doubleSum?
  • it does not provide a clean path to support function composition at the API level, nor a clean implementation amenable to further optimizations.

How about instead of adding an exponent field on the aggregator-factory, we add a function field that allows passing some kind of function of the column value. For now we can support only exponent, and pass this exponent to a specialized aggregator, but at least it leaves the path open to support more things down the road.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs an updated cache key for all factories.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

@himanshug
Copy link
Copy Markdown
Contributor Author

@xvrl function instead of exponent sounds good. do you mean adding support for predefined functions or do you mean accepting javascript ? I guess it is the former but just confirming.

@himanshug
Copy link
Copy Markdown
Contributor Author

@xvrl pls see 2ef5582 for generic "function" support instead of "exponenty" , does it look OK? (incomplete but demonstrates the idea)

@xvrl
Copy link
Copy Markdown
Member

xvrl commented Nov 25, 2015

@himanshug that seems reasonable, although we might want to separate out the type used for serde from the actual implementation of the function. This will make it easier to abstract the query api from the function implementation. What do you think? Anyone else have thoughts on this, maybe @cheddar ?

@himanshug himanshug changed the title add "exponent" field to long/double Sum aggs and "sqrt" to arithmetic post agg add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg Dec 2, 2015
@himanshug
Copy link
Copy Markdown
Contributor Author

@xvrl i confirmed with @cheddar and the "function" attribute OK with him.
However (may be in future), it would be nice to take a math expression as function so that user can do arbitrary functions without us having to implement them all.
I would like to know if there is a suitable math expression language implementation available already that we can use or else I can define and write a custom one? ( i understand that we can do it with javascript evaluator)

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Jan 8, 2016

@himanshug can we rebase this from master again?

@himanshug
Copy link
Copy Markdown
Contributor Author

@fjy i'm waiting for #2090 to be merged first then get back to this one.

@drcrallen
Copy link
Copy Markdown
Contributor

@himanshug can you reconcile this and #2525 regarding functionality and how it relates to #2090 ?

@himanshug
Copy link
Copy Markdown
Contributor Author

@drcrallen I haven't gone through #2525 completely, but that is a dedicated aggregator for indexing and computing variance while the change proposed here is general feature to compute any polynomial function on the input. #2090 provides the expression language for computing those functions and as a result a pre-requisite.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Apr 12, 2016

@himanshug #2090 has been merged so I think this can be finished up

@himanshug
Copy link
Copy Markdown
Contributor Author

I think similar work is done in some other PRs, will take a look at those and will reopen if needed.

@himanshug himanshug closed this Jun 29, 2016
@himanshug himanshug deleted the more_aggs branch January 3, 2017 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants