First and Last Aggregator#3226

Closed
acslk wants to merge 7 commits intoapache:masterfrom
acslk:feature-firstlast

Conversation

@acslk
Contributor

@acslk acslk commented Jul 7, 2016

This PR implements the 'first' and 'last' aggregator discussed in #2845.

The first and last aggregator can be used in the following format

{ "type" : "doubleFirst", "name" : <output_name>, "fieldName" : <metric_name>}
{ "type" : "longFirst", "name" : <output_name>, "fieldName" : <metric_name>}
{ "type" : "doubleLast", "name" : <output_name>, "fieldName" : <metric_name>}
{ "type" : "longLast", "name" : <output_name>, "fieldName" : <metric_name>}

The first aggregator outputs the value of fieldName with the smallest timestamp (using the __time column), while the last aggregator outputs the value of fieldName with the largest timestamp. If multiple rows share the smallest or largest timestamp, one of them is selected arbitrarily.
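
The selection rule can be sketched as follows (illustrative Python, not Druid's implementation): each row contributes a (timestamp, value) pair, and pairs combine by keeping the one with the smaller or larger timestamp:

```python
from functools import reduce

# Illustrative sketch of first/last semantics (not Druid's actual code).
# Each row is a (timestamp, value) pair; "first" keeps the pair with the
# smallest timestamp, "last" the pair with the largest.

def combine_first(a, b):
    # Ties are broken arbitrarily; here <= keeps the left operand.
    return a if a[0] <= b[0] else b

def combine_last(a, b):
    return a if a[0] >= b[0] else b

rows = [(1000, 3.5), (2000, 7.25), (1500, -1.0)]

first = reduce(combine_first, rows)
last = reduce(combine_last, rows)
print(first[1])  # value with the smallest __time -> 3.5
print(last[1])   # value with the largest __time -> 7.25
```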

@fjy fjy added this to the 0.9.2 milestone Jul 7, 2016
@fjy fjy added the Feature label Jul 7, 2016
@fjy
Contributor

fjy commented Jul 7, 2016

@acslk we need docs in docs/content/querying aggregations doc

you can probably just C&P the description in this PR for docs

Contributor

do you need a @JsonCreator annotation?

Contributor Author

I don't need it for this PR since I create the SerializablePair using an object map in AggregatorFactory instead of using the object mapper, but it should be useful if this class is used in the future.

Contributor

It's good code cleanliness to have deserializers when we have serializers, so let's add both (and a test for both).
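
The symmetry being asked for (a deserializer to match the serializer, plus tests for both) can be sketched like this (illustrative Python stand-in, not the Java/Jackson code under review):

```python
import json

# Illustrative stand-in for a SerializablePair-style (lhs, rhs) object
# with a matching serializer/deserializer pair. Field names are hypothetical.

def serialize(pair):
    lhs, rhs = pair
    return json.dumps({"lhs": lhs, "rhs": rhs})

def deserialize(s):
    obj = json.loads(s)
    return (obj["lhs"], obj["rhs"])

# A round-trip test exercises both directions at once.
original = (1467849600000, 42.5)
assert deserialize(serialize(original)) == original
```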

@gianm
Contributor

gianm commented Jul 13, 2016

These should be documented in aggregations.md.

Contributor

valueType is probably clearer

@gianm
Contributor

gianm commented Jul 13, 2016

I think this won't work at indexing time as-is; we would need a serde for writing out columns that have the (timestamp, last value) pairs in them.
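
A serde for such pair columns would need to encode a (long timestamp, double value) pair per row. A minimal, hypothetical sketch of that encoding (illustrative Python, not the Java serde this PR would need to add):

```python
import struct

# Hypothetical fixed-width encoding of a (long timestamp, double value) pair,
# i.e. roughly what a complex-column serde for doubleLast would write/read.
# '>qd' = big-endian 8-byte signed long followed by an 8-byte IEEE double.

def serialize_pair(timestamp_ms, value):
    return struct.pack('>qd', timestamp_ms, value)

def deserialize_pair(buf):
    return struct.unpack('>qd', buf)

buf = serialize_pair(1467849600000, 42.5)
assert len(buf) == 16
assert deserialize_pair(buf) == (1467849600000, 42.5)
```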

@acslk
Contributor Author

acslk commented Jul 27, 2016

@gianm Using the first/last aggregator at ingestion time is tricky because of the value we want to store. Ideally, we want to persist the first and last metric as a long/double column so other aggregators such as sum can aggregate it. However, doing so would make merging persisted data incorrect, since no time values are stored for the metric. If the values are instead stored as time-value pairs, the column could not be aggregated by standard aggregators. Basically, the problem is that we want an intermediate storage format for merging and a different final storage format for querying, and this cannot be done in the ingestion process. For now I'll just leave a note in aggregations.md that first/last aggregators can't be used at ingestion time.
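
The merge problem described here can be made concrete with a small sketch (illustrative Python with hypothetical segment values, not Druid code):

```python
# Why "last" needs timestamps when merging persisted segments.
# Two hypothetical persisted segments holding the "last" result for the
# same row key, each stored together with that row's __time.
seg_a = (2000, 7.25)  # (timestamp, value)
seg_b = (5000, 3.5)

# Correct: with (timestamp, value) pairs, the merger keeps the later value.
merged = seg_a if seg_a[0] >= seg_b[0] else seg_b
print(merged)  # (5000, 3.5)

# Incorrect: had only the plain values been persisted as a double column,
# the merger would have no timestamps and could only pick arbitrarily
# (e.g. by segment order), which here keeps the wrong value 7.25.
values_only = [7.25, 3.5]
arbitrary = values_only[0]
assert arbitrary != merged[1]
```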

@acslk acslk force-pushed the feature-firstlast branch from f74ed83 to b5f7d9d Compare July 27, 2016 02:28
@acslk
Contributor Author

acslk commented Jul 27, 2016

It seems that there is very little shared code between the long and double type aggregators, so in the newer commit I changed the syntax to match max, min, and sum, with doubleFirst/Last and longFirst/Last.

Contributor

we should really define these constants in another file somewhere. It is getting more and more difficult to track available values

Contributor

Or find some other, better way to track cache key uniqueness guarantees. Right now it is pretty much impossible for extensions to guarantee uniqueness of cache ids across them.

Contributor Author

we can store the cache keys in a helper class similar to how DimFilter cache keys are stored, but I think that can go in another PR

Contributor

Fixing this in a larger way is outside the scope of this PR.

@fjy
Contributor

fjy commented Aug 3, 2016

👍

Contributor

I understand it's not usable at ingestion time, but shouldn't we at least return float or long here?

Contributor Author

Sorry if I misunderstood, but are you suggesting to have getTypeName return some type other than float or long?

Contributor

Ah, forget that. Now I understand the intention.

@acslk
Contributor Author

acslk commented Aug 9, 2016

What should the default value of the inner query finalize parameter be? Keeping it the same as before would make the default false, but having it be true feels more consistent with the outer query.

@gianm
Contributor

gianm commented Aug 9, 2016

hmm, I agree finalize = true makes the most sense, but I think in this case compatibility concerns win. So let's make it false by default.

Contributor

should be "DoubleLastAggregatorFactory{"

Contributor Author

changed it to Double, strange that I got both Long and Double wrong

@jon-wei
Contributor

jon-wei commented Aug 10, 2016

some minor comments, looks good so far, will review again after query finalization comments from @gianm are resolved

@acslk acslk force-pushed the feature-firstlast branch from f15f28e to 11167f9 Compare August 13, 2016 00:05
@acslk
Contributor Author

acslk commented Aug 13, 2016

Added an option to finalize the inner query, and also slightly changed how v1 builds the inner IncrementalIndex. The v1 strategy processes the inner query result by building an IncrementalIndex on the query result, with aggregators from AggregatorFactory.getRequiredColumn(). When building the IncrementalIndex, getCombiningFactory is called on the passed-in aggregators. This makes sense for the merging runners that use IncrementalIndex, but not so much for copying values from results. I parameterized whether or not to use the combining factory, so indexing the inner query does not use the combining factory.

@gauravkumar37
Contributor

gauravkumar37 commented Aug 14, 2016

@acslk Thanks for putting in the effort to make this possible. I tried this pull request on the latest master branch. Though it works for aggregators, I have 3 issues:

  • It does not work if the aggregator's output is fed to a having clause. Group by strategy used was v1. Error:
{
  "error": "Unknown exception",
  "errorMessage": "Unknown type[class io.druid.collections.SerializablePair]",
  "errorClass": "com.metamx.common.parsers.ParseException",
  "host": null
}
  • It may be obvious but the first/last aggregators can only work on the output of the merged/stored data in druid. That is, within a single index-time query granularity, this cannot work and will give random results. I believe this should be highlighted in the docs as well.
  • In a nested group by query (v1 strategy), if the aggregator is used in the inner group by query, it is not accessible to the outer group by query. The error thrown is Encountered parse error for aggregator[last_agg].

@acslk
Contributor Author

acslk commented Aug 16, 2016

@gauravkumar37 Thanks for the feedback, here are my thoughts on the issues:

  • I haven't run a query using a having clause before, but looking at the tests, it seems that aggregators with a complex intermediate type such as HyperUnique don't work with having clauses (havingSpec equalTo/greaterThan/lessThan do not work on complex types #2507). I can't think of a nice solution for this now, so perhaps we can add a workaround similar to Add comparator to HyperUniquesFinalizingPostAggregator. #2496 if this feature is needed.
  • Maybe I'm not understanding this properly, but is there a use case for this other than ingestion?
  • Did you try "context" : {"finalize" : true} in the inner group by query? It should fix the error; check the comments above for an explanation. This is not very intuitive, and should probably go into the docs.
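
For the nested groupBy case, that workaround would look roughly like this (a sketch only: dimension/metric names are hypothetical, and other required query fields such as dimensions, granularity, and intervals are omitted for brevity; the key part is the finalize flag in the inner query's context):

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "aggregations": [
        { "type": "doubleLast", "name": "last_agg", "fieldName": "some_metric" }
      ],
      "context": { "finalize": true }
    }
  }
}
```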

public AggregatorFactory apply(String input)
{
-      return new JavaScriptAggregatorFactory(input, fieldNames, fnAggregate, fnReset, fnCombine, config);
+      return new JavaScriptAggregatorFactory(input, Lists.newArrayList(input), fnCombine, fnReset, fnCombine, config);
Contributor

Why is this change to JavaScriptAggregatorFactory needed?

Contributor Author

Previously getCombiningFactory was always called on top of getRequiredColumn to get the identity AggregatorFactory needed for copying the javascript aggregator values. Since getCombiningFactory is no longer called on this, the original getRequiredColumn doesn't really make sense for copying values.

Contributor Author

Apart from javascript, the implementation of getRequiredColumn for the other aggregatorFactories seems to work fine without converting to the combiningFactory.

@jon-wei
Contributor

jon-wei commented Aug 23, 2016

LGTM, 👍

@drcrallen
Contributor

@acslk if you can fix the conflicts I can help finish up review.

@gianm gianm modified the milestones: 0.9.3, 0.9.2 Sep 20, 2016
@gianm
Contributor

gianm commented Sep 20, 2016

since this is one of the last few PRs for 0.9.2, as discussed on the call last week, I'll bump it to 0.9.3.

@acslk
Contributor Author

acslk commented Sep 22, 2016

rebased and resolved conflicts

@jon-wei
Contributor

jon-wei commented Oct 13, 2016

Closing this since @acslk is no longer active on Druid development; rebased and re-opened at:

#3566

@jon-wei jon-wei removed this from the 0.9.3 milestone Nov 4, 2016
@gianm
Contributor

gianm commented Sep 21, 2017

Superseded by #3566.

@gianm gianm closed this Sep 21, 2017