First and Last Aggregator#3226

Closed
acslk wants to merge 7 commits intoapache:masterfrom
acslk:feature-firstlast

Conversation

@acslk
Contributor

@acslk acslk commented Jul 7, 2016

This PR implements the 'first' and 'last' aggregator discussed in #2845.

The first and last aggregator can be used in the following format

{ "type" : "doubleFirst", "name" : <output_name>, "fieldName" : <metric_name>}
{ "type" : "longFirst", "name" : <output_name>, "fieldName" : <metric_name>}
{ "type" : "doubleLast", "name" : <output_name>, "fieldName" : <metric_name>}
{ "type" : "longLast", "name" : <output_name>, "fieldName" : <metric_name>}

The first aggregator outputs the value of fieldName with the smallest timestamp (using the __time column), while the last aggregator outputs the value of fieldName with the largest timestamp. If multiple rows share the smallest or largest timestamp, one of them is selected arbitrarily.
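
The selection rule can be sketched as follows (illustrative Python, not Druid's implementation): each row contributes a (timestamp, value) pair, and pairs combine by keeping the one with the smaller or larger timestamp:

```python
from functools import reduce

# Illustrative sketch of first/last semantics (not Druid's actual code).
# Each row is a (timestamp, value) pair; "first" keeps the pair with the
# smallest timestamp, "last" the pair with the largest.

def combine_first(a, b):
    # Ties are broken arbitrarily; here <= keeps the left operand.
    return a if a[0] <= b[0] else b

def combine_last(a, b):
    return a if a[0] >= b[0] else b

rows = [(1000, 3.5), (2000, 7.25), (1500, -1.0)]

first = reduce(combine_first, rows)
last = reduce(combine_last, rows)
print(first[1])  # value with the smallest __time -> 3.5
print(last[1])   # value with the largest __time -> 7.25
```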

@fjy fjy added this to the 0.9.2 milestone Jul 7, 2016
@fjy fjy added the Feature label Jul 7, 2016
@fjy
Contributor

fjy commented Jul 7, 2016

@acslk we need docs in docs/content/querying aggregations doc

you can probably just C&P the description in this PR for docs

Contributor

do you need a @JsonCreator annotation?

Contributor Author

I don't need it for this PR since I create the SerializablePair using an object map in AggregatorFactory instead of using the object mapper, but it should be useful if this class is used in the future.

Contributor

It's good code cleanliness to have deserializers when we have serializers, so let's add both (and a test for both).
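
The symmetry being asked for (a deserializer to match the serializer, plus tests for both) can be sketched like this (illustrative Python stand-in, not the Java/Jackson code under review):

```python
import json

# Illustrative stand-in for a SerializablePair-style (lhs, rhs) object
# with a matching serializer/deserializer pair. Field names are hypothetical.

def serialize(pair):
    lhs, rhs = pair
    return json.dumps({"lhs": lhs, "rhs": rhs})

def deserialize(s):
    obj = json.loads(s)
    return (obj["lhs"], obj["rhs"])

# A round-trip test exercises both directions at once.
original = (1467849600000, 42.5)
assert deserialize(serialize(original)) == original
```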

@gianm
Contributor

gianm commented Jul 13, 2016

These should be documented in aggregations.md.

Contributor

valueType is probably clearer

@gianm
Contributor

gianm commented Jul 13, 2016

I think this won't work at indexing time as-is; we would need a serde for writing out columns that have the (timestamp, last value) pairs in them.
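
A serde for such pair columns would need to encode a (long timestamp, double value) pair per row. A minimal, hypothetical sketch of that encoding (illustrative Python, not the Java serde this PR would need to add):

```python
import struct

# Hypothetical fixed-width encoding of a (long timestamp, double value) pair,
# i.e. roughly what a complex-column serde for doubleLast would write/read.
# '>qd' = big-endian 8-byte signed long followed by an 8-byte IEEE double.

def serialize_pair(timestamp_ms, value):
    return struct.pack('>qd', timestamp_ms, value)

def deserialize_pair(buf):
    return struct.unpack('>qd', buf)

buf = serialize_pair(1467849600000, 42.5)
assert len(buf) == 16
assert deserialize_pair(buf) == (1467849600000, 42.5)
```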

@acslk
Contributor Author

acslk commented Jul 27, 2016

@gianm Using the first/last aggregator at ingestion time is tricky because of the value we want to store. Ideally, we want to persist the first and last metric as a long/double column so other aggregators such as sum can aggregate it. However, doing so would make merging persisted data incorrect, since no time values are stored for the metric. If the values are instead stored as time-value pairs, the column could not be aggregated by standard aggregators. Basically, the problem is that we want an intermediate storage format for merging and a different final storage format for querying, and this cannot be done in the ingestion process. For now I'll just leave a note in aggregations.md that first/last aggregators can't be used at ingestion time.
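
The merge problem described here can be made concrete with a small sketch (illustrative Python with hypothetical segment values, not Druid code):

```python
# Why "last" needs timestamps when merging persisted segments.
# Two hypothetical persisted segments holding the "last" result for the
# same row key, each stored together with that row's __time.
seg_a = (2000, 7.25)  # (timestamp, value)
seg_b = (5000, 3.5)

# Correct: with (timestamp, value) pairs, the merger keeps the later value.
merged = seg_a if seg_a[0] >= seg_b[0] else seg_b
print(merged)  # (5000, 3.5)

# Incorrect: had only the plain values been persisted as a double column,
# the merger would have no timestamps and could only pick arbitrarily
# (e.g. by segment order), which here keeps the wrong value 7.25.
values_only = [7.25, 3.5]
arbitrary = values_only[0]
assert arbitrary != merged[1]
```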

@acslk acslk force-pushed the feature-firstlast branch from f74ed83 to b5f7d9d Compare July 27, 2016 02:28
@acslk
Contributor Author

acslk commented Jul 27, 2016

It seems that there is very little shared code between the long and double type aggregators, so in the newer commit I changed the syntax to match max, min, and sum, with doubleFirst/Last and longFirst/Last.

Contributor

we should really define these constants in another file somewhere. It is getting more and more difficult to track available values

Contributor

Or find some other, better way to track cache key uniqueness guarantees. Right now it is pretty much impossible for extensions to guarantee uniqueness of cache ids across them.

Contributor Author

we can store the cache keys in a helper class similar to how DimFilter cache keys are stored, but I think that can go in another PR

Contributor

Fixing this in a larger way is outside the scope of this PR.

@fjy
Contributor

fjy commented Aug 3, 2016

👍

Contributor

I understand it's not usable at ingestion time, but shouldn't we at least return float or long here?

Contributor Author

Sorry if I misunderstood, but are you suggesting to have getTypeName return some type other than float or long?

Contributor

Ah, forget that. Now I understand the intention.

@acslk
Contributor Author

acslk commented Aug 9, 2016

What should the default value of the inner query finalize parameter be? Keeping it the same as before would make the default false, but having it be true feels more consistent with the outer query.

@gianm
Contributor

gianm commented Aug 9, 2016

hmm, I agree finalize = true makes the most sense, but I think in this case compatibility concerns win. So let's make it false by default.

Contributor

should be "DoubleLastAggregatorFactory{"

Contributor Author

changed it to Double, strange that I got both Long and Double wrong

@jon-wei
Contributor

jon-wei commented Aug 10, 2016

some minor comments, looks good so far, will review again after query finalization comments from @gianm are resolved

@acslk acslk force-pushed the feature-firstlast branch from f15f28e to 11167f9 Compare August 13, 2016 00:05
@acslk
Contributor Author

acslk commented Aug 13, 2016

Added an option to finalize the inner query, and also slightly changed how v1 builds the inner IncrementalIndex. The v1 strategy processes the inner query result by building an IncrementalIndex on the query result, with aggregators from AggregatorFactory.getRequiredColumn(). When building the IncrementalIndex, getCombiningFactory is called on the passed-in aggregators. This makes sense for the merging runners that use IncrementalIndex, but not so much for copying values from results. I parameterized whether or not to use the combining factory, so indexing the inner query does not use the combining factory.

@gauravkumar37
Contributor

gauravkumar37 commented Aug 14, 2016

@acslk Thanks for putting in the effort to make this possible. I tried this pull request on the latest master branch. Though it works for aggregators, I have 3 issues:

  • It does not work if the aggregator's output is fed to a having clause. Group by strategy used was v1. Error:
{
  "error": "Unknown exception",
  "errorMessage": "Unknown type[class io.druid.collections.SerializablePair]",
  "errorClass": "com.metamx.common.parsers.ParseException",
  "host": null
}
  • It may be obvious but the first/last aggregators can only work on the output of the merged/stored data in druid. That is, within a single index-time query granularity, this cannot work and will give random results. I believe this should be highlighted in the docs as well.
  • In a nested group by query (v1 strategy), if the aggregator is used in the inner group by query, it is not accessible to the outer group by query. The error thrown is Encountered parse error for aggregator[last_agg].

@acslk
Contributor Author

acslk commented Aug 16, 2016

@gauravkumar37 Thanks for the feedback, here are my thoughts on the issues:

  • I haven't run a query using a having clause before, but looking at the tests, it seems that aggregators with a complex intermediate type such as HyperUnique don't work with having clauses (havingSpec equalTo/greaterThan/lessThan do not work on complex types #2507). I can't think of a nice solution for this now, so perhaps we can add a workaround similar to Add comparator to HyperUniquesFinalizingPostAggregator. #2496 if this feature is needed.
  • Maybe I'm not understanding this properly, but is there a use case for this other than ingestion?
  • Did you try "context" : {"finalize" : true} in the inner group by query? It should fix the error; check the comments above for an explanation. This is not very intuitive, and should probably go into the docs.
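
For the nested groupBy case, that workaround would look roughly like this (a sketch only: dimension/metric names are hypothetical, and other required query fields such as dimensions, granularity, and intervals are omitted for brevity; the key part is the finalize flag in the inner query's context):

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "aggregations": [
        { "type": "doubleLast", "name": "last_agg", "fieldName": "some_metric" }
      ],
      "context": { "finalize": true }
    }
  }
}
```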

public AggregatorFactory apply(String input)
{
-      return new JavaScriptAggregatorFactory(input, fieldNames, fnAggregate, fnReset, fnCombine, config);
+      return new JavaScriptAggregatorFactory(input, Lists.newArrayList(input), fnCombine, fnReset, fnCombine, config);
Contributor

Why is this change to JavaScriptAggregatorFactory needed?

Contributor Author

Previously getCombiningFactory was always called on top of getRequiredColumn to get the identity AggregatorFactory needed for copying the javascript aggregator values. Since getCombiningFactory is no longer called on this, the original getRequiredColumn doesn't really make sense for copying values.

Contributor Author

Apart from javascript, the implementation of getRequiredColumn for the other aggregatorFactories seems to work fine without converting to the combiningFactory.

@jon-wei
Contributor

jon-wei commented Aug 23, 2016

LGTM, 👍

@drcrallen
Contributor

@acslk if you can fix the conflicts I can help finish up review.

@gianm gianm modified the milestones: 0.9.3, 0.9.2 Sep 20, 2016
@gianm
Contributor

gianm commented Sep 20, 2016

since this is one of the last few PRs for 0.9.2, as discussed on the call last week, I'll bump it to 0.9.3.

@acslk
Contributor Author

acslk commented Sep 22, 2016

rebased and resolved conflicts

@jon-wei
Contributor

jon-wei commented Oct 13, 2016

Closing this since @acslk is no longer active on Druid development; rebased and re-opened at:

#3566

@jon-wei jon-wei removed this from the 0.9.3 milestone Nov 4, 2016
@gianm
Contributor

gianm commented Sep 21, 2017

Superseded by #3566.

@gianm gianm closed this Sep 21, 2017