Contributing Moving-Average Query to open source. by yurmix · Pull Request #6430 · apache/druid

yurmix · 2018-10-08T21:27:09Z

Implements #6320

Includes documentation and comments but I'd be glad to add more if needed.

yurmix · 2018-10-09T19:34:01Z

@nishantmonu51 Thank you!
Please look at the documentation (.md) for the user's perspective.
Also, please look at the high-level comments in MovingAverageQueryRunner (the main one at class level, plus the inline ones) to walk through the code.

yurmix

The build fails due to useDefaultValueForNull=false (a flag introduced in 0.13.0).
I will update the tests accordingly.

yurmix · 2018-10-15T22:04:55Z

@nishantmonu51 the build passes successfully now. The extension doesn't support useDefaultValueForNull=false yet. In the meantime I had to explicitly turn off the flag in unit tests. (See issue #6472).
Please let me know if you would like to review the feature.
Thanks!

yurmix · 2018-12-03T23:11:31Z

@niketh would you like to review?

peferron · 2019-01-25T19:49:41Z

Very useful feature. Right now, moving averages are doable in SQL via nested SELECTs, but native queries don't support it. This extension would be great to bring native queries up to par. Hope it gets merged soon!

jihoonson

@yurmix thanks for the nice PR! I'm still looking at it, but I have a question about the general design.

Why do we need a new query type for moving average? Could we instead generalize Averager and PostAverager to support any type of query, like PostAggregator? This question is related to MovingAverageQueryRunner: it internally generates a query and computes the moving average on top of it. I don't think this is necessary; we could generalize this logic in a way similar to PostAggregator.

jihoonson

I'm still reviewing, but left some more comments.

jihoonson · 2019-02-22T22:08:53Z

+  {
+
+    private final Collection<DimensionSpec> dims;
+    private final Map<Map<String, Object>, Collection<Averager<?>>> averagers = new HashMap<>();


Please add a brief description of what the keys and values are.

jihoonson · 2019-02-22T23:24:34Z

+public class LongMaxAverager extends BaseAverager<Number, Long>
+{
+
+  private int startFrom = 0;


What does this mean?

Added a comment (see commit e2a5317).

jihoonson · 2019-02-22T23:41:22Z

+    private Set<Map<String, Object>> seenKeys = new HashSet<>();
+    private Row saveNext;
+    private Map<String, AggregatorFactory> aggMap;
+    private Map<String, Object> fakeEvents;


What is fakeEvents?

Added a comment (see commit 4b425b2).

yurmix · 2019-02-23T09:29:13Z

> @yurmix thanks for the nice PR! I'm still looking at it, but I have a question about the general design.
>
> Why do we need a new query type for moving average? Could we instead generalize Averager and PostAverager to support any type of query, like PostAggregator? This question is related to MovingAverageQueryRunner: it internally generates a query and computes the moving average on top of it. I don't think this is necessary; we could generalize this logic in a way similar to PostAggregator.

I completely agree that the work could be incorporated into the general Query. It was created as a separate query type for ease of implementation (separation of concerns): calling the underlying query through an internal API is easier than making those changes directly to Queries or another class. Perhaps refactoring it into the core query framework could be future work?
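To make the "inner query, then averaging on top" shape concrete, here is a minimal sketch. It assumes the inner query has already produced one value per time bucket, and it averages over a trailing window. The class and method names are hypothetical and are not part of the extension. Averaging over fewer buckets at the start of the series is also an assumption of this sketch, not something confirmed in this thread.

```java
/** Hypothetical illustration only; not the extension's MovingAverageQueryRunner. */
final class TrailingMeanSketch
{
  /**
   * perBucket stands in for the per-period results of the underlying query.
   * Returns, for each bucket, the mean of that bucket and up to (window - 1)
   * buckets before it.
   */
  static double[] trailingMean(double[] perBucket, int window)
  {
    if (window <= 0) {
      throw new IllegalArgumentException("window must be positive");
    }
    double[] out = new double[perBucket.length];
    double sum = 0;
    for (int i = 0; i < perBucket.length; i++) {
      sum += perBucket[i];
      if (i >= window) {
        // Drop the value that just slid out of the window.
        sum -= perBucket[i - window];
      }
      // Assumption: early buckets are averaged over the buckets seen so far.
      int n = Math.min(i + 1, window);
      out[i] = sum / n;
    }
    return out;
  }
}
```

For example, `trailingMean(new double[] {2, 4, 6, 8}, 2)` yields 2.0, 3.0, 5.0, 7.0.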

yurmix · 2019-02-23T09:31:53Z

@jihoonson BTW, I have a couple of my own design concerns and I welcome your thoughts on the matter:

1. The complexity of MovingAverageIterable. But I would be careful about changing that just for the sake of refactoring.
2. Each Averager implements its own accumulation formula (computeResults), separate from the formulas of Aggregator. I think this could be refactored to make it easier to add more types of Averagers, and perhaps to align null handling with the new standards.

jihoonson · 2019-02-25T19:26:35Z

> I completely agree that the work could be incorporated into the general Query. It was created as a separate query type for ease of implementation (separation of concerns): calling the underlying query through an internal API is easier than making those changes directly to Queries or another class. Perhaps refactoring it into the core query framework could be future work?

@yurmix I see. It makes sense to me.

> The complexity of MovingAverageIterable. But I would be careful about changing that just for the sake of refactoring.

It looks complex, but I think it would be easier to understand with more comments. That said, a simpler implementation would be great if possible. Could you give me a few details about the kind of refactoring you have in mind?

> Each Averager implements its own accumulation formula (computeResults), separate from the formulas of Aggregator. I think this could be refactored to make it easier to add more types of Averagers, and perhaps to align null handling with the new standards.

It sounds good, but I'm wondering whether we can use our Aggregator for moving average instead of adding a new Averager type. That would be best, because we wouldn't have to implement almost the same aggregation logic again (e.g., DoubleMaxAverager and DoubleMaxAggregator). It is probably possible if we can integrate Accumulator with Aggregator. Or maybe it's even easier if we can use Grouper (like StreamingMergeSortedGrouper, but a simpler, non-thread-safe one).

yurmix · 2019-03-19T20:56:51Z

@jihoonson, thanks so much for your effort on this thorough review, and sorry it took me so long to complete my response. I have addressed all comments; feel free to review and raise other concerns.

drcrallen · 2019-03-25T20:44:44Z

Ping to keep open

jihoonson · 2019-03-25T20:50:07Z

@drcrallen thanks for the ping.

@yurmix I'll take another look once 0.14.0 release is finalized.

jihoonson

@yurmix thank you for the fix. Left some more comments.

jihoonson · 2019-04-10T19:11:49Z

+  private final List<AveragerFactory<?, ?>> factories;
+  private final Map<String, PostAggregator> postAggMap;
+  private final Map<String, AggregatorFactory> aggMap;
+  private final Map<String, Object> fakeEvents;


Maybe emptyEvents?

jihoonson · 2019-04-10T19:12:41Z

+     *
+     * <p>Usually, the contents of key will be contained by the row R being passed in, but in the case of a
+     * dummy row, its possible that the dimensions will be known but the row empty. Hence, the values are
+     * passed as two separate arguments.


I'm still not sure why it should accept key and r separately instead of accepting MapBasedRow. Would you elaborate more?

The row's key (only dimensions, no metrics) is required but is not provided by Row's interface.
We use MovingAverageHelper.getDimKeyFromRow(dims, r) to extract the key.

I was able to remove the redundant parameter by calling getDimKeyFromRow inside computeMovingAverage() as well.
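As a rough illustration of what "the row's key (only dimensions, no metrics)" means, the sketch below projects an event onto its dimension columns and uses the result as a map key for per-group state. It works on plain maps rather than Druid's Row and DimensionSpec types, and the class and method names are hypothetical. It is not the actual MovingAverageHelper.getDimKeyFromRow implementation.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; plain maps stand in for Druid rows. */
final class DimKeySketch
{
  /** Keeps only the dimension columns of an event, dropping metrics. */
  static Map<String, Object> dimKey(List<String> dims, Map<String, Object> event)
  {
    Map<String, Object> key = new LinkedHashMap<>();
    for (String dim : dims) {
      // A dimension missing from the event maps to null.
      key.put(dim, event.get(dim));
    }
    return key;
  }

  /**
   * Groups events by dimension key. A per-key counter stands in for the
   * per-key state (for example, a collection of averagers) a real
   * implementation would keep.
   */
  static Map<Map<String, Object>, Integer> countByKey(
      List<String> dims,
      List<Map<String, Object>> events
  )
  {
    Map<Map<String, Object>, Integer> counts = new HashMap<>();
    for (Map<String, Object> event : events) {
      counts.merge(dimKey(dims, event), 1, Integer::sum);
    }
    return counts;
  }
}
```

Because map equality and hashing are content-based, two events with the same dimension values land on the same key even when their metric values differ.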

jihoonson · 2019-04-10T19:18:02Z

+  final I[] buckets;
+  private int index;
+
+  /* startFrom is needed because `buckets` field is a fixed array, not a list.


Hmm, I think we don't use multi-line comments widely. How about changing it to javadoc? I think it's clearer anyway.
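For readers unfamiliar with the pattern, here is one plausible reading of why a start offset is needed when the window lives in a fixed array: new values overwrite the oldest slot, so reading the window in chronological order has to begin at the next write position and wrap around. This is a minimal sketch under that assumption. The class is hypothetical and is not the extension's BaseAverager.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the extension's BaseAverager. */
final class RingWindowSketch
{
  private final Long[] buckets; // fixed-size storage; null means "not filled yet"
  private int index = 0;        // next slot to write, which is also the oldest slot

  RingWindowSketch(int size)
  {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    buckets = new Long[size];
  }

  void add(long value)
  {
    buckets[index] = value;
    index = (index + 1) % buckets.length;
  }

  /** Window contents from oldest to newest, skipping unfilled slots. */
  List<Long> inOrder()
  {
    List<Long> out = new ArrayList<>();
    int startFrom = index; // the slot about to be overwritten holds the oldest value
    for (int i = 0; i < buckets.length; i++) {
      Long v = buckets[(startFrom + i) % buckets.length];
      if (v != null) {
        out.add(v);
      }
    }
    return out;
  }

  /** Maximum over the current window, or null if nothing has been added. */
  Long max()
  {
    Long best = null;
    for (Long v : inOrder()) {
      if (best == null || v > best) {
        best = v;
      }
    }
    return best;
  }
}
```

With a window of 3, adding 5, 9, 2, 7 leaves the window as 9, 2, 7 in chronological order, even though 7 is stored in slot 0 of the array.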

jihoonson · 2019-04-10T19:26:41Z

@@ -0,0 +1,2 @@
+druid.processing.buffer.sizeBytes=655360


Would you please let me know what tests failed without this file? I think you should be able to set these properties in each test.

I added the property (druid.processing.buffer.sizeBytes=655360) to MovingAverageQueryTest and removed runtime.properties.

jihoonson · 2019-04-10T20:08:59Z

+        // standard case. return regular row
+        yielder = yielder.next(currentBucket);
+        expectedBucket = expectedBucket.plus(period);
+        return currentBucket;


Hmm, let me add some more details. yielder, which should be used to iterate over all of its values, is updated in this if clause. However, hasNext() only checks whether expectedBucket is less than endTime. Since expectedBucket is also updated in this if clause, hasNext() can return false even though yielder still has values left.

jihoonson · 2019-04-24T23:31:08Z

@yurmix sorry, I've just checked your last update. Most of my last comments were addressed except these two: #6430 (comment), #6430 (comment). Would you please check them?

yurmix · 2019-04-25T19:45:33Z

> @yurmix sorry, I've just checked your last update. Most of my last comments were addressed except these two [...]

Thanks for reminding me. I have addressed one of them and am currently reviewing the other.

yurmix · 2019-04-26T17:42:28Z

> Hmm, let me add some more details. yielder, which should be used to iterate over all of its values, is updated in this if clause. However, hasNext() only checks whether expectedBucket is less than endTime. Since expectedBucket is also updated in this if clause, hasNext() can return false even though yielder still has values left.

The reason we need expectedBucket is that RowBucketIterable needs to return empty row buckets for periods with no rows. That's why it iterates over intervals in addition to the row sequence.
I also added a check on yielder to hasNext(), in addition to the one on expectedBucket.

I think there could be a more elegant way of traversing two levels (intervals/periods and rows) when one does not directly contain the other, but I won't be able to refactor it for this release.
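The two-level traversal described here can be sketched as follows. The iterator walks fixed-size periods between a start and an end time, emits an empty bucket for any period without rows, and reports more elements while either periods or rows remain. This is an illustration under simplifying assumptions: rows are reduced to sorted timestamps, the period is positive, and the class name is hypothetical. It is not the extension's RowBucketIterable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration only; timestamps stand in for rows. */
final class GapFillingBucketsSketch implements Iterator<List<Long>>
{
  private final Iterator<Long> rows; // row timestamps, assumed sorted ascending
  private final long end;
  private final long period;         // assumed positive
  private long expectedBucket;       // start of the next period to emit
  private Long pending;              // next unconsumed row, or null if exhausted

  GapFillingBucketsSketch(List<Long> sortedTimestamps, long start, long end, long period)
  {
    this.rows = sortedTimestamps.iterator();
    this.expectedBucket = start;
    this.end = end;
    this.period = period;
    this.pending = rows.hasNext() ? rows.next() : null;
  }

  @Override
  public boolean hasNext()
  {
    // Checking only expectedBucket could drop rows that are still pending.
    return expectedBucket < end || pending != null;
  }

  @Override
  public List<Long> next()
  {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    List<Long> bucket = new ArrayList<>();
    long bucketEnd = expectedBucket + period;
    // Consume every row that falls before the end of the current period.
    while (pending != null && pending < bucketEnd) {
      bucket.add(pending);
      pending = rows.hasNext() ? rows.next() : null;
    }
    expectedBucket = bucketEnd;
    return bucket; // may be empty: a period with no rows
  }
}
```

With timestamps 0, 1, 25 over the range 0 to 30 and a period of 10, the iterator yields the buckets [0, 1], [], and [25].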

jihoonson

> I think there could be a more elegant way of traversing two levels (intervals/periods and rows) when one does not directly contain the other, but I won't be able to refactor it for this release.

It sounds good to me. I don't think this refactoring is strictly required for this PR.

The latest change looks good to me. The CI failure looks like a flaky Travis timeout, and I just restarted it. +1 after CI.

@yurmix thank you for all your hard work!

yurmix · 2019-04-27T02:55:29Z

@jihoonson thanks for your dedication and for the insightful review!

yurmix added 2 commits October 8, 2018 14:22

Contributing Moving-Average Query to open source.

8abe7a5

Fix failing code inspections.

ef7e0e6

nishantmonu51 self-assigned this Oct 9, 2018

See if explicit types will invoke the correct comparison function.

136eb93

yurmix commented Oct 9, 2018

fjy added this to the 0.13.1 milestone Oct 10, 2018

Explicitly remove support for druid.generic.useDefaultValueForNull co…

1d98847

…nfiguration parameter.

yurmix mentioned this pull request Oct 15, 2018

Add support for SQL-compatible null handling to movingAverage query #6472

Open

drcrallen reviewed Jan 30, 2019

Comment thread extensions-contrib/moving-average-query/pom.xml Outdated

drcrallen reviewed Jan 30, 2019

Comment thread .../main/java/org/apache/druid/query/movingaverage/DefaultMovingAverageQueryMetricsFactory.java

yurmix and others added 7 commits January 29, 2019 16:50

Merge branch 'master' into moving-average-query

63cf6ee

Merge branch 'master' into moving-average-query

7bbfc6d

Update styling and headers for complience.

7b71a26

Refresh code with latest master changes:

6635751

* Remove NullDimensionSelector. * Apply changes of RequestLogger. * Apply changes of TimelineServerView.

Small checkstyle fix.

b731780

Checkstyle fixes.

95b803a

Fixing rat errors; Teamcity errors.

a21a3ce

jon-wei removed this from the 0.14.0 milestone Feb 5, 2019

jihoonson added the Feature label Feb 21, 2019

jihoonson reviewed Feb 22, 2019

Merge branch 'master' into moving-average-query

66daabf

# Conflicts: # pom.xml

Removing support theta sketches. Will be added back in this pr or a f…

9591a9d

…ollowing once DI conflicts with datasketches are resolved.

yurmix added 4 commits March 18, 2019 17:47

* internalNext() should return null instead of throwing exception.

6708720

* Remove unused variables/prarameters.

Harden MovingAverageIterableTest (Switch anyOf to exact match).

fa3fbbc

Change internalNext() from recursion to iteration; Simplify next() an…

304c43d

…d hasNext().

Remove unused imports.

81d0909

jihoonson reviewed Apr 10, 2019

jon-wei mentioned this pull request Apr 10, 2019

druid-orc-extensions hadoop-common dependency is broken #7438

Closed

yurmix and others added 5 commits April 10, 2019 15:51

Merge branch 'master' into moving-average-query

716665e

Address review comments.

1c577ae

Rename fakeEvents to emptyEvents.

ab1ae00

Merge branch 'master' into moving-average-query

295916d

Merge branch 'master' of github.com:apache/incubator-druid into movin…

f26c2f6

…g-average-query

Remove redundant parameter key from computeMovingAverage.

7b6f56e

yurmix added 4 commits April 25, 2019 13:21

Merge branch 'master' of github.com:apache/incubator-druid into movin…

167efa2

…g-average-query

Check yielder as well in RowBucketIterable#hasNext()

8fccf19

Fix javadoc.

001b061

Merge branch 'master' of github.com:apache/incubator-druid into movin…

ac43cc0

…g-average-query

jihoonson approved these changes Apr 26, 2019

jihoonson merged commit f02251a into apache:master Apr 27, 2019

yurmix mentioned this pull request Apr 27, 2019

Moving Average query type #6320

Closed

6 tasks

yurmix deleted the moving-average-query branch May 10, 2019 19:47

jihoonson added the Release Notes label Jun 6, 2019

jihoonson mentioned this pull request Jun 8, 2019

0.15.0-incubating release notes #7854

Closed

bjornm82 mentioned this pull request Jun 30, 2019

Druid moving average query results in circular reference error #7999

Closed
