Contributing Moving-Average Query to open source.#6430
Contributing Moving-Average Query to open source.#6430jihoonson merged 55 commits intoapache:masterfrom yurmix:moving-average-query
Conversation
|
@nishantmonu51 Thank you! |
…nfiguration parameter.
|
@nishantmonu51 the build passes successfully now. The extension doesn't support useDefaultValueForNull=false yet. In the meantime I had to explicitly turn off the flag in unit tests. (See issue #6472). |
|
@niketh would you like to review? |
|
Very useful feature. Right now, moving averages are doable in SQL via nested |
* Remove NullDimensionSelector. * Apply changes of RequestLogger. * Apply changes of TimelineServerView.
jihoonson
left a comment
There was a problem hiding this comment.
@yurmix thanks for the nice PR! I'm still looking at, but have a question for the general design.
Why do we need a new query type for moving average? Rather, can we generalize Averager and PostAverager to support any types of queries like PostAggregator? This question is related to MovingAverageQueryRunner. It internally generates a query and computes moving average on top of it. I don't think this is necessary, but we can generalize this logic similar to PostAggregator.
jihoonson
left a comment
There was a problem hiding this comment.
I'm still reviewing, but left some more comments.
| { | ||
|
|
||
| private final Collection<DimensionSpec> dims; | ||
| private final Map<Map<String, Object>, Collection<Averager<?>>> averagers = new HashMap<>(); |
There was a problem hiding this comment.
Please add a brief description of what are keys and values.
| public class LongMaxAverager extends BaseAverager<Number, Long> | ||
| { | ||
|
|
||
| private int startFrom = 0; |
There was a problem hiding this comment.
Added a comment (see commit e2a5317).
| private Set<Map<String, Object>> seenKeys = new HashSet<>(); | ||
| private Row saveNext; | ||
| private Map<String, AggregatorFactory> aggMap; | ||
| private Map<String, Object> fakeEvents; |
There was a problem hiding this comment.
Added a comment (see commit 4b425b2).
# Conflicts: # pom.xml
I completely agree the work could be incorporated within the general Query. The reason it was created as a separate query type is for the sake of easier implementation (due to the separation of concerns). Calling the underlying query through an internal API is easier than making those changes directly to |
|
@jihoonson BTW, I have a couple of my own design concerns and I welcome your thoughts on the matter:
|
@yurmix I see. It makes sense to me.
It looks complex, but I think it would be easier to understand if you can add more comments. But, simpler implementation would be great if possible. Would you tell me a bit of details of what kind of refactoring you think?
It sounds good, but I'm wondering we can use our |
…ollowing once DI conflicts with datasketches are resolved.
* Remove unused variables/prarameters.
|
@jihoonson, thanks so much for your effort on this thorough review and sorry it took me that long to complete my response. I have addressed all comments, feel free to review and raise other concerns. |
|
Ping to keep open |
|
@drcrallen thanks for the ping. @yurmix I'll take another look once 0.14.0 release is finalized. |
| private final List<AveragerFactory<?, ?>> factories; | ||
| private final Map<String, PostAggregator> postAggMap; | ||
| private final Map<String, AggregatorFactory> aggMap; | ||
| private final Map<String, Object> fakeEvents; |
| * | ||
| * <p>Usually, the contents of key will be contained by the row R being passed in, but in the case of a | ||
| * dummy row, its possible that the dimensions will be known but the row empty. Hence, the values are | ||
| * passed as two separate arguments. |
There was a problem hiding this comment.
I'm still not sure why it should accept key and r separately instead of accepting MapBasedRow. Would you elaborate more?
There was a problem hiding this comment.
The row's key (only dimensions, no metrics) is required but is not provided by Row's interface.
We use MovingAverageHelper.getDimKeyFromRow(dims, r) to extract the key.
I was able to remove the redundant parameter by calling getDimKeyFromRow inside computeMovingAverage() as well.
| final I[] buckets; | ||
| private int index; | ||
|
|
||
| /* startFrom is needed because `buckets` field is a fixed array, not a list. |
There was a problem hiding this comment.
Hmm, I think we don't use multi line comments widely. How about changing it to javadoc? I think it's more clear anyway.
| @@ -0,0 +1,2 @@ | |||
| druid.processing.buffer.sizeBytes=655360 | |||
There was a problem hiding this comment.
Would you please let me know what tests failed without this file? I think you should be able to set these properties in each test.
There was a problem hiding this comment.
I added the property (druid.processing.buffer.sizeBytes=655360) to MovingAverageQueryTest and removed runtime.properties.
| // standard case. return regular row | ||
| yielder = yielder.next(currentBucket); | ||
| expectedBucket = expectedBucket.plus(period); | ||
| return currentBucket; |
There was a problem hiding this comment.
Hmm, let me add some more details. yielder is updated in this if clause which should be used to iterate all values in it. However, in hasNext(), it only checks expectedBucket is less than endTime. Since expectedBucket is also updated in this if clause, hasNext() can return false even though yielder is not used yet.
|
@yurmix sorry, I've just checked your last update. Most of my last comments were addressed except these two: #6430 (comment), #6430 (comment). Would you please check them? |
Thanks for reminding me, I have addressed one of them and currently reviewing the other. |
The reason we need I think there could be a more elegant way for traversing over two levels (intervals/periods and rows) when one does not directly contain the other, but I won't be able to refactor it for this release. |
jihoonson
left a comment
There was a problem hiding this comment.
I think there could be a more elegant way for traversing over two levels (intervals/periods and rows) when one does not directly contain the other, but I won't be able to refactor it for this release.
It sounds good to me. I don't think this refactoring is strictly required for this PR.
The latest change looks good to me. The CI failure looks a flaky Travis timeout and I just restarted it. +1 after CI.
@yurmix thank you for all your hard work!
|
@jihoonson thanks for your dedication and for the insightful review! |
Implements #6320
Includes documentation and comments but I'd be glad to add more if needed.