numeric quantiles sketch aggregator#5002
Conversation
|
Is anything wrong with the build configuration? |
@leventov do you know why this one is failing? |
|
@AlexanderSaydakov could you please try to update |
|
no, plexus-compiler-javac-errorprone-2.8.2 doesn't help |
|
I see that error_prone_core is a few versions behind, but all versions newer than currently used 2.0.19 are not happy about something in druid/java-util |
|
@AlexanderSaydakov currently teamcity appears to be failing due to foreach related rules , see you can find above in the "Code Inspection" tab |
|
I have a couple of questions:
|
|
| return sketch; | ||
| } | ||
| catch (NumberFormatException e) { | ||
| // Log.info("Expected Double. Got string with value " + |
There was a problem hiding this comment.
Consider adding Log.debug saying expected was Double but found base64Encoded string.
There was a problem hiding this comment.
why debug? should we use log.error? this should never happen, but it is not fatal either
There was a problem hiding this comment.
after a discussion with @akashdw we believe that it is better to throw an exception here
| @Override | ||
| public void deserializeColumn(final ByteBuffer buffer, final ColumnBuilder builder) | ||
| { | ||
| final GenericIndexed<DoublesSketch> column = GenericIndexed.read(buffer, strategy); |
There was a problem hiding this comment.
use GenericIndexed.read(buffer, strategy, builder.getFileMapper()) similar to thetaSketch to enable largeColumns. Also Override getSerializer method as
@Override
public GenericColumnSerializer getSerializer(IOPeon peon, String column)
{
return LargeColumnSupportedComplexColumnSerializer.create(peon, column, this.getObjectStrategy());
}
| import io.druid.query.aggregation.Aggregator; | ||
| import io.druid.segment.ColumnValueSelector; | ||
|
|
||
| public class DoublesSketchDoubleAggregator implements Aggregator |
There was a problem hiding this comment.
Consider renaming this class to DoublesQuantileSketchAggregator
There was a problem hiding this comment.
I am not sure about adding Quantiles into the class names. This is implied by the package name, and class names are long already. Regarding the second "Double", I believe the intention was to say that this aggregator works on double values as input (as opposed to sketches). I propose to rename it to BuildAggregator, and rename the other one to MergeAggregator (or leave Combining, but switch the words around for consistency)
| import it.unimi.dsi.fastutil.ints.Int2ObjectMap; | ||
| import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap; | ||
|
|
||
| public class DoublesSketchDoubleBufferAggregator implements BufferAggregator |
There was a problem hiding this comment.
Consider renaming this class to DoublesQuantileSketchBufferAggregator
| private final IdentityHashMap<ByteBuffer, WritableMemory> memCache = new IdentityHashMap<>(); | ||
| private final IdentityHashMap<ByteBuffer, Int2ObjectMap<UpdateDoublesSketch>> sketches = new IdentityHashMap<>(); | ||
|
|
||
| public DoublesSketchDoubleBufferAggregator(final ColumnValueSelector<Double> valueSelector, final int size, |
There was a problem hiding this comment.
please add a comment explaining sketch can grow on-heap also, also explain what happens in relocation when sketch grows on-heap.
| import io.druid.initialization.DruidModule; | ||
| import io.druid.segment.serde.ComplexMetrics; | ||
|
|
||
| public class DoublesSketchModule implements DruidModule |
There was a problem hiding this comment.
Consider renaming it to DoublesQuantileSketchModule or QuantilesDoublesSketch
|
I believe that I addressed all reviewers' suggestions so far. How can we proceed? Thank you. |
jihoonson
left a comment
There was a problem hiding this comment.
@AlexanderSaydakov nice work! I left some comments. Please consider them.
| private final String name; | ||
| private final String fieldName; | ||
| private final int k; | ||
| private final byte cacheTypeId; |
There was a problem hiding this comment.
Looks this is always AggregatorUtil.QUANTILES_DOUBLES_SKETCH_BUILD_CACHE_TYPE_ID and don't have to be a member variable.
There was a problem hiding this comment.
no, DoublesSketchMergeAggregatorFactory overrides it
| if (metricFactory.getColumnCapabilities(fieldName) != null | ||
| && ValueType.isNumeric(metricFactory.getColumnCapabilities(fieldName).getType())) { | ||
| final ColumnValueSelector<Double> valueSelector = metricFactory.makeColumnValueSelector(fieldName); | ||
| if (valueSelector == null) { |
There was a problem hiding this comment.
I'm curious when this can be null. I couldn't find it.
There was a problem hiding this comment.
I it seems to me that this can happen if a non-existent field is mentioned as the input. I am not convinced myself that we really need this special no-op aggregator. This is how it was done in the Theta sketch aggregator. I think that it is not worth optimizing an erroneous query.
There was a problem hiding this comment.
In Druid, NilColumnValueSelector is returned for non-existent input fields. I think it's worthwhile to optimize, but we already have NoopAggregator and NoopBufferAggregator, and you can use them instead of adding new ones.
There was a problem hiding this comment.
I believe I tested this case and got a null for selector. Regarding the dummy aggregators, I believe the idea was to always return a sketch, even an empty one. The custom dummy aggregators do just that.
There was a problem hiding this comment.
@jihoonson sketches post aggs expect a sketch object, not sure how a sketches post agg will behave if we use NoopAggregator and NoopBufferAggregator. This check was added considering you can have a sketch field in some segments(say I added a new sketch column from today onwards) but past data does not have that field.
There was a problem hiding this comment.
If it's nullable, every other aggregatorFactory should consider it too, but they don't. As I said, if a segment doesn't have a specified column, NilColumnValueSelector is returned instead of null.
There was a problem hiding this comment.
Yes, we should check for NilColumnValueSelector but will continue to return DoublesSketchNoOpAggregator as quantile postAggregators expect sketch values.
There was a problem hiding this comment.
You mean returning DoublesSketchNoOpAggregator for NilColumnValueSelector? It makes sense to me.
| if (selector == null) { | ||
| return new DoublesSketchNoOpAggregator(); | ||
| } | ||
| return new DoublesSketchMergeAggregator(selector, k); |
There was a problem hiding this comment.
I wonder why DoublesSketchAggregatorFactory is able to return the DoublesSketchMergeAggregator. I guess DoublesSketchMergeAggregatorFactory is to get mergeAggregator and DoublesSketchAggregatorFactory is for a plain aggregator.
There was a problem hiding this comment.
This is a common entry point associated with the type name. If the input field is a numeric field, then so-called "build" aggregator is used to build sketches. Otherwise, we assume that the input contains sketches to merge.
There was a problem hiding this comment.
Druid calls combine() or gets combiningFactories by calling getCombiningFactory() for merging aggregates. Calling DoublesSketchAggregatorFactory.factorize() should not happen for merging aggregates.
There was a problem hiding this comment.
Perhaps, I did not make myself clear. By merging sketches I meant such an aggregation in which the input field contains sketches as opposed to the raw values. I used to think about this as having two modes: building sketches (from raw input) and merging sketches. The closest thing to this is Theta sketch aggregator, with a twist that it cannot autodetect the input type (here the input is numeric, but there it can be of almost any type).
There was a problem hiding this comment.
Ah, sorry I misunderstood. Sounds good.
| if (selector == null) { | ||
| return new DoublesSketchNoOpBufferAggregator(); | ||
| } | ||
| return new DoublesSketchMergeBufferAggregator(selector, k, getMaxIntermediateSize()); |
There was a problem hiding this comment.
Similar comment here. Better to consider only a plain buffer aggregator.
There was a problem hiding this comment.
Plain? Here is the same selection of an aggregator based on the input type, just for the buffered case.
There was a problem hiding this comment.
Sorry, I used 'plain' for non-mergingAggregatorFactory. Similar to the above comment, Calling DoublesSketchAggregatorFactory.factorizeBuffered() should not happen for merging aggregates.
There was a problem hiding this comment.
same as above
| new DoublesSketchAggregatorFactory( | ||
| fieldName, | ||
| fieldName, | ||
| k)); |
There was a problem hiding this comment.
Please break this line like
k
)
);There was a problem hiding this comment.
I am using the Druid formatter for Eclipse, which is a part of this repo (eclipse_formatting.xml)
There was a problem hiding this comment.
Yeah, sorry but it can't catch everything.
There was a problem hiding this comment.
ok, will change
| } else if (serializedSketch instanceof DoublesSketch) { | ||
| return (DoublesSketch) serializedSketch; | ||
| } | ||
| throw new IllegalStateException( |
There was a problem hiding this comment.
We usually use ISE() because it supports string format.
| return splitPoints; | ||
| } | ||
|
|
||
| // comparing histograms doesn't make much sense, so this comparator pretends that everything is equal |
There was a problem hiding this comment.
I think it's better to throw an exception.
BTW, ApproximateHistogramPostAggregator returns a comparator comparing histogram's count. I'm not sure the same approach is possible for this class.
There was a problem hiding this comment.
I think I was told that throwing here would be a bad idea. And comparing by total count doesn't make much sense. I think we can just do nothing.
There was a problem hiding this comment.
Hmm, would you tell me why it's a bad idea? I think it's better because getComparator() is used for limitting the number of results like in TopN query, and users can get completely different results even though they run the same query multiple times.
Well, a better way is to check LimitSpec is specified with DoublesSketchToHistogramPostAggregator together before query execution, but I think it's beyond the scope of this PR.
There was a problem hiding this comment.
I heard that there are some systems like dashboard frontends and such, which generate queries automatically, and always use some ordering. We don't want to break compatibility with such systems. And providing some ordering, which doesn't make much sense, also sounds like a bad idea.
There was a problem hiding this comment.
I agree on that providing some ordering doesn't make sense.
But, I think there is no compatibility issue because this is a new feature and the systems generating Druid queries can change their logic to not include ordering for this new postAggregator.
There was a problem hiding this comment.
Yes, this particular aggregator is a new feature, but we don't want it to have some properties, which would be obstacles to integration with other systems like Hive or Pivot. These systems just assume some ordering by default. I don't think we are in a position to dictate the rules.
There was a problem hiding this comment.
@jihoonson not comparable does not means its an error, IMO it means we can expect random ordered set instead of sorted ordered set.
Some clients (including pivot) add a default limit spec with order by on the selected metric, not sure if we want to fail the request or return a random ordered set ?
There was a problem hiding this comment.
What I'm concerned with is, when people get a result of a query with ordering and a limit, they expect an ordered top-n result. But we cannot guarantee that the result is ordered in some order with this post aggregator, so I think this is an error of unsupported feature.
I understand what you guys are concerned with, but I think we can help the clients to do the right thing by like explicitly specifying that ordering by this post aggregator is not supported in the release note.
There was a problem hiding this comment.
Yes, this particular aggregator is a new feature, but we don't want it to have some properties, which would be obstacles to integration with other systems like Hive or Pivot.
I don't think this is an obstacle to other ecosystems of Druid. This post aggregator is newly added in this patch, and they can add some special logic to handle this post aggregator when they decide to support it.
These systems just assume some ordering by default. I don't think we are in a position to dictate the rules.
I'm not sure why you think so. Every system evolves as time goes by, and their ecosystems should be evolved together. Assuming every result can be ordered may be true so far, but it becomes wrong with this post aggregator. There is a recent example similar to this case. We recently added support for numeric dimensions. Druid's ecosystems can assume that every dimension has the string type before, but now it's a wrong assumption.
| return sketch.getQuantiles(fractions); | ||
| } | ||
|
|
||
| // comparing arrays of quantiles doesn't make much sense, so this comparator |
There was a problem hiding this comment.
I don't see how comparing these arrays would make any sense at all.
| return sketch.toString(); | ||
| } | ||
|
|
||
| // comparing sketch summaries doesn't make much sense, so this comparator |
There was a problem hiding this comment.
Again, I don't see how comparing sketch summaries can be helpful.
|
|
||
| import com.yahoo.sketches.quantiles.UpdateDoublesSketch; | ||
|
|
||
| public class GenerateTestData |
There was a problem hiding this comment.
Hmm, do you think we may need to modify test data someday? It should be rare because we should also fix the expected results in unit tests accordingly.
If you think it's needed, please add some comments on this class to let others know this is used for generating test data for DoublesSketchAggregatorTest.
|
@AlexanderSaydakov please comment when this is ready for another round of review. This looks like a very useful feature. |
|
I believe it is ready. I addressed the last two points: NilColumnValueSelector instead of null and throwing exceptions instead of providing dummy comparators |
|
@AlexanderSaydakov thanks for the update. I'll finish my review soon. |
|
@AlexanderSaydakov the latest change looks good to me. Would you please add some documents for these new aggregators? Please see https://github.com/druid-io/druid/blob/master/docs/content/development/extensions-core/datasketches-aggregators.md as an example. Also please add this new extension to the core extensions list here. |
|
This is not a new extension, but a part of the existing datasketches extension. We need to rewrite the document so that it would no longer assume datasketches means Theta sketch aggregator for approximate count-distinct. We are going to have more soon: new HLL sketch aggregator and Tuple sketch aggregator (ArrayOfDoubles). I have them almost ready, just need to brush up based on this review of Quantiles sketch aggregator. |
|
You are right. Do you want to rewrite the document at once after merging all your patches? It sounds good to me. |
|
@drcrallen @himanshug @leventov @gianm @akashdw do you have further comments? If not, I'm going to merge this PR. |
|
looks like there's a conflict with this PR that removed the IOPeon class: #4762 |
|
@jon-wei thanks. Raised a PR. |
|
@jihoonson thanks for reviewing it ... LGTM . We actually have had this code internally reviewed and used a bit beforehand. |
|
@himanshug good to know. This patch is awesome. I'm looking forward to the follow-up patches. |
|
@AlexanderSaydakov Can you provide a doc update for this patch? |
| { | ||
| if (metricFactory.getColumnCapabilities(fieldName) != null | ||
| && ValueType.isNumeric(metricFactory.getColumnCapabilities(fieldName).getType())) { | ||
| final ColumnValueSelector<Double> selector = metricFactory.makeColumnValueSelector(fieldName); |
There was a problem hiding this comment.
This variable should have type BaseDoubleColumnValueSelector, as well as all the way down.
| } | ||
| return new DoublesSketchBuildAggregator(selector, k); | ||
| } | ||
| final ColumnValueSelector<DoublesSketch> selector = metricFactory.makeColumnValueSelector(fieldName); |
There was a problem hiding this comment.
Should use BaseObjectColumnValueSelector<DoublesSketch>, as well as all the way down
| return (DoublesSketch) serializedSketch; | ||
| } | ||
| throw new ISE( | ||
| "Object is not of a type that can be deserialized to a quantiles DoublsSketch: " |
There was a problem hiding this comment.
Use String.format format instead of concat
|
Would someone be able to compare/contrast using this doubles sketch aggregator versus the existing approxHistogramFold aggregator? thanks |
|
I didn't study the approximateHistogram in detail. I see the following
claims in the documentation: "there are no formal error bounds on the
approximation" and "the algorithm only works well if the data is randomly
distributed" (horrible approximation for sorted input). Both these things
trigger a loud alarm in my head. I would go so far as to say that any
approximate method is useless if you don't know how accurate the result is.
…On Tue, Jan 30, 2018 at 8:07 AM, Kyle Boyle ***@***.***> wrote:
Would someone be able to compare/contrast using this doubles sketch
aggregator versus the existing approxHistogramFold aggregator? thanks
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5002 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMhMHolFwTSAbOmb_uzr4LyCbAikjlwbks5tPz48gaJpZM4QE9qL>
.
|
|
@AlexanderSaydakov can you provide docs for this feature? I'd like to provide a link to them in the 0.12.0 release notes |
|
I need to work on the docs. So far the datasketches module had the
count-distinct Theta sketch aggregator. Now we have the quantiles sketch,
tuple sketch pull request and HLL sketch (also count-distinct, but super
compact) ready to be submitted.
The page for Theta sketch aggregator is already quite big, perhaps we need
to have an index page and separate pages for each sketch type.
…On Tue, Jan 30, 2018 at 3:40 PM, Jonathan Wei ***@***.***> wrote:
@AlexanderSaydakov <https://github.com/alexandersaydakov> can you provide
docs for this feature? I'd like to provide a link to them in the 0.12.0
release notes
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5002 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMhMHrP5Ph1XW3q9-XyGWmhoB2LdGvOAks5tP6hkgaJpZM4QE9qL>
.
|

This is to support numeric quantiles sketch (DoublesSketch) to estimate distributions: obtain quantiles, ranks, probability mass functions (PMFs and CDFs) or histograms.