numeric quantiles sketch aggregator by AlexanderSaydakov · Pull Request #5002 · apache/druid

AlexanderSaydakov · 2017-10-24T19:26:47Z

This adds support for the numeric quantiles sketch (DoublesSketch) to estimate distributions: obtain quantiles, ranks, probability mass functions (PMFs), cumulative distribution functions (CDFs), or histograms.
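
For readers unfamiliar with the terms, here is a small illustration of what rank, quantile, CDF and PMF mean. It computes them exactly on a sorted array rather than with a sketch; the class and method names are hypothetical and are not part of the DataSketches library or this PR. A quantiles sketch estimates the same quantities in bounded memory.

```java
// Exact reference computations on a sorted array (illustration only, not a sketch).
final class ExactDistribution {
  // Fraction of values strictly less than x.
  static double rank(double[] sorted, double x) {
    int count = 0;
    for (double v : sorted) {
      if (v < x) {
        count++;
      }
    }
    return (double) count / sorted.length;
  }

  // Value found at the given fraction (0..1) of the sorted data.
  static double quantile(double[] sorted, double fraction) {
    int index = (int) Math.min(sorted.length - 1, Math.floor(fraction * sorted.length));
    return sorted[index];
  }

  // CDF at each split point, plus a final entry of 1.0 for everything above the last split.
  static double[] cdf(double[] sorted, double[] splitPoints) {
    double[] result = new double[splitPoints.length + 1];
    for (int i = 0; i < splitPoints.length; i++) {
      result[i] = rank(sorted, splitPoints[i]);
    }
    result[splitPoints.length] = 1.0;
    return result;
  }

  // PMF: the mass in each interval between split points, i.e. differences of the CDF.
  static double[] pmf(double[] sorted, double[] splitPoints) {
    double[] c = cdf(sorted, splitPoints);
    double[] result = new double[c.length];
    result[0] = c[0];
    for (int i = 1; i < c.length; i++) {
      result[i] = c[i] - c[i - 1];
    }
    return result;
  }
}
```

A histogram in this sense is the PMF multiplied by the number of values.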

AlexanderSaydakov · 2017-10-24T22:10:50Z

Is anything wrong with the build configuration?

drcrallen · 2017-10-25T13:56:33Z

[INFO] Compiling 36 source files to /home/travis/build/druid-io/druid/extensions-core/datasketches/target/classes
/home/travis/build/druid-io/druid/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SynchronizedUnion.java:71: error: [ParameterPackage] Method parameter has wrong package
 public synchronized void update(byte[] data)
 ^
 (see http://errorprone.info/bugpattern/ParameterPackage)
 Did you mean 'public synchronized void update(Array data)'?
/home/travis/build/druid-io/druid/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SynchronizedUnion.java:77: error: [ParameterPackage] Method parameter has wrong package
 public synchronized void update(int[] data)
 ^
 (see http://errorprone.info/bugpattern/ParameterPackage)
 Did you mean 'public synchronized void update(Array data)'?
/home/travis/build/druid-io/druid/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SynchronizedUnion.java:83: error: [ParameterPackage] Method parameter has wrong package
 public synchronized void update(char[] chars)
 ^
 (see http://errorprone.info/bugpattern/ParameterPackage)
 Did you mean 'public synchronized void update(Array chars)'?
Note: /home/travis/build/druid-io/druid/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SketchMergeComplexMetricSerde.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

@leventov do you know why this one is failing?

leventov · 2017-10-25T18:06:38Z

@AlexanderSaydakov could you please try to update plexus-compiler-javac-errorprone dependency in pom.xml to 2.8.2 and compile without @SuppressWarnings?

AlexanderSaydakov · 2017-10-25T18:47:39Z

no, plexus-compiler-javac-errorprone-2.8.2 doesn't help

AlexanderSaydakov · 2017-10-25T18:59:45Z

I see that error_prone_core is a few versions behind, but all versions newer than currently used 2.0.19 are not happy about something in druid/java-util

himanshug · 2017-10-30T19:37:01Z

@AlexanderSaydakov currently TeamCity appears to be failing due to foreach-related rules, see

You can find the above in the "Code Inspection" tab.

AlexanderSaydakov · 2017-10-31T20:08:36Z

I have a couple of questions:

Is the synchronization in aggregators still necessary? I looked at the Theta sketch aggregator as an example. There, a SynchronizedUnion has been used for many years. I was shown this: https://github.com/druid-io/druid/pull/1027/files, which seems to suggest that Druid takes care of synchronization. If so, I can remove synchronization in this aggregator.
I hard-coded cache IDs in factories and post-aggregators. Perhaps they should be kept in a central registry somewhere instead.

gianm · 2017-11-01T17:34:24Z

@AlexanderSaydakov

Druid synchronizes calls to "aggregate", but it can call "aggregate" and "get" simultaneously. So you do need to make sure that this combination is thread-safe. I'm not too familiar with the theta sketch code, but that could be the reason that it uses synchronization.
For core extensions you can put the cache key ID in AggregatorUtil (for aggregator factories) and PostAggregatorIds (for post aggregators). Theta sketches already do that.
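
The thread-safety point above can be sketched as follows. This is a toy stand-in, assuming only what the comment says (serialized "aggregate" calls, but "get" possibly running at the same time); the class name is hypothetical and a running sum takes the place of the real sketch state.

```java
// Toy stand-in for a sketch-backed aggregator; only the locking pattern is the point.
final class SynchronizedSumAggregator {
  private double sum;  // stands in for the mutable sketch state
  private long count;

  // aggregate() calls are serialized by the caller, but get() may run concurrently with one,
  // so both methods take the same lock.
  public synchronized void aggregate(double value) {
    sum += value;
    count++;
  }

  // Returns a consistent snapshot {count, sum}; without the lock a reader
  // could observe sum updated but count not yet incremented.
  public synchronized double[] get() {
    return new double[] {count, sum};
  }
}
```

Locking every method that reads or modifies the state is the simplest option; whether a finer-grained scheme would be worthwhile is not something this thread settles.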

akashdw · 2017-11-02T18:11:07Z

+              return sketch;
+            }
+            catch (NumberFormatException e) {
+              // Log.info("Expected Double. Got string with value " +


Consider adding Log.debug saying expected was Double but found base64Encoded string.

why debug? should we use log.error? this should never happen, but it is not fatal either

after a discussion with @akashdw we believe that it is better to throw an exception here

akashdw · 2017-11-02T18:14:13Z

+  @Override
+  public void deserializeColumn(final ByteBuffer buffer, final ColumnBuilder builder)
+  {
+    final GenericIndexed<DoublesSketch> column = GenericIndexed.read(buffer, strategy);


Use GenericIndexed.read(buffer, strategy, builder.getFileMapper()), as thetaSketch does, to enable large columns. Also override the getSerializer method as follows:

  @Override
  public GenericColumnSerializer getSerializer(IOPeon peon, String column)
  {
    return LargeColumnSupportedComplexColumnSerializer.create(peon, column, this.getObjectStrategy());
  }

akashdw · 2017-11-02T18:16:50Z

+import io.druid.query.aggregation.Aggregator;
+import io.druid.segment.ColumnValueSelector;
+
+public class DoublesSketchDoubleAggregator implements Aggregator


Consider renaming this class to DoublesQuantileSketchAggregator

I am not sure about adding Quantiles into the class names. This is implied by the package name, and class names are long already. Regarding the second "Double", I believe the intention was to say that this aggregator works on double values as input (as opposed to sketches). I propose to rename it to BuildAggregator, and rename the other one to MergeAggregator (or leave Combining, but switch the words around for consistency)

akashdw · 2017-11-02T18:17:11Z

+import it.unimi.dsi.fastutil.ints.Int2ObjectMap;
+import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap;
+
+public class DoublesSketchDoubleBufferAggregator implements BufferAggregator


Consider renaming this class to DoublesQuantileSketchBufferAggregator

akashdw · 2017-11-02T18:19:44Z

+  private final IdentityHashMap<ByteBuffer, WritableMemory> memCache = new IdentityHashMap<>();
+  private final IdentityHashMap<ByteBuffer, Int2ObjectMap<UpdateDoublesSketch>> sketches = new IdentityHashMap<>();
+
+  public DoublesSketchDoubleBufferAggregator(final ColumnValueSelector<Double> valueSelector, final int size,


Please add a comment explaining that the sketch can also grow on-heap, and explain what happens in relocation when the sketch grows on-heap.

akashdw · 2017-11-02T18:22:12Z

+import io.druid.initialization.DruidModule;
+import io.druid.segment.serde.ComplexMetrics;
+
+public class DoublesSketchModule implements DruidModule


Consider renaming it to DoublesQuantileSketchModule or QuantilesDoublesSketch

AlexanderSaydakov · 2017-11-08T20:48:31Z

I believe that I addressed all reviewers' suggestions so far. How can we proceed? Thank you.

jihoonson

@AlexanderSaydakov nice work! I left some comments. Please consider them.

jihoonson · 2017-11-19T01:45:28Z

+  private final String name;
+  private final String fieldName;
+  private final int k;
+  private final byte cacheTypeId;


Looks like this is always AggregatorUtil.QUANTILES_DOUBLES_SKETCH_BUILD_CACHE_TYPE_ID, so it doesn't have to be a member variable.

no, DoublesSketchMergeAggregatorFactory overrides it

jihoonson · 2017-11-19T01:53:10Z

+    if (metricFactory.getColumnCapabilities(fieldName) != null
+        && ValueType.isNumeric(metricFactory.getColumnCapabilities(fieldName).getType())) {
+      final ColumnValueSelector<Double> valueSelector = metricFactory.makeColumnValueSelector(fieldName);
+      if (valueSelector == null) {


I'm curious when this can be null. I couldn't find it.

It seems to me that this can happen if a non-existent field is mentioned as the input. I am not convinced myself that we really need this special no-op aggregator. This is how it was done in the Theta sketch aggregator. I think that it is not worth optimizing an erroneous query.

In Druid, NilColumnValueSelector is returned for non-existent input fields. I think it's worthwhile to optimize, but we already have NoopAggregator and NoopBufferAggregator, and you can use them instead of adding new ones.

I believe I tested this case and got a null for selector. Regarding the dummy aggregators, I believe the idea was to always return a sketch, even an empty one. The custom dummy aggregators do just that.

@jihoonson sketches post aggs expect a sketch object, not sure how a sketches post agg will behave if we use NoopAggregator and NoopBufferAggregator. This check was added considering you can have a sketch field in some segments(say I added a new sketch column from today onwards) but past data does not have that field.

If it's nullable, every other aggregatorFactory should consider it too, but they don't. As I said, if a segment doesn't have a specified column, NilColumnValueSelector is returned instead of null.

Yes, we should check for NilColumnValueSelector but will continue to return DoublesSketchNoOpAggregator as quantile postAggregators expect sketch values.

You mean returning DoublesSketchNoOpAggregator for NilColumnValueSelector? It makes sense to me.

jihoonson · 2017-11-19T02:03:03Z

+    if (selector == null) {
+      return new DoublesSketchNoOpAggregator();
+    }
+    return new DoublesSketchMergeAggregator(selector, k);


I wonder why DoublesSketchAggregatorFactory is able to return the DoublesSketchMergeAggregator. I guess DoublesSketchMergeAggregatorFactory is to get mergeAggregator and DoublesSketchAggregatorFactory is for a plain aggregator.

This is a common entry point associated with the type name. If the input field is a numeric field, then so-called "build" aggregator is used to build sketches. Otherwise, we assume that the input contains sketches to merge.

Druid calls combine() or gets combiningFactories by calling getCombiningFactory() for merging aggregates. Calling DoublesSketchAggregatorFactory.factorize() should not happen for merging aggregates.

Perhaps, I did not make myself clear. By merging sketches I meant such an aggregation in which the input field contains sketches as opposed to the raw values. I used to think about this as having two modes: building sketches (from raw input) and merging sketches. The closest thing to this is Theta sketch aggregator, with a twist that it cannot autodetect the input type (here the input is numeric, but there it can be of almost any type).

Ah, sorry I misunderstood. Sounds good.
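
The selection logic discussed in the two threads above (a no-op aggregator for a missing column, "build" for numeric input, "merge" for sketch input) can be sketched as below. All types here are hypothetical, simplified stand-ins and not Druid's real interfaces; NilSelector plays the role of NilColumnValueSelector.

```java
// Hypothetical, simplified stand-ins; these are not Druid's real interfaces.
interface Selector {
  Object get();
}

// Plays the role of NilColumnValueSelector: returned when the column does not exist.
final class NilSelector implements Selector {
  static final NilSelector INSTANCE = new NilSelector();

  @Override
  public Object get() {
    return null;
  }
}

enum AggregatorKind { NO_OP, BUILD, MERGE }

final class AggregatorChooser {
  // One entry point for the type name; the input column decides which aggregator is used.
  static AggregatorKind choose(Selector selector, boolean columnIsNumeric) {
    if (selector instanceof NilSelector) {
      // Missing column: a no-op aggregator that still yields an empty sketch,
      // because the post-aggregators expect a sketch object.
      return AggregatorKind.NO_OP;
    }
    if (columnIsNumeric) {
      return AggregatorKind.BUILD;  // raw numeric values: build a sketch
    }
    return AggregatorKind.MERGE;    // otherwise assume the column holds sketches to merge
  }
}
```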

jihoonson · 2017-11-19T02:03:43Z

+    if (selector == null) {
+      return new DoublesSketchNoOpBufferAggregator();
+    }
+    return new DoublesSketchMergeBufferAggregator(selector, k, getMaxIntermediateSize());


Similar comment here. Better to consider only a plain buffer aggregator.

Plain? Here is the same selection of an aggregator based on the input type, just for the buffered case.

Sorry, I used 'plain' for non-mergingAggregatorFactory. Similar to the above comment, Calling DoublesSketchAggregatorFactory.factorizeBuffered() should not happen for merging aggregates.

same as above

Sounds good.

jihoonson · 2017-11-19T02:12:52Z

+        new DoublesSketchAggregatorFactory(
+            fieldName,
+            fieldName,
+            k));


Please break this line like

            k
        )
    );

I am using the Druid formatter for Eclipse, which is a part of this repo (eclipse_formatting.xml)

Yeah, sorry but it can't catch everything.

ok, will change

jihoonson · 2017-11-19T02:44:23Z

+    } else if (serializedSketch instanceof DoublesSketch) {
+      return (DoublesSketch) serializedSketch;
+    }
+    throw new IllegalStateException(


We usually use ISE() because it supports string format.

jihoonson · 2017-11-19T02:52:33Z

+    return splitPoints;
+  }
+
+  // comparing histograms doesn't make much sense, so this comparator pretends that everything is equal


I think it's better to throw an exception.
BTW, ApproximateHistogramPostAggregator returns a comparator comparing histogram's count. I'm not sure the same approach is possible for this class.

I think I was told that throwing here would be a bad idea. And comparing by total count doesn't make much sense. I think we can just do nothing.

Hmm, would you tell me why it's a bad idea? I think it's better because getComparator() is used for limiting the number of results, as in a TopN query, and users can get completely different results even though they run the same query multiple times.
Well, a better way is to check LimitSpec is specified with DoublesSketchToHistogramPostAggregator together before query execution, but I think it's beyond the scope of this PR.

I heard that there are some systems like dashboard frontends and such, which generate queries automatically, and always use some ordering. We don't want to break compatibility with such systems. And providing some ordering, which doesn't make much sense, also sounds like a bad idea.

I agree that providing some ordering doesn't make sense.
But, I think there is no compatibility issue because this is a new feature and the systems generating Druid queries can change their logic to not include ordering for this new postAggregator.

Yes, this particular aggregator is a new feature, but we don't want it to have some properties, which would be obstacles to integration with other systems like Hive or Pivot. These systems just assume some ordering by default. I don't think we are in a position to dictate the rules.

@jihoonson not comparable does not mean it's an error; IMO it means we can expect a randomly ordered set instead of a sorted one.
Some clients (including Pivot) add a default limit spec with order by on the selected metric; not sure if we want to fail the request or return a randomly ordered set?

What I'm concerned with is, when people get a result of a query with ordering and a limit, they expect an ordered top-n result. But we cannot guarantee that the result is ordered in some order with this post aggregator, so I think this is an error of unsupported feature.

I understand what you guys are concerned with, but I think we can help the clients do the right thing by, for example, explicitly stating in the release notes that ordering by this post aggregator is not supported.

> Yes, this particular aggregator is a new feature, but we don't want it to have some properties, which would be obstacles to integration with other systems like Hive or Pivot.

I don't think this is an obstacle to other ecosystems of Druid. This post aggregator is newly added in this patch, and they can add some special logic to handle this post aggregator when they decide to support it.

> These systems just assume some ordering by default. I don't think we are in a position to dictate the rules.

I'm not sure why you think so. Every system evolves over time, and its ecosystem should evolve with it. Assuming every result can be ordered may have been true so far, but it becomes wrong with this post aggregator. There is a recent, similar example: we recently added support for numeric dimensions. Druid's ecosystem could previously assume that every dimension has the string type, but that assumption is now wrong.
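
The "throw instead of pretending everything is equal" option argued for above might look like the following minimal sketch. The class name is hypothetical, and the choice of UnsupportedOperationException is an assumption made for illustration; the thread does not say which exception type was used.

```java
import java.util.Comparator;

// Hypothetical stand-in for a post-aggregator whose result is an array (e.g. a histogram).
final class HistogramPostAggregatorSketch {
  // An ordering over histograms is undefined, so fail loudly rather than return
  // a comparator that treats all values as equal and yields arbitrary top-N results.
  Comparator<double[]> getComparator() {
    throw new UnsupportedOperationException("Comparing histograms is not supported");
  }
}
```

With this approach a query that orders or limits by such a post-aggregator fails as soon as the comparator is requested, instead of returning results whose order can change between runs.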

jihoonson · 2017-11-19T02:54:14Z

+    return sketch.getQuantiles(fractions);
+  }
+
+  // comparing arrays of quantiles doesn't make much sense, so this comparator


Same comment here.

I don't see how comparing these arrays would make any sense at all.

jihoonson · 2017-11-19T02:55:28Z

+    return sketch.toString();
+  }
+
+  // comparing sketch summaries doesn't make much sense, so this comparator


Same comment here.

Again, I don't see how comparing sketch summaries can be helpful.

jihoonson · 2017-11-19T03:04:49Z

+
+import com.yahoo.sketches.quantiles.UpdateDoublesSketch;
+
+public class GenerateTestData


Hmm, do you think we may need to modify test data someday? It should be rare because we should also fix the expected results in unit tests accordingly.
If you think it's needed, please add some comments on this class to let others know this is used for generating test data for DoublesSketchAggregatorTest.

drcrallen · 2017-11-30T14:49:47Z

@AlexanderSaydakov please comment when this is ready for another round of review. This looks like a very useful feature.

AlexanderSaydakov · 2017-11-30T18:58:35Z

I believe it is ready. I addressed the last two points: NilColumnValueSelector instead of null and throwing exceptions instead of providing dummy comparators

jihoonson · 2017-11-30T23:55:25Z

@AlexanderSaydakov thanks for the update. I'll finish my review soon.

jihoonson · 2017-12-01T04:49:39Z

@AlexanderSaydakov the latest change looks good to me. Would you please add some documents for these new aggregators? Please see https://github.com/druid-io/druid/blob/master/docs/content/development/extensions-core/datasketches-aggregators.md as an example. Also please add this new extension to the core extensions list here.

AlexanderSaydakov · 2017-12-01T20:26:11Z

This is not a new extension, but a part of the existing datasketches extension. We need to rewrite the document so that it would no longer assume datasketches means Theta sketch aggregator for approximate count-distinct. We are going to have more soon: new HLL sketch aggregator and Tuple sketch aggregator (ArrayOfDoubles). I have them almost ready, just need to brush up based on this review of Quantiles sketch aggregator.

jihoonson · 2017-12-01T20:43:19Z

You are right. Do you want to rewrite the document at once after merging all your patches? It sounds good to me.

jihoonson · 2017-12-05T03:49:39Z

@drcrallen @himanshug @leventov @gianm @akashdw do you have further comments? If not, I'm going to merge this PR.

jon-wei · 2017-12-06T04:47:26Z

looks like there's a conflict with this PR that removed the IOPeon class: #4762

jihoonson · 2017-12-06T05:00:57Z

@jon-wei thanks. Raised a PR.

himanshug · 2017-12-06T16:44:30Z

@jihoonson thanks for reviewing it ... LGTM . We actually have had this code internally reviewed and used a bit beforehand.

jihoonson · 2017-12-06T23:56:07Z

@himanshug good to know. This patch is awesome. I'm looking forward to the follow-up patches.

jon-wei · 2018-01-05T22:20:39Z

@AlexanderSaydakov Can you provide a doc update for this patch?

leventov · 2018-01-11T14:10:05Z

+  {
+    if (metricFactory.getColumnCapabilities(fieldName) != null
+        && ValueType.isNumeric(metricFactory.getColumnCapabilities(fieldName).getType())) {
+      final ColumnValueSelector<Double> selector = metricFactory.makeColumnValueSelector(fieldName);


This variable should have type BaseDoubleColumnValueSelector, as well as all the way down.

leventov · 2018-01-11T14:10:47Z

+      }
+      return new DoublesSketchBuildAggregator(selector, k);
+    }
+    final ColumnValueSelector<DoublesSketch> selector = metricFactory.makeColumnValueSelector(fieldName);


Should use BaseObjectColumnValueSelector<DoublesSketch>, as well as all the way down

leventov · 2018-01-11T14:21:45Z

+      return (DoublesSketch) serializedSketch;
+    }
+    throw new ISE(
+        "Object is not of a type that can be deserialized to a quantiles DoublsSketch: "


Use String.format-style formatting instead of concatenation
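
As a rough sketch of this suggestion: an exception whose constructor takes a format string and arguments lets the message be written without concatenation. FormattedStateException below is a hypothetical stand-in written for this example, not Druid's ISE class, and the helper method is likewise illustrative.

```java
// Hypothetical minimal stand-in for an exception that accepts a format string.
final class FormattedStateException extends IllegalStateException {
  FormattedStateException(String format, Object... args) {
    super(String.format(format, args));
  }
}

final class DeserializeExample {
  // Rejects an object that is neither a serialized sketch nor a sketch instance.
  static void rejectUnknown(Object serializedSketch) {
    throw new FormattedStateException(
        "Object is not of a type that can be deserialized to a quantiles DoublesSketch: %s",
        serializedSketch.getClass()
    );
  }
}
```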

kyleboyle · 2018-01-30T16:07:07Z

Would someone be able to compare/contrast using this doubles sketch aggregator versus the existing approxHistogramFold aggregator? thanks

AlexanderSaydakov · 2018-01-30T23:36:56Z

I didn't study the approximateHistogram in detail. I see the following claims in the documentation: "there are no formal error bounds on the approximation" and "the algorithm only works well if the data is randomly distributed" (horrible approximation for sorted input). Both these things trigger a loud alarm in my head. I would go so far as to say that any approximate method is useless if you don't know how accurate the result is.

jon-wei · 2018-01-30T23:40:05Z

@AlexanderSaydakov can you provide docs for this feature? I'd like to provide a link to them in the 0.12.0 release notes

AlexanderSaydakov · 2018-01-30T23:56:34Z

I need to work on the docs. So far the datasketches module had the count-distinct Theta sketch aggregator. Now we have the quantiles sketch, tuple sketch pull request and HLL sketch (also count-distinct, but super compact) ready to be submitted. The page for Theta sketch aggregator is already quite big, perhaps we need to have an index page and separate pages for each sketch type.

numeric quantiles sketch aggregator

8440c8a

it seems that we need to synchronize all methods, which modify the state

b6eae59

Seems like a false positive with -Pstrict

826a54e

AlexanderSaydakov added 3 commits October 30, 2017 12:54

code style fix

5b64b52

code style fix

a2d0396

use sketches-core-0.10.3

e57a870

moved cache ids to the central place

65e193d

akashdw reviewed Nov 2, 2017

AlexanderSaydakov added 4 commits November 2, 2017 13:37

better class names

a2e0d77

support large columns

3f23722

explained autodetection, added exception

7c9d0b6

added comments regarding sketches moving on heap

7e15029

support reindexing

cacd6a5

jihoonson requested changes Nov 19, 2017

AlexanderSaydakov added 5 commits November 20, 2017 16:03

implemented suggestions from jihoonson

84a9c79

style fix

a503542

use max(k, other.k) for better accuracy

d217263

check for NilColumnValueSelector instead of null

3466bab

throw exceptions instead of providing no-op comparators

6d2f903

jihoonson approved these changes Dec 1, 2017

akashdw approved these changes Dec 5, 2017

jihoonson merged commit 45f91a2 into apache:master Dec 5, 2017

jihoonson mentioned this pull request Dec 6, 2017

Fix DoublesSketchComplexMetricSerde.getSerializer() #5140

Merged

jon-wei added this to the 0.12.0 milestone Jan 5, 2018

jon-wei mentioned this pull request Jan 5, 2018

[WIP] Druid 0.12.0 release notes #5211

Closed

leventov reviewed Jan 11, 2018

AlexanderSaydakov mentioned this pull request Mar 21, 2018

documentation for quantiles and tuple sketch modules druid-io/druid-io.github.io#448

Merged

jihoonson mentioned this pull request May 8, 2019

TDigest backed sketch aggregators #7331

Merged

AlexanderSaydakov mentioned this pull request Jun 21, 2019

locking in Theta sketch buffer aggregator #7938

Closed

