Adds bloom filter aggregator to 'druid-bloom-filters' extension #6397

leventov merged 41 commits into apache:master
Conversation
Force-pushed a4f9984 to a75b278
Consider making this valid JSON so it doesn't get syntax highlighted
I believe that this is really only ugly on github and it looks ok translated to the website docs
Refers to bloom aggregator here, but in the JSON spec the type is bloomFilter.
bloom is the correct value to be consistent with the filter type name, updated docs to reflect that.
I think it'd be worthwhile under maxNumEntries to discuss the implications of having more elements than the value provided here. Also, any discussion on how to choose an appropriate value here to get a given false-positive rate would also be helpful.
Hmm, digging into it, in BloomKFilter the false positive rate is not controllable in the manner of BloomFilter, and is fixed to the default of 5%. However I guess that can be indirectly controlled by increasing the maxNumEntries, though that's kind of lame. Having a higher cardinality than the value of maxNumEntries will cause the false positive probability to reach 1, constructing a useless bloom filter, so that should definitely be added to the docs.
Updated docs to include fixed 5% false positive rate, though no formula for how changing maxNumEntries affects that yet.
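For reference, the relationship between maxNumEntries and the false positive rate follows the standard Bloom filter formulas; a minimal sketch of that math (generic Bloom filter theory, which may differ from BloomKFilter's exact internals):

```java
// Standard Bloom filter sizing math (generic theory, not necessarily
// BloomKFilter's implementation): for n expected entries and target false
// positive probability p, derive the bit count m and hash count k.
public class BloomSizing
{
  static long optimalNumBits(long n, double p)
  {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  static int optimalNumHashes(long n, long m)
  {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  // expected false positive probability after 'inserted' distinct elements
  static double falsePositiveProbability(long m, int k, long inserted)
  {
    return Math.pow(1 - Math.exp(-(double) k * inserted / m), k);
  }

  public static void main(String[] args)
  {
    long n = 100_000;  // maxNumEntries
    long m = optimalNumBits(n, 0.05);
    int k = optimalNumHashes(n, m);
    // at design capacity the filter sits near the 5% target...
    System.out.printf("bits=%d hashes=%d fppAtCapacity=%.3f%n",
        m, k, falsePositiveProbability(m, k, n));
    // ...but exceeding maxNumEntries by 10x drives the rate toward 1
    System.out.printf("fppAt10x=%.3f%n", falsePositiveProbability(m, k, 10 * n));
  }
}
```

This is why a cardinality well above maxNumEntries makes the filter useless, as noted above.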
Perhaps make these final and similarly for other classes
There's quite a lot of duplicate or similar code between this and BloomFilterBufferAggregator. Any opportunity to consolidate?
Consolidated common code into BaseBloomFilterAggregator and BaseBloomFilterBufferAggregator
0x30 and 0x31 are already in use. Also, position those variables at the end of the list.
Oops, originally did this work in an older branch where these didn't exist yet I think, will fix.
Could you make this kind of mistake impossible in the future by moving all codes into an enum with a `byte id;` field, and adding a static initializer that checks that all enum constants have different codes?
Also change CacheKeyBuilder's constructor to accept the enum constant instead of a byte param, to prohibit bypassing this enum.
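The suggested pattern might look like this sketch (enum name and byte values are illustrative, not Druid's actual cache id registry):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the suggested registry: every cache key code lives in
// one enum, and a static check makes duplicate byte ids fail at class load.
public enum AggregatorCacheId
{
  CARDINALITY((byte) 0x30),
  BLOOM_FILTER((byte) 0x31),
  BLOOM_FILTER_MERGE((byte) 0x32);

  private final byte id;

  AggregatorCacheId(byte id)
  {
    this.id = id;
  }

  public byte getId()
  {
    return id;
  }

  static {
    // a duplicate code throws here, so the mistake can't reach a release
    Set<Byte> seen = new HashSet<>();
    for (AggregatorCacheId constant : values()) {
      if (!seen.add(constant.getId())) {
        throw new IllegalStateException("Duplicate cache id: " + constant);
      }
    }
  }
}
```

A CacheKeyBuilder overload accepting the enum constant rather than a raw byte would then make bypassing the registry impossible.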
Could avoid creating an array by iterating the row itself.
DoubleColumnSelector must appear only in implements clauses and nowhere else. See its Javadoc. Same for float and long.
Please add comment "nothing to close"
Please add a comment "nothing to close"
Are you sure that "Aggregator" should be part of the name of this class?
I think it's not necessary, though I was just following cardinality aggregator.
Cache the value in a static, don't create a throwaway each time.
What sort of cache is most appropriate? A static int2object map? caffeine?
"cache" = set to a static final field, I meant here
I don't think a static works since the result depends on maxNumEntries, which is a query time parameter. If this method gets called multiple times I can set an instance field to this value at constructor time, or I can see if the math that computes the size of the long array inside is accessible to call that directly.
Direct computation, if not supported by the library directly (maybe we could contribute that?) could be dangerous if the algorithm in the library changes in some version.
I think most of the time only one maxNumEntries value will be seen per JVM; a small Int2Object map (stopped being populated after, say, 10 entries) should work.
*Int2IntMap, you could cache the final values.
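The bounded-memoization idea could be sketched like this (illustrative only; a plain HashMap stands in for fastutil's Int2IntMap to keep the example self-contained):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Caches a per-maxNumEntries size computation, and stops accepting new keys
// after a small bound since few distinct values are expected per JVM.
public class BoundedSizeCache
{
  private static final int MAX_CACHED_KEYS = 10;

  private final Map<Integer, Integer> cache = new HashMap<>();
  private final IntUnaryOperator compute;

  public BoundedSizeCache(IntUnaryOperator compute)
  {
    this.compute = compute;
  }

  public int get(int maxNumEntries)
  {
    Integer cached = cache.get(maxNumEntries);
    if (cached != null) {
      return cached;
    }
    int size = compute.applyAsInt(maxNumEntries);
    if (cache.size() < MAX_CACHED_KEYS) {
      cache.put(maxNumEntries, size);
    }
    return size;
  }

  public static void main(String[] args)
  {
    int[] calls = {0};
    BoundedSizeCache cache = new BoundedSizeCache(n -> {
      calls[0]++;
      return n * 8;  // stand-in for the real serialized-size computation
    });
    cache.get(100_000);
    cache.get(100_000);
    System.out.println("computed " + calls[0] + " time(s)");  // computed 1 time(s)
  }
}
```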
Added a method to our copied BloomKFilter to compute the size required given a number of entries, avoiding this throwaway
It's a header written during serialization. I'll add a comment/link to https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomKFilter.java#L302 or see if I can find a better way to get this information.
Now uses new method on copied BloomKFilter to compute the size
Could compare the number of set bits.
Added method to copied BloomKFilter to count the number of set bits, allowing ordering results by density of bloom filter
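The set-bit count backing that comparator is cheap to compute with Long.bitCount; a minimal standalone sketch (not the actual BloomKFilter method):

```java
// Counts set bits across a bloom filter's backing long[] words; a denser
// filter (more set bits) can then be ordered against a sparser one.
public class BloomBitCount
{
  static long countSetBits(long[] words)
  {
    long count = 0;
    for (long word : words) {
      count += Long.bitCount(word);  // popcount per 64-bit word
    }
    return count;
  }

  public static void main(String[] args)
  {
    // 0b1011 has 3 set bits, -1L has all 64 set, 0L has none
    System.out.println(countSetBits(new long[]{0b1011L, -1L, 0L}));  // 67
  }
}
```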
What does this variable name mean?
Heh, oops, missed some clean up before opening the PR
Force-pushed fb270f6 to 03d6b9e
This PR is dependent on #6546 being resolved.
Force-pushed 03d6b9e to bf44d3f
Force-pushed bf44d3f to 21eb78f
@dclim, @leventov (and anyone else interested) I think this PR is ready for review again. I think I've addressed all existing comments via code changes; apologies for the rebases - I jumped the gun a bit opening this before the rest of the bloom filter extension had stabilized. I've also added a bunch of methods to our copy of the Hive BloomKFilter to enable in situ manipulation of BloomKFilters that are serialized into a ByteBuffer, to allow much more memory efficient buffer aggs, which I will attempt to get pushed to the upstream implementation soon to avoid diverging too much. Additionally the documentation has been improved a bit for the extension, and I have simplified the code where possible in response to comments.
Any more comments on this PR, @leventov?
@leventov would you mind if I merge this soon so I can unblock another branch I'm sitting on that adds SQL support on top of this? If you find any additional issues I am happy to address them in a future patch.
```java
@Override
public void inspectRuntimeShape(RuntimeShapeInspector inspector)
{
  inspector.visit("selector", selectorPlus.getSelector());
}
```
Apparently selectorPlus.getColumnSelectorStrategy() should be inspected too. Please, don't believe me and read the documentation of HotLoopCallee.inspectRuntimeShape() to verify that.
After refactor all that is left is 'selector' which i think is all that needs inspected.
```java
collector.merge((BloomKFilter) other);
} else if (other instanceof ByteBuffer) {
  // fun fact: because bloom filter agg factory deserialize returns a byte buffer to avoid unnecessary serde,
  // but group by v1 ends up trying to merge bytebuffers from buffer aggs with this agg instead of the buffer
```
Please refer to this as GroupByQueryEngine (so-called groupBy V1) to ease navigation and for clarity.
```java
{
  final ColumnValueSelector<BloomKFilter> selector = metricFactory.makeColumnValueSelector(fieldName);
  if (selector instanceof NilColumnValueSelector) {
    throw new ISE("WTF?! Unexpected NilColumnValueSelector");
```
Why this? Many other object aggregators support absent columns.
To expand on this comment, when examining how this would be called my conclusion was that this would be an unexpected condition in the merge aggregator, because the output of a null column from the aggregator would still be a bloom filter with the null or default value which is what this would see. Did I misinterpret or miss a situation where this could actually happen?
I just noticed that this is the merge aggregator. Then it probably makes sense, but there should be a comment and the error message should be more descriptive.
But then it appears that BloomFilterAggregatorFactory doesn't special-case NilColumnValueSelector, however it probably should for performance.
After refactor this now special cases NilColumnValueSelector and I think behaves in a slightly more correct manner of producing a totally empty bloom filter instead of a bloom filter with a null value
```java
import java.nio.ByteBuffer;

/**
 * This exists so bloom filter agg has something to register so group by v1 will work, but isn't actually used
```
Please make "bloom filter agg" and "group by v1" Javadoc links.
```java
@Override
public ComplexMetricExtractor getExtractor()
{
  throw new UnsupportedOperationException("How can this be?");
}
```
A message like "Bloom filter aggregator is query-time only" might be more constructive.
```java
import java.nio.ByteBuffer;

public interface BloomFilterAggregatorColumnSelectorStrategy<TValueSelector> extends ColumnSelectorStrategy
```
It seems to me that BloomFilterAggregatorColumnSelectorStrategy and BloomFilterAggregatorColumnSelectorStrategyFactory are too shallow abstractions. Probably it's better to inline the logic of BloomFilterAggregatorColumnSelectorStrategyFactory into factorize() and factorizeBuffered() and the logic of BloomFilterAggregatorColumnSelectorStrategy into respective subclasses of BloomFilterBufferAggregator and BloomFilterAggregator.
I used the same abstraction as CardinalityAggregator, should it be refactored in this manner as well? (not in this PR, but later at least)
To clarify you're suggesting something like having factorize produce a StringBloomFilterAggregator, LongBloomFilterAggregator, etc with the same for buffer aggs? I'll have a look at reworking it.
Yes, exactly. I think this Strategy and StrategyFactory is too much and the logic is lost between endless classes.
Yes, I think cardinality should be refactored too. I've created #6909.
Refactored as suggested; fairly large shuffling around of stuff, but I think it is maybe cleaner 👍
```java
{
  if (selector.getRow().size() > 1) {
    selector.getRow().forEach(v -> {
      String value = selector.lookupName(v);
```
I guess when DimensionSelector.nameLookupPossibleInAdvance() is true for this dimension selector, it might be much faster to hash indexes rather than strings. See other usages of nameLookupPossibleInAdvance() in the codebase.
Javadocs aren't super clear and it isn't obvious to me yet looking at code, does nameLookupPossibleInAdvance indicate that indexes uniquely map to the same value across all segments? That would need to be true for this to work, and it would also require modifications to the BloomDimFilter implementation to be able to work with this as well.
I think the main use case of this aggregator right now is to be able to produce bloom filters from data in druid, which can be used as input to additional queries to filter other data in druid using BloomDimFilter which this extension originally introduced, so the goal is to have them operate in the same manner.
It definitely is expensive to do the hashing, it feels even apparent on the BloomDimFilter side of things, but I think this optimization is maybe out of scope of this PR
> Javadocs aren't super clear and it isn't obvious to me yet looking at code, does nameLookupPossibleInAdvance indicate that indexes uniquely map to the same value across all segments? That would need to be true for this to work, and it would also require modifications to the BloomDimFilter implementation to be able to work with this as well.

Yes, map uniquely. Feel free to clarify that Javadoc right in this PR.

> I think the main use case of this aggregator right now is to be able to produce bloom filters from data in druid, which can be used as input to additional queries to filter other data in druid using BloomDimFilter which this extension originally introduced, so the goal is to have them operate in the same manner.
>
> It definitely is expensive to do the hashing, it feels even apparent on the BloomDimFilter side of things, but I think this optimization is maybe out of scope of this PR

OK, then please leave a comment somewhere with an explanation why the optimization is not applied.
```java
 * ByteBuffer, e.g. all add and merge methods. Test methods were not added because we don't need them.. but would
 * probably be chill to do so it is symmetrical.
 *
 * Todo: remove this and begin using hive-storage-api version again once
```
But above in this PR it is mentioned that the Bloom Filter aggregator is currently compute-time only; how is the Hive integration relevant then?
…yFactory to instead use specialized aggregators for each supported column type, other review comments
```java
{
  final ColumnValueSelector<BloomKFilter> selector = metricFactory.makeColumnValueSelector(fieldName);
  if (selector instanceof NilColumnValueSelector) {
    throw new ISE("WTF?! Unexpected NilColumnValueSelector");
```
```java
return new BloomFilterAggregator(selectorPlus, maxNumEntries);
if (selector instanceof NilColumnValueSelector) {
  return new NilBloomFilterAggregator((NilColumnValueSelector) selector, filter);
```
Is it important that BloomKFilter has specifically maxNumEntries? If not, new BloomKFilter(0) (or 1) could be cached in a constant.
In any case, (NilColumnValueSelector) selector shouldn't be passed down; NilBloomFilterAggregator could call super(NilColumnValueSelector.instance()) in its constructor.
BloomKFilter must be the same size to merge, so it's not possible to make a constant. 👍 on the signature change though, will do that.
Ok, please add a comment noting this.
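The same-size constraint follows from how bloom filter merges work; a hedged standalone sketch (not BloomKFilter's actual merge method):

```java
import java.util.Arrays;

// A bloom filter merge is a bitwise OR of the backing bit arrays, which is
// only meaningful when both filters were built with the same bit count (and
// therefore the same maxNumEntries), hence no shared constant filter.
public class BloomMergeSketch
{
  static long[] merge(long[] a, long[] b)
  {
    if (a.length != b.length) {
      throw new IllegalArgumentException("bloom filters must be the same size to merge");
    }
    long[] merged = new long[a.length];
    for (int i = 0; i < a.length; i++) {
      merged[i] = a[i] | b[i];  // union of the two membership sets
    }
    return merged;
  }

  public static void main(String[] args)
  {
    System.out.println(Arrays.toString(merge(new long[]{0b0011L}, new long[]{0b0110L})));  // [7]
  }
}
```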
```java
import java.util.List;
import java.util.Objects;

public class BloomFilterAggregatorFactory extends AggregatorFactory
```
As far as I could tell while implementing this, makeAggregateCombiner is only called to merge segments at ingestion time; the issue that those pulls reference, #6877, maybe jibes with that since it is an exception during index merging.
The long term idea is to replace the remaining uses of combine() with AggregateCombiner and remove combine() method, to reduce repetition. When this time comes, it will be harder to implement that for every aggregator. So IMO it's better to implement makeAggregateCombiner() in all aggregators in core Druid eagerly.
Added BloomFilterAggregateCombiner
```java
if (capabilities == null) {
  BaseNullableColumnValueSelector selector = columnFactory.makeColumnValueSelector(field.getDimension());
  if (selector instanceof NilColumnValueSelector) {
    return new NilBloomFilterAggregator((NilColumnValueSelector) selector, filter);
```
In some other complex aggregators, this thing is called "NoOp" aggregator. But IMO it would be more correct to call it "Empty". "Nil" sounds like it should return null from Aggregator.get(), but it returns an empty bloom filter.
👍 agree, will change to "Empty"
Could you please rename "NoOp" aggregators (I see it in ArrayOfDoublesSketch), and whatever other nonstandard names other complex aggregators use (I only see that DistinctCount already uses "Empty") and align everything to the same convention?
Created #6934 and assigned to myself to do as a follow-up PR
Still LGTM after latest changes
Thanks for review everyone 🤘
…he#6397)

* blooming aggs
* partially address review
* fix docs
* minor test refactor after rebase
* use copied bloomkfilter
* add ByteBuffer methods to BloomKFilter to allow agg to use in place, simplify some things, more tests
* add methods to BloomKFilter to get number of set bits, use in comparator, fixes
* more docs
* fix
* fix style
* simplify bloomfilter bytebuffer merge, change methods to allow passing buffer offsets
* oof, more fixes
* more sane docs example
* fix it
* do the right thing in the right place
* formatting
* fix
* avoid conflict
* typo fixes, faster comparator, docs for comparator behavior
* unused imports
* use buffer comparator instead of deserializing
* striped readwrite lock for buffer agg, null handling comparator, other review changes
* style fixes
* style
* remove sync for now
* oops
* consistency
* inspect runtime shape of selector instead of selector plus, static comparator, add inner exception on serde exception
* CardinalityBufferAggregator inspect selectors instead of selectorPluses
* fix style
* refactor away from using ColumnSelectorPlus and ColumnSelectorStrategyFactory to instead use specialized aggregators for each supported column type, other review comments
* adjustment
* fix teamcity error?
* rename nil aggs to empty, change empty agg constructor signature, add comments
* use stringutils base64 stuff to be chill with master
* add aggregate combiner, comment
This PR, building on top of the work introduced in #6222, extends `druid-bloom-filters` with a query time aggregator, allowing bloom filters to be computed from query results, which can then be used as input to `bloom` filters in subsequent queries.

Example query:
```json
{
  "queryType": "timeseries",
  "dataSource": "wikiticker",
  "intervals": ["2015-09-12T00:00:00.000/2015-09-13T00:00:00.000"],
  "granularity": "day",
  "aggregations": [
    {
      "type": "bloom",
      "name": "userBloom",
      "maxNumEntries": 100000,
      "field": {
        "type": "default",
        "dimension": "user",
        "outputType": "STRING"
      }
    }
  ]
}
```

Example results:

```json
[{"timestamp":"2015-09-12T00:00:00.000Z","result":{"userBloom":"BAAAJhAAAA..."}}]
```
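As a hedged illustration of the round trip described above (field names follow the extension's BloomDimFilter spec; the truncated base64 value from the example result is shown only as a placeholder), the produced filter could be fed back into a subsequent query:

```json
{
  "queryType": "timeseries",
  "dataSource": "wikiticker",
  "intervals": ["2015-09-12T00:00:00.000/2015-09-13T00:00:00.000"],
  "granularity": "day",
  "filter": {
    "type": "bloom",
    "dimension": "user",
    "bloomKFilter": "BAAAJhAAAA..."
  },
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}
```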