Merged
41 commits
7dbc75e
blooming aggs
clintropolis Sep 22, 2018
e1c9f77
partially address review
clintropolis Oct 6, 2018
935a28a
fix docs
clintropolis Oct 8, 2018
c17f8b5
minor test refactor after rebase
clintropolis Oct 23, 2018
03a99bc
use copied bloomkfilter
clintropolis Nov 13, 2018
21eb78f
add ByteBuffer methods to BloomKFilter to allow agg to use in place, …
clintropolis Jan 8, 2019
d1ba9d4
add methods to BloomKFilter to get number of set bits, use in compara…
clintropolis Jan 8, 2019
f284aeb
more docs
clintropolis Jan 9, 2019
71d00cf
fix
clintropolis Jan 9, 2019
cec7706
fix style
clintropolis Jan 9, 2019
ee91f3b
simplify bloomfilter bytebuffer merge, change methods to allow passin…
clintropolis Jan 9, 2019
6470dc6
oof, more fixes
clintropolis Jan 9, 2019
233aa9e
more sane docs example
clintropolis Jan 9, 2019
654a994
fix it
clintropolis Jan 9, 2019
6c04d24
do the right thing in the right place
clintropolis Jan 11, 2019
2e5f43d
formatting
clintropolis Jan 11, 2019
a12bad1
fix
clintropolis Jan 11, 2019
ee6ecd6
avoid conflict
clintropolis Jan 11, 2019
70882c9
typo fixes, faster comparator, docs for comparator behavior
clintropolis Jan 12, 2019
3858cb8
unused imports
clintropolis Jan 12, 2019
2ccc137
use buffer comparator instead of deserializing
clintropolis Jan 12, 2019
ff87a37
striped readwrite lock for buffer agg, null handling comparator, othe…
clintropolis Jan 16, 2019
a635a09
style fixes
clintropolis Jan 16, 2019
34183ac
style
clintropolis Jan 17, 2019
daad5a6
remove sync for now
clintropolis Jan 18, 2019
b310e52
oops
clintropolis Jan 18, 2019
8ded684
Merge remote-tracking branch 'upstream/master' into bloom-filter-aggr…
clintropolis Jan 18, 2019
d6a3809
consistency
clintropolis Jan 18, 2019
d0b90b2
inspect runtime shape of selector instead of selector plus, static co…
clintropolis Jan 21, 2019
435e784
CardinalityBufferAggregator inspect selectors instead of selectorPluses
clintropolis Jan 21, 2019
3bdddb1
fix style
clintropolis Jan 21, 2019
d11f784
Merge remote-tracking branch 'upstream/master' into bloom-filter-aggr…
clintropolis Jan 21, 2019
0f08686
Merge remote-tracking branch 'upstream/master' into bloom-filter-aggr…
clintropolis Jan 22, 2019
74feb97
refactor away from using ColumnSelectorPlus and ColumnSelectorStrateg…
clintropolis Jan 24, 2019
3136ce7
adjustment
clintropolis Jan 24, 2019
a50b2b2
fix teamcity error?
clintropolis Jan 25, 2019
68bb28f
rename nil aggs to empty, change empty agg constructor signature, add…
clintropolis Jan 25, 2019
b61e6f3
Merge remote-tracking branch 'upstream/master' into bloom-filter-aggr…
clintropolis Jan 26, 2019
8ebe1d9
use stringutils base64 stuff to be chill with master
clintropolis Jan 26, 2019
d1a3c44
Merge remote-tracking branch 'upstream/master' into bloom-filter-aggr…
clintropolis Jan 28, 2019
a56615b
add aggregate combiner, comment
clintropolis Jan 28, 2019
104 changes: 91 additions & 13 deletions docs/content/development/extensions-core/bloom-filter.md
@@ -24,22 +24,44 @@ title: "Bloom Filter"

# Bloom Filter

This extension adds the ability to both construct bloom filters from query results, and filter query results by testing
against a bloom filter. Make sure to [include](../../operations/including-extensions.html) `druid-bloom-filter` as an
extension.

A BloomFilter is a probabilistic data structure for performing a set membership check. A bloom filter is a good candidate
to use with Druid for cases where an explicit filter is impossible, e.g. filtering a query against a set of millions of
values.

Following are some characteristics of BloomFilters:
- BloomFilters are highly space efficient when compared to using a HashSet.
- Because of the probabilistic nature of bloom filters, false positive results are possible (an element was not actually
inserted into a bloom filter during construction, but `test()` says true).
- False negatives are not possible (if an element is present then `test()` will never say false).
- The false positive probability of this implementation is currently fixed at 5%, but increasing the number of entries
that the filter can hold can decrease this false positive rate in exchange for overall size.
- Bloom filters are sensitive to the number of elements that will be inserted into the bloom filter. During the creation
of a bloom filter the expected number of entries must be specified. If the number of insertions exceeds the specified
initial number of entries, the false positive probability will increase accordingly.
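The interaction between the expected number of entries and the false positive rate follows the standard bloom filter formulas. The sketch below is plain math, not the actual `org.apache.hive.common.util.BloomKFilter` internals; the class name and constants here are illustrative only, and show how the rate degrades once insertions exceed the sized capacity:

```java
public class BloomMath
{
  // Standard bloom filter false positive estimate: p ≈ (1 - e^(-k*n/m))^k,
  // where m = bits in the filter, n = elements inserted, k = hash functions.
  static double falsePositiveRate(long bits, long numInserted, int numHashes)
  {
    double perBitMiss = Math.exp(-(double) numHashes * numInserted / bits);
    return Math.pow(1.0 - perBitMiss, numHashes);
  }

  // Bits needed to hold n entries at a target false positive rate:
  // m = -n * ln(p) / (ln 2)^2
  static long optimalBits(long expectedEntries, double fpp)
  {
    return (long) Math.ceil(-expectedEntries * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args)
  {
    long bits = optimalBits(1500, 0.05);  // roughly 9350 bits for 1500 entries at 5%
    int hashes = 4;                       // about (m/n) * ln 2 for these numbers
    System.out.println(falsePositiveRate(bits, 1500, hashes));   // close to the 5% design rate
    System.out.println(falsePositiveRate(bits, 15000, hashes));  // degrades toward 1.0 past capacity
  }
}
```

The practical takeaway is to size the expected-entries parameter to at least the expected cardinality of the input; once actual insertions are an order of magnitude past the sized capacity, nearly every membership test returns true.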

This extension is currently based on `org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally,
this implementation uses Murmur3 as the hash algorithm.

To construct a BloomKFilter externally with Java to use as a filter in a Druid query:

```java
BloomKFilter bloomFilter = new BloomKFilter(1500);
bloomFilter.addString("value 1");
bloomFilter.addString("value 2");
bloomFilter.addString("value 3");
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
BloomKFilter.serialize(byteArrayOutputStream, bloomFilter);
String base64Serialized = Base64.encodeBase64String(byteArrayOutputStream.toByteArray());
```

This string can then be used in a native or SQL Druid query.

## Filtering queries with a Bloom Filter

### JSON Specification of Bloom Filter
```json
{
  "type" : "bloom",
  ...
}
```
Contributor

Consider making this valid JSON so it doesn't get syntax highlighted

Member Author

I believe that this is really only ugly on github and it looks ok translated to the website docs

@@ -75,12 +97,68 @@ Bloom filters are supported in SQL via the `bloom_filter_test` operator:

```sql
SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
```


### Expression and Virtual Column Support

The bloom filter extension also adds a bloom filter [Druid expression](../../misc/math-expr.html) which shares syntax
with the SQL operator.

```sql
bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
```

## Bloom Filter Query Aggregator

Input for a `bloomKFilter` can also be created from a Druid query with the `bloom` aggregator.
Contributor

Refers to bloom aggregator here, but in the JSON spec the type is bloomFilter.

Member Author

bloom is the correct value to be consistent with the filter type name, updated docs to reflect that.


### JSON Specification of Bloom Filter Aggregator

```json
{
  "type": "bloom",
  "name": <output_field_name>,
  "maxNumEntries": <maximum_number_of_elements_for_BloomKFilter>,
  "field": <dimension_spec>
}
```

Contributor

Consider making this valid JSON so it doesn't get syntax highlighted

|Property |Description |required? |
|-------------------------|------------------------------|----------------------------------|
|`type` |Aggregator Type. Should always be `bloom`|yes|
|`name` |Output field name |yes|
|`field` |[DimensionSpec](./../dimensionspecs.html) to add to `org.apache.hive.common.util.BloomKFilter` | yes |
|`maxNumEntries` |Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`, default `1500`| no |
Contributor

I think it'd be worthwhile under maxNumEntries to discuss the implications of having more elements than the value provided here. Also, any discussion on how to choose an appropriate value here to get a given false-positive rate would also be helpful.

Member Author

Hmm, digging into it, in BloomKFilter the false positive rate is not controllable in the manner of BloomFilter, and is fixed to the default of 5%. However I guess that can be indirectly controlled by increasing the maxNumEntries, though that's kind of lame. Having a higher cardinality than the value of maxNumEntries will cause the false positive probability to reach 1, constructing a useless bloom filter, so that should definitely be added to the docs.

Member Author

Updated docs to include fixed 5% false positive rate, though no formula for how changing maxNumEntries affects that yet.


### Example

```json
{
  "queryType": "timeseries",
  "dataSource": "wikiticker",
  "intervals": [ "2015-09-12T00:00:00.000/2015-09-13T00:00:00.000" ],
  "granularity": "day",
  "aggregations": [
    {
      "type": "bloom",
      "name": "userBloom",
      "maxNumEntries": 100000,
      "field": {
        "type": "default",
        "dimension": "user",
        "outputType": "STRING"
      }
    }
  ]
}
```

response

```json
[{"timestamp":"2015-09-12T00:00:00.000Z","result":{"userBloom":"BAAAJhAAAA..."}}]
```

These values can then be set in the filter specification above.

Ordering results by a bloom filter aggregator, for example in a TopN query, will perform a comparatively expensive
linear scan _of the filter itself_ to count the number of set bits as a means of approximating how many items have been
added to the set. As such, ordering by an alternate aggregation is recommended if possible.
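As a rough illustration of that linear scan, counting set bits over the serialized form (using the `| k (byte) | numLongs (int) | bitset |` layout noted in the buffer aggregator's `get()`) might look like the sketch below; `SetBitCounter` is a hypothetical helper, not part of the extension:

```java
import java.nio.ByteBuffer;

public class SetBitCounter
{
  // Illustrative sketch only: count set bits in a serialized filter laid out as
  // | k (byte) | numLongs (int) | bitset (long[numLongs]) |
  // The real comparator in the extension may differ in detail.
  static long countSetBits(ByteBuffer buf)
  {
    ByteBuffer dup = buf.duplicate();
    dup.get();                    // skip k (number of hash functions)
    int numLongs = dup.getInt();  // length of the bitset in longs
    long setBits = 0;
    for (int i = 0; i < numLongs; i++) {
      setBits += Long.bitCount(dup.getLong());  // linear scan of the bitset
    }
    return setBits;
  }
}
```

Two filters would then be ordered by comparing their counts, which is why the cost of this ordering grows with the size of the filter itself rather than with the number of rows aggregated.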
@@ -210,7 +210,7 @@ public Pair<String, DerivativeDataSourceMetadata> map(int index, ResultSet r, St
}

/**
* caculate the average data size per segment granularity for a given datasource.
* calculate the average data size per segment granularity for a given datasource.
*
* e.g. for a datasource, there're 5 segments as follows,
* interval = "2018-04-01/2017-04-02", segment size = 1024 * 1024 * 2
@@ -85,7 +85,7 @@ private static Set<String> extractFieldsFromAggregations(List<AggregatorFactory>
}

/**
* caculate the intervals which are covered by interval2, but not covered by interval1.
* calculate the intervals which are covered by interval2, but not covered by interval1.
* result intervals = interval2 - interval1 ∩ interval2
* e.g.
* a list of interval2: ["2018-04-01T00:00:00.000Z/2018-04-02T00:00:00.000Z",
@@ -27,9 +27,12 @@
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.databind.ser.std.StdSerializer;
import org.apache.druid.query.aggregation.bloom.BloomFilterAggregatorFactory;
import org.apache.druid.query.aggregation.bloom.BloomFilterSerde;
import org.apache.druid.query.filter.BloomDimFilter;
import org.apache.druid.query.filter.BloomKFilter;
import org.apache.druid.query.filter.BloomKFilterHolder;
import org.apache.druid.segment.serde.ComplexMetrics;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
@@ -41,10 +44,17 @@ public class BloomFilterSerializersModule extends SimpleModule

  public BloomFilterSerializersModule()
  {
    registerSubtypes(
        new NamedType(BloomDimFilter.class, BLOOM_FILTER_TYPE_NAME),
        new NamedType(BloomFilterAggregatorFactory.class, BLOOM_FILTER_TYPE_NAME)
    );
    addSerializer(BloomKFilter.class, new BloomKFilterSerializer());
    addDeserializer(BloomKFilter.class, new BloomKFilterDeserializer());
    addDeserializer(BloomKFilterHolder.class, new BloomKFilterHolderDeserializer());

    if (ComplexMetrics.getSerdeForType(BLOOM_FILTER_TYPE_NAME) == null) {
      ComplexMetrics.registerSerde(BLOOM_FILTER_TYPE_NAME, new BloomFilterSerde());
    }
  }

  private static class BloomKFilterSerializer extends StdSerializer<BloomKFilter>
@@ -0,0 +1,69 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.druid.query.aggregation.bloom;

import org.apache.druid.query.aggregation.Aggregator;
import org.apache.druid.query.filter.BloomKFilter;
import org.apache.druid.segment.BaseNullableColumnValueSelector;

import javax.annotation.Nullable;

public abstract class BaseBloomFilterAggregator<TSelector extends BaseNullableColumnValueSelector> implements Aggregator
{
  final BloomKFilter collector;
  protected final TSelector selector;

  BaseBloomFilterAggregator(TSelector selector, BloomKFilter collector)
  {
    this.collector = collector;
    this.selector = selector;
  }

  @Nullable
  @Override
  public Object get()
  {
    return collector;
  }

  @Override
  public float getFloat()
  {
    throw new UnsupportedOperationException("BloomFilterAggregator does not support getFloat()");
  }

  @Override
  public long getLong()
  {
    throw new UnsupportedOperationException("BloomFilterAggregator does not support getLong()");
  }

  @Override
  public double getDouble()
  {
    throw new UnsupportedOperationException("BloomFilterAggregator does not support getDouble()");
  }

  @Override
  public void close()
  {
    // nothing to close
  }
}
@@ -0,0 +1,101 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.druid.query.aggregation.bloom;

import org.apache.druid.query.aggregation.BufferAggregator;
import org.apache.druid.query.filter.BloomKFilter;
import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
import org.apache.druid.segment.BaseNullableColumnValueSelector;

import java.nio.ByteBuffer;

public abstract class BaseBloomFilterBufferAggregator<TSelector extends BaseNullableColumnValueSelector> implements BufferAggregator
{
  protected final int maxNumEntries;
  protected final TSelector selector;

  BaseBloomFilterBufferAggregator(TSelector selector, int maxNumEntries)
  {
    this.selector = selector;
    this.maxNumEntries = maxNumEntries;
  }

  abstract void bufferAdd(ByteBuffer buf);

  @Override
  public void init(ByteBuffer buf, int position)
  {
    final ByteBuffer mutationBuffer = buf.duplicate();
    mutationBuffer.position(position);
    BloomKFilter filter = new BloomKFilter(maxNumEntries);
    BloomKFilter.serialize(mutationBuffer, filter);
  }

  @Override
  public void aggregate(ByteBuffer buf, int position)
  {
    final int oldPosition = buf.position();
    buf.position(position);
    bufferAdd(buf);
    buf.position(oldPosition);
  }

  @Override
  public Object get(ByteBuffer buf, int position)
  {
    ByteBuffer mutationBuffer = buf.duplicate();
    mutationBuffer.position(position);
    // | k (byte) | numLongs (int) | bitset (long[numLongs]) |
    int sizeBytes = 1 + Integer.BYTES + (buf.getInt(position + 1) * Long.BYTES);
    mutationBuffer.limit(position + sizeBytes);
    return mutationBuffer.slice();
  }

  @Override
  public float getFloat(ByteBuffer buf, int position)
  {
    throw new UnsupportedOperationException("BloomFilterBufferAggregator does not support getFloat()");
  }

  @Override
  public long getLong(ByteBuffer buf, int position)
  {
    throw new UnsupportedOperationException("BloomFilterBufferAggregator does not support getLong()");
  }

  @Override
  public double getDouble(ByteBuffer buf, int position)
  {
    throw new UnsupportedOperationException("BloomFilterBufferAggregator does not support getDouble()");
  }

  @Override
  public void close()
  {
    // nothing to close
  }

  @Override
  public void inspectRuntimeShape(RuntimeShapeInspector inspector)
  {
    inspector.visit("selector", selector);
  }
}