Adds bloom filter aggregator to 'druid-bloom-filters' extension #6397
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,22 +24,44 @@ title: "Bloom Filter" | |
|
|
||
| # Bloom Filter | ||
|
|
||
| Make sure to [include](../../operations/including-extensions.html) `druid-bloom-filter` as an extension. | ||
| This extension adds the ability to both construct bloom filters from query results, and filter query results by testing | ||
| against a bloom filter. Make sure to [include](../../operations/including-extensions.html) `druid-bloom-filter` as an | ||
| extension. | ||
|
|
||
| BloomFilter is a probabilistic data structure for set membership check. | ||
| Following are some characterstics of BloomFilter | ||
| A BloomFilter is a probabilistic data structure for performing a set membership check. A bloom filter is a good candidate | ||
| to use with Druid for cases where an explicit filter is impossible, e.g. filtering a query against a set of millions of | ||
| values. | ||
|
|
||
| Following are some characteristics of BloomFilters: | ||
| - BloomFilters are highly space efficient when compared to using a HashSet. | ||
| - Because of the probabilistic nature of bloom filter false positive (element not present in bloom filter but test() says true) are possible | ||
| - false negatives are not possible (if element is present then test() will never say false). | ||
| - The false positive probability is configurable (default: 5%) depending on which storage requirement may increase or decrease. | ||
| - Lower the false positive probability greater is the space requirement. | ||
| - Bloom filters are sensitive to number of elements that will be inserted in the bloom filter. | ||
| - During the creation of bloom filter expected number of entries must be specified.If the number of insertions exceed the specified initial number of entries then false positive probability will increase accordingly. | ||
| - Because of the probabilistic nature of bloom filters, false positive results are possible (an element was not actually | ||
| inserted into a bloom filter during construction, but `test()` returns true). | ||
| - False negatives are not possible (if an element is present, `test()` will never return false). | ||
| - The false positive probability of this implementation is currently fixed at 5%, but increasing the number of entries | ||
| that the filter can hold can decrease this false positive rate in exchange for overall size. | ||
| - Bloom filters are sensitive to the number of elements that will be inserted. The expected number of entries must be specified when the filter is created. If the number of insertions exceeds | ||
| the specified initial number of entries, the false positive probability will increase accordingly. | ||
|
|
||
| This extension is currently based on `org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally, | ||
| this implementation uses Murmur3 as the hash algorithm. | ||
|
|
||
| To construct a BloomKFilter externally with Java to use as a filter in a Druid query: | ||
|
|
||
| ```java | ||
| BloomKFilter bloomFilter = new BloomKFilter(1500); | ||
| bloomFilter.addString("value 1"); | ||
| bloomFilter.addString("value 2"); | ||
| bloomFilter.addString("value 3"); | ||
| ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(); | ||
| BloomKFilter.serialize(byteArrayOutputStream, bloomFilter); | ||
| String base64Serialized = Base64.encodeBase64String(byteArrayOutputStream.toByteArray()); | ||
| ``` | ||
|
|
||
| Internally, this implementation of bloom filter uses Murmur3 fast non-cryptographic hash algorithm. | ||
| This string can then be used in a native or SQL Druid query. | ||
|
|
||
| ### JSON Representation of Bloom Filter | ||
| ## Filtering queries with a Bloom Filter | ||
|
|
||
| ### JSON Specification of Bloom Filter | ||
| ```json | ||
| { | ||
| "type" : "bloom", | ||
|
|
@@ -75,12 +97,68 @@ Bloom filters are supported in SQL via the `bloom_filter_test` operator: | |
| SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>') | ||
| ``` | ||
|
|
||
|
|
||
| ### Expression and Virtual Column Support | ||
|
|
||
| The bloom filter extension also adds a bloom filter [Druid expression](../../misc/math-expr.html) which shares syntax | ||
| with the SQL operator. | ||
|
|
||
| ```sql | ||
| bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>') | ||
| ``` | ||
| ``` | ||
|
|
||
| ## Bloom Filter Query Aggregator | ||
|
|
||
| Input for a `bloomKFilter` can also be created from a Druid query with the `bloom` aggregator. | ||
|
Contributor
Refers to
|
||
|
|
||
| ### JSON Specification of Bloom Filter Aggregator | ||
|
|
||
| ```json | ||
| { | ||
| "type": "bloom", | ||
| "name": <output_field_name>, | ||
|
Contributor
Consider making this valid JSON so it doesn't get syntax highlighted |
||
| "maxNumEntries": <maximum_number_of_elements_for_BloomKFilter>, | ||
| "field": <dimension_spec> | ||
| } | ||
| ``` | ||
|
|
||
| |Property |Description |required? | | ||
| |-------------------------|------------------------------|----------------------------------| | ||
| |`type` |Aggregator Type. Should always be `bloom`|yes| | ||
| |`name` |Output field name |yes| | ||
| |`field` |[DimensionSpec](./../dimensionspecs.html) to add to `org.apache.hive.common.util.BloomKFilter` | yes | | ||
| |`maxNumEntries` |Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`, default `1500`| no | | ||
|
Contributor
I think it'd be worthwhile under
Member
Author
Hmm, digging into it, in
Member
Author
Updated docs to include fixed 5% false positive rate, though no formula for how changing |
||
|
|
||
| ### Example | ||
|
|
||
| ```json | ||
| { | ||
| "queryType": "timeseries", | ||
| "dataSource": "wikiticker", | ||
| "intervals": [ "2015-09-12T00:00:00.000/2015-09-13T00:00:00.000" ], | ||
| "granularity": "day", | ||
| "aggregations": [ | ||
| { | ||
| "type": "bloom", | ||
| "name": "userBloom", | ||
| "maxNumEntries": 100000, | ||
| "field": { | ||
| "type":"default", | ||
| "dimension":"user", | ||
| "outputType": "STRING" | ||
| } | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| response | ||
|
|
||
| ```json | ||
| [{"timestamp":"2015-09-12T00:00:00.000Z","result":{"userBloom":"BAAAJhAAAA..."}}] | ||
| ``` | ||
|
|
||
| These values can then be set in the filter specification above. | ||
|
|
||
| Ordering results by a bloom filter aggregator, for example in a TopN query, will perform a comparatively expensive | ||
| linear scan _of the filter itself_ to count the number of set bits as a means of approximating how many items have been | ||
| added to the set. As such, ordering by an alternate aggregation is recommended if possible. | ||
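
The sizing and ordering notes in the documentation above can be illustrated with a small, self-contained sketch. This is a toy filter, not the `BloomKFilter` that the extension uses: the class name `ToyBloomFilter`, its hash functions, and the sizing and cardinality formulas are the standard textbook ones and are assumptions here, so the real implementation may size and hash differently. It shows why false negatives cannot occur, how the expected entry count and false positive probability drive the bit array size, and why estimating the number of inserted items requires scanning the whole bit set.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Toy Bloom filter for illustration only; hypothetical, not the extension's BloomKFilter. */
final class ToyBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  ToyBloomFilter(int expectedEntries, double falsePositiveProbability) {
    this.numBits = optimalNumBits(expectedEntries, falsePositiveProbability);
    this.numHashes = optimalNumHashes(expectedEntries, numBits);
    this.bits = new BitSet(numBits);
  }

  /** Textbook sizing: m = -n * ln(p) / (ln 2)^2. */
  static int optimalNumBits(int n, double p) {
    return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  /** Textbook hash count: k = (m / n) * ln 2, at least 1. */
  static int optimalNumHashes(int n, int m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  void addString(String value) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(value, i));
    }
  }

  /** True means "possibly present"; false means "definitely not inserted". */
  boolean testString(String value) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(value, i))) {
        return false;
      }
    }
    return true;
  }

  /** Estimates inserted items from the set-bit count: n ~ -(m / k) * ln(1 - X / m). */
  double estimateCardinality() {
    int setBits = bits.cardinality(); // linear scan over the whole bit set
    return -((double) numBits / numHashes) * Math.log(1.0 - (double) setBits / numBits);
  }

  // Double hashing over two simple hashes; an assumption standing in for Murmur3.
  private int index(String value, int i) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    int h1 = Arrays.hashCode(bytes);
    int h2 = fnv1a(bytes);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  private static int fnv1a(byte[] bytes) {
    int hash = 0x811C9DC5;
    for (byte b : bytes) {
      hash ^= (b & 0xff);
      hash *= 16777619;
    }
    return hash;
  }

  public static void main(String[] args) {
    ToyBloomFilter filter = new ToyBloomFilter(1500, 0.05);
    filter.addString("value 1");
    filter.addString("value 2");
    filter.addString("value 3");
    System.out.println("contains 'value 1': " + filter.testString("value 1"));
    System.out.println("estimated entries: " + filter.estimateCardinality());
  }
}
```

Because every inserted value sets all of its bits, a later test of the same value always finds them set, which is the "no false negatives" property. The estimate depends on how many bits are set across the entire filter, which is why ordering by such a value is comparatively expensive.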
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package org.apache.druid.query.aggregation.bloom; | ||
|
|
||
| import org.apache.druid.query.aggregation.Aggregator; | ||
| import org.apache.druid.query.filter.BloomKFilter; | ||
| import org.apache.druid.segment.BaseNullableColumnValueSelector; | ||
|
|
||
| import javax.annotation.Nullable; | ||
|
|
||
| public abstract class BaseBloomFilterAggregator<TSelector extends BaseNullableColumnValueSelector> implements Aggregator | ||
| { | ||
| final BloomKFilter collector; | ||
| protected final TSelector selector; | ||
|
|
||
| BaseBloomFilterAggregator(TSelector selector, BloomKFilter collector) | ||
| { | ||
| this.collector = collector; | ||
| this.selector = selector; | ||
| } | ||
|
|
||
| @Nullable | ||
| @Override | ||
| public Object get() | ||
| { | ||
| return collector; | ||
| } | ||
|
|
||
| @Override | ||
| public float getFloat() | ||
| { | ||
| throw new UnsupportedOperationException("BloomFilterAggregator does not support getFloat()"); | ||
| } | ||
|
|
||
| @Override | ||
| public long getLong() | ||
| { | ||
| throw new UnsupportedOperationException("BloomFilterAggregator does not support getLong()"); | ||
| } | ||
|
|
||
| @Override | ||
| public double getDouble() | ||
| { | ||
| throw new UnsupportedOperationException("BloomFilterAggregator does not support getDouble()"); | ||
| } | ||
|
|
||
| @Override | ||
| public void close() | ||
| { | ||
| // nothing to close | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package org.apache.druid.query.aggregation.bloom; | ||
|
|
||
| import org.apache.druid.query.aggregation.BufferAggregator; | ||
| import org.apache.druid.query.filter.BloomKFilter; | ||
| import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector; | ||
| import org.apache.druid.segment.BaseNullableColumnValueSelector; | ||
|
|
||
| import java.nio.ByteBuffer; | ||
|
|
||
| public abstract class BaseBloomFilterBufferAggregator<TSelector extends BaseNullableColumnValueSelector> implements BufferAggregator | ||
| { | ||
| protected final int maxNumEntries; | ||
| protected final TSelector selector; | ||
|
|
||
| BaseBloomFilterBufferAggregator(TSelector selector, int maxNumEntries) | ||
| { | ||
| this.selector = selector; | ||
| this.maxNumEntries = maxNumEntries; | ||
| } | ||
|
|
||
| abstract void bufferAdd(ByteBuffer buf); | ||
|
|
||
| @Override | ||
| public void init(ByteBuffer buf, int position) | ||
| { | ||
| final ByteBuffer mutationBuffer = buf.duplicate(); | ||
| mutationBuffer.position(position); | ||
| BloomKFilter filter = new BloomKFilter(maxNumEntries); | ||
| BloomKFilter.serialize(mutationBuffer, filter); | ||
| } | ||
|
|
||
| @Override | ||
| public void aggregate(ByteBuffer buf, int position) | ||
| { | ||
| final int oldPosition = buf.position(); | ||
| buf.position(position); | ||
| bufferAdd(buf); | ||
| buf.position(oldPosition); | ||
| } | ||
|
|
||
|
|
||
| @Override | ||
| public Object get(ByteBuffer buf, int position) | ||
| { | ||
| ByteBuffer mutationBuffer = buf.duplicate(); | ||
| mutationBuffer.position(position); | ||
| // | k (byte) | numLongs (int) | bitset (long[numLongs]) | | ||
| int sizeBytes = 1 + Integer.BYTES + (buf.getInt(position + 1) * Long.BYTES); | ||
| mutationBuffer.limit(position + sizeBytes); | ||
| return mutationBuffer.slice(); | ||
| } | ||
|
|
||
| @Override | ||
| public float getFloat(ByteBuffer buf, int position) | ||
| { | ||
| throw new UnsupportedOperationException("BloomFilterBufferAggregator does not support getFloat()"); | ||
| } | ||
|
|
||
| @Override | ||
| public long getLong(ByteBuffer buf, int position) | ||
| { | ||
| throw new UnsupportedOperationException("BloomFilterBufferAggregator does not support getLong()"); | ||
| } | ||
|
|
||
| @Override | ||
| public double getDouble(ByteBuffer buf, int position) | ||
| { | ||
| throw new UnsupportedOperationException("BloomFilterBufferAggregator does not support getDouble()"); | ||
| } | ||
|
|
||
| @Override | ||
| public void close() | ||
| { | ||
| // nothing to close | ||
| } | ||
|
|
||
| @Override | ||
| public void inspectRuntimeShape(RuntimeShapeInspector inspector) | ||
| { | ||
| inspector.visit("selector", selector); | ||
| } | ||
| } |
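
The buffer layout noted in `get()` above, `| k (byte) | numLongs (int) | bitset (long[numLongs]) |`, can be exercised in isolation. The following is a minimal sketch using only `java.nio.ByteBuffer`; the class `BloomBufferLayoutSketch` and its method names are hypothetical and only mimic the layout arithmetic, they are not part of this PR or of `BloomKFilter`. It assumes the buffer uses the default big-endian byte order.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper that mimics the serialized layout; not part of the PR. */
final class BloomBufferLayoutSketch {
  /** Writes an empty filter at position: | k (byte) | numLongs (int) | bitset (long[numLongs]) |. */
  static void writeEmpty(ByteBuffer buf, int position, int k, int numLongs) {
    // Work on a duplicate so the caller's position and limit are left untouched.
    ByteBuffer view = buf.duplicate();
    view.position(position);
    view.put((byte) k);
    view.putInt(numLongs);
    for (int i = 0; i < numLongs; i++) {
      view.putLong(0L);
    }
  }

  /** Size in bytes of the filter stored at position, read from its numLongs header field. */
  static int serializedSize(ByteBuffer buf, int position) {
    return 1 + Integer.BYTES + buf.getInt(position + 1) * Long.BYTES;
  }

  /** Returns a view covering exactly the bytes of the filter stored at position. */
  static ByteBuffer sliceFilter(ByteBuffer buf, int position) {
    ByteBuffer view = buf.duplicate();
    view.position(position);
    view.limit(position + serializedSize(buf, position));
    return view.slice();
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(64);
    writeEmpty(buf, 5, 4, 3);
    System.out.println("serialized size: " + serializedSize(buf, 5));
    System.out.println("slice remaining: " + sliceFilter(buf, 5).remaining());
  }
}
```

Using `duplicate()` before moving the position mirrors `init()` and `get()` in the diff, which avoid disturbing the shared buffer's own position, whereas `aggregate()` saves and restores the position explicitly.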
Consider making this valid JSON so it doesn't get syntax highlighted
I believe that this is really only ugly on github and it looks ok translated to the website docs