3 changes: 2 additions & 1 deletion distribution/pom.xml
@@ -45,7 +45,6 @@
<!-- the default value is a repeated flag from the command line, since blank value is not allowed -->
<druid.distribution.pulldeps.opts>--clean</druid.distribution.pulldeps.opts>
</properties>

<profiles>
<profile>
<id>dist</id>
@@ -91,6 +90,8 @@
<argument>-c</argument>
<argument>org.apache.druid.extensions:druid-avro-extensions</argument>
<argument>-c</argument>
<argument>org.apache.druid.extensions:druid-bloom-filter</argument>
<argument>-c</argument>
<argument>org.apache.druid.extensions:druid-datasketches</argument>
<argument>-c</argument>
<argument>org.apache.druid.extensions:druid-hdfs-storage</argument>
45 changes: 45 additions & 0 deletions docs/content/development/extensions-core/bloom-filter.md
@@ -0,0 +1,45 @@
---
layout: doc_page
---

# Druid Bloom Filter

Make sure to [include](../../operations/including-extensions.html) `druid-bloom-filter` as an extension.

A Bloom filter is a probabilistic data structure for performing set membership checks.
Some characteristics of Bloom filters:
- Bloom filters are highly space efficient compared to a HashSet.
- Because a Bloom filter is probabilistic, false positives are possible (the element is not present in the filter, but `test()` returns true).
- False negatives are not possible (if the element is present, `test()` never returns false).
- The false positive probability is configurable (default: 5%); the storage requirement increases or decreases with it.
- The lower the false positive probability, the greater the space requirement.
- Bloom filters are sensitive to the number of elements inserted into them.
- The expected number of entries must be specified when the Bloom filter is created. If the number of insertions exceeds the specified number of entries, the false positive probability increases accordingly.
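
The trade-off between the false positive probability, the expected number of entries, and the space requirement can be illustrated with the textbook Bloom filter sizing formulas. The sketch below is an illustration only: the class and method names are hypothetical, and `BloomKFilter` may round or align its bitset differently, so the exact sizes it produces may not match.

```java
// Textbook Bloom filter sizing formulas (hypothetical helper, not part of Druid or Hive).
final class BloomSizingSketch
{
  // Number of bits m for n expected entries at false positive probability p:
  // m = -n * ln(p) / (ln 2)^2, rounded up.
  static long optimalNumOfBits(long expectedEntries, double fpp)
  {
    return (long) Math.ceil(-expectedEntries * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  // Number of hash functions k = (m / n) * ln 2, rounded to the nearest integer, at least 1.
  static int optimalNumOfHashFunctions(long expectedEntries, long numBits)
  {
    return Math.max(1, (int) Math.round((double) numBits / expectedEntries * Math.log(2)));
  }

  // Approximate false positive probability after `inserted` entries: (1 - e^(-k * n / m))^k.
  static double expectedFpp(long inserted, long numBits, int numHashFunctions)
  {
    return Math.pow(1 - Math.exp(-(double) numHashFunctions * inserted / numBits), numHashFunctions);
  }
}
```

Under these formulas, 1,000,000 expected entries at a 5% false positive probability need roughly 6.2 million bits and 4 hash functions. Inserting twice the expected number of entries into the same filter raises the estimated false positive probability to roughly 27%.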

Internally, this implementation of the Bloom filter uses Murmur3, a fast non-cryptographic hash algorithm.

### JSON Representation of Bloom Filter
```json
{
"type" : "bloom",
"dimension" : <dimension_name>,
"bloomKFilter" : <serialized_bytes_for_BloomKFilter>,
"extractionFn" : <extraction_fn>
}
```

|Property|Description|Required?|
|--------|-----------|---------|
|`type`|Filter type. Must always be `bloom`.|yes|
|`dimension`|The dimension to filter over.|yes|
|`bloomKFilter`|Base64-encoded binary representation of `org.apache.hive.common.util.BloomKFilter`.|yes|
|`extractionFn`|[Extraction function](./../dimensionspecs.html#extraction-functions) to apply to the dimension values.|no|
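
As an illustration of how the properties above fit together, the following sketch assembles a filter object from a dimension name and an already serialized filter. The class and method names are hypothetical, and it assumes that the standard padded Base64 alphabet (what `java.util.Base64.getEncoder()` produces) is what the JSON parser expects for `bloomKFilter`. The dimension name is not JSON-escaped here, so a real client should build the object with a JSON library instead.

```java
import java.util.Base64;

// Hypothetical helper that builds the "bloom" filter JSON by hand, for illustration only.
final class BloomFilterJsonSketch
{
  // serializedFilter is assumed to hold the bytes of an already serialized BloomKFilter.
  static String bloomFilterJson(String dimension, byte[] serializedFilter)
  {
    // Standard Base64 with padding and no line breaks (assumption, see lead-in).
    String encoded = Base64.getEncoder().encodeToString(serializedFilter);
    // The optional extractionFn property is omitted.
    return "{\"type\":\"bloom\","
           + "\"dimension\":\"" + dimension + "\","
           + "\"bloomKFilter\":\"" + encoded + "\"}";
  }
}
```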


### Serialized Format for BloomKFilter
The serialized BloomKFilter consists of:
- 1 byte for the number of hash functions.
- 1 big-endian int (the byte order in which Java's `DataOutputStream` writes) for the number of longs in the bitset.
- The big-endian longs of the BloomKFilter bitset.

Note: `org.apache.hive.common.util.BloomKFilter` provides a `serialize` method that can be used to write a Bloom filter to an `OutputStream`.
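
The byte layout described above can be sketched with the JDK alone. This is a minimal illustration of the layout and not the Hive implementation: the class and method names are hypothetical, and it uses `ByteBuffer` (big-endian by default) in place of an output stream to keep the example free of checked exceptions.

```java
import java.nio.ByteBuffer;

// Hypothetical reader/writer for the layout: 1 byte, 1 big-endian int, then big-endian longs.
final class BloomKFilterFormatSketch
{
  static byte[] write(int numHashFunctions, long[] bitSet)
  {
    // ByteBuffer uses big-endian byte order unless configured otherwise.
    ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES + bitSet.length * Long.BYTES);
    buf.put((byte) numHashFunctions); // number of hash functions
    buf.putInt(bitSet.length);        // number of longs in the bitset
    for (long word : bitSet) {
      buf.putLong(word);              // the bitset itself
    }
    return buf.array();
  }

  static int readNumHashFunctions(byte[] serialized)
  {
    return serialized[0];
  }

  static long[] readBitSet(byte[] serialized)
  {
    ByteBuffer buf = ByteBuffer.wrap(serialized);
    buf.get();                        // skip the hash function count
    long[] bitSet = new long[buf.getInt()];
    for (int i = 0; i < bitSet.length; i++) {
      bitSet[i] = buf.getLong();
    }
    return bitSet;
  }
}
```

A filter with a bitset of `n` longs therefore serializes to `1 + 4 + 8 * n` bytes before Base64 encoding.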
1 change: 1 addition & 0 deletions docs/content/development/extensions.md
@@ -23,6 +23,7 @@ Core extensions are maintained by Druid committers.
|----|-----------|----|
|druid-avro-extensions|Support for data in Apache Avro data format.|[link](../development/extensions-core/avro.html)|
|druid-basic-security|Support for Basic HTTP authentication and role-based access control.|[link](../development/extensions-core/druid-basic-security.html)|
|druid-bloom-filter|Support for providing Bloom filters in Druid queries.|[link](../development/extensions-core/bloom-filter.html)|
|druid-caffeine-cache|A local cache implementation backed by Caffeine.|[link](../development/extensions-core/caffeine-cache.html)|
|druid-datasketches|Support for approximate counts and set operations with [DataSketches](http://datasketches.github.io/).|[link](../development/extensions-core/datasketches-extension.html)|
|druid-hdfs-storage|HDFS deep storage.|[link](../development/extensions-core/hdfs.html)|
65 changes: 65 additions & 0 deletions extensions-core/druid-bloom-filter/pom.xml
@@ -0,0 +1,65 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>org.apache.druid.extensions</groupId>
<artifactId>druid-bloom-filter</artifactId>
<name>druid-bloom-filter</name>
<description>druid-bloom-filter</description>

<parent>
<groupId>org.apache.druid</groupId>
<artifactId>druid</artifactId>
<version>0.13.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
</parent>

<dependencies>
<dependency>
<groupId>org.apache.druid</groupId>
<artifactId>druid-processing</artifactId>
<version>${project.parent.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
<version>2.7.0</version>
</dependency>

<!-- Tests -->
<dependency>
<groupId>org.apache.druid</groupId>
<artifactId>druid-processing</artifactId>
<version>${project.parent.version}</version>
<scope>test</scope>
<type>test-jar</type>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
</dependencies>

</project>
@@ -0,0 +1,43 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.druid.guice;

import com.fasterxml.jackson.databind.Module;
import com.google.inject.Binder;
import org.apache.druid.initialization.DruidModule;

import java.util.Collections;
import java.util.List;

public class BloomFilterExtensionModule implements DruidModule
{

@Override
public List<? extends Module> getJacksonModules()
{
return Collections.singletonList(new BloomFilterSerializersModule());
}

@Override
public void configure(Binder binder)
{

}
}
@@ -0,0 +1,88 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.druid.guice;

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.deser.std.StdDeserializer;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.databind.ser.std.StdSerializer;
import org.apache.druid.query.filter.BloomDimFilter;
import org.apache.hive.common.util.BloomKFilter;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BloomFilterSerializersModule extends SimpleModule
{
public static final String BLOOM_FILTER_TYPE_NAME = "bloom";

public BloomFilterSerializersModule()
{
registerSubtypes(
new NamedType(BloomDimFilter.class, BLOOM_FILTER_TYPE_NAME)
);
addSerializer(BloomKFilter.class, new BloomKFilterSerializer());
addDeserializer(BloomKFilter.class, new BloomKFilterDeserializer());
}

public static class BloomKFilterSerializer extends StdSerializer<BloomKFilter>
{

public BloomKFilterSerializer()
{
super(BloomKFilter.class);
}

@Override
public void serialize(
BloomKFilter bloomKFilter, JsonGenerator jsonGenerator, SerializerProvider serializerProvider
) throws IOException
{
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
BloomKFilter.serialize(byteArrayOutputStream, bloomKFilter);
byte[] bytes = byteArrayOutputStream.toByteArray();
jsonGenerator.writeBinary(bytes);
}
}

public static class BloomKFilterDeserializer extends StdDeserializer<BloomKFilter>
{

protected BloomKFilterDeserializer()
{
super(BloomKFilter.class);
}

@Override
public BloomKFilter deserialize(
JsonParser jsonParser, DeserializationContext deserializationContext
) throws IOException, JsonProcessingException
{
byte[] bytes = jsonParser.getBinaryValue();
return BloomKFilter.deserialize(new ByteArrayInputStream(bytes));

}
}
}