Add ability to pass in Bloom filter from Hive Queries#6222
Add ability to pass in Bloom filter from Hive Queries#6222jon-wei merged 3 commits intoapache:masterfrom
Conversation
6458fd3 to
59d6a07
Compare
|
Hi @nishantmonu51, this is really cool and something that seems useful outside of Hive / Druid interop too. Is there a way to do this without making it so Hive specific? Could the core bloom filter code be extracted into a library and put into a I am thinking of use cases like approximate
It's not exact but with appropriately sized filters it would still be useful for some use cases. |
|
@gianm: sure, I agree its a useful feature in general. |
|
A cool extension.IMO it's better to extract the implementation of BloomKFilter to avoid any possible problems. |
59d6a07 to
ea066c1
Compare
|
@gianm: Updated based on your comments, please review. |
|
@nishantmonu51 first there is no doc about how to use this extension. Secondly, I think you forget to add this extension to |
There was a problem hiding this comment.
would it be better to put this in DimFilterUtils? I think it'd be easier to have all the filter cache IDs in one place, and this is a core extension
There was a problem hiding this comment.
BloomKFilterserializer -> BloomKFilterSerializer
|
Hi @nishantmonu51, do you think this should be a blocking issue for 0.13.0 release? |
|
@nishantmonu51 Can you add docs for this? An example of what a serialized BloomFilter input looks like and what it represents would be helpful IMO. |
There was a problem hiding this comment.
this file has the wrong name, should be org.apache
There was a problem hiding this comment.
This test is located in a folder that does't match the package structure (io/druid/..)
|
@nishantmonu51 can you address comments? Otherwise we can move this to 0.13.1 |
|
Will update the PR today with the fixes. It will be good to get this in 0.13.0 |
ea066c1 to
e8aec5c
Compare
|
@jon-wei @clintropolis @QiuMM Thanks for the review. Added docs and handled review comments. |
|
This travis failure looks legitimate, but 👍 after travis and conflicts are resolved |
fix checkstyle review comments Fix wierd failure review comments Revert "Fix wierd failure" This reverts commit a13a83a.
e8aec5c to
305d083
Compare
|
By the way, I built a bloom filter aggregator on top of this last weekend once I figured out |
| |-------------------------|------------------------------|----------------------------------| | ||
| |`type` |Filter Type. Should always be `bloom`|yes| | ||
| |`dimension` |The dimension to filter over. | yes | | ||
| |`bloomKFilter` |Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes | |
There was a problem hiding this comment.
nits:
"Base64 encoded Binary representation....." ?
Also, It appears to me that this module(and property) are really usable only if someone was using hive util BloomKFilter class on the client side because serialized representation is specific to that class and not something standard, so maybe we could add "hive" in the names everywhere e.g. hive-bloom-filter module and this property being hiveUtilBloomKFilter .
or I'm missing something :)
There was a problem hiding this comment.
updated, yes, its true that the implementation is currently shared from hive-storage-api jar, but the implementation as such does not depend on any of the hive specific functionality.
I initially made this a hive-bloom-filter but later on renamed to bloom-filter based on review comments.
There was a problem hiding this comment.
I think it's better to not have hive in the names. I can imagine evolving this extension to support some non-hive format too, and make it more divorced from hive. Especially since with the aggregator @clintropolis mentioned, people could create serialized bloom filters without really needing to worry about the fact that they came from hive.
There was a problem hiding this comment.
To be clear, if we did evolve it that way, it should still support the hive format (to work with the original use case from @nishantmonu51). I just think there is a lot of potential for this extension to be useful beyond hive integration.
There was a problem hiding this comment.
ah, I understand the naming now. ( I knew I was missing something ).
jihoonson
left a comment
There was a problem hiding this comment.
Looks that teamcity fail isn't related to this PR.
This PR adds a BloomDimFilter which can be used by Apache Hive to pass in BloomFilters.
Use Case -
We have fact table in druid and slowly changing dimension/lookup tables in Apache Hive and need to join those tables.
e.g. Consider the case of SSB Benchmark when lineorder is stored in Druid and parts table is in hive For following query from SSB Benchmark -
In the above query Hive can scan parts table, create a bloom filter for possible values for p_part_key where p_category = 'MFGR#14'. This bloom filter can then be pushed to Druid reducing the data that needs to scanned and transferred between Druid and Hive.
Since BloomFilter is probablistic data structure and can have false positives. Hive will still need to do filtering while processing joins.