extension for exactly distinct count for single long type dimension:accurate-cardinality#6768
extension for exactly distinct count for single long type dimension:accurate-cardinality#6768elloooooo wants to merge 2 commits intoapache:masterfrom
Conversation
| implements AccurateCardinalityAggregatorColumnSelectorStrategy<DimensionSelector> | ||
| { | ||
| @Override | ||
| public long getUniversalUniqueCode(DimensionSelector dimSelector) |
There was a problem hiding this comment.
Is this truly universal? I'm not familiar with the classes we're interacting with but I'm curious if distinct values from different segments could return colliding ints here.
org.apache.druid.query.aggregation.cardinality.types.StringCardinalityAggregatorColumnSelectorStrategy#hashValues looks up the actual value of the dimension before adding it to the cardinality aggregator.
There was a problem hiding this comment.
Thank you for your reply. Because of lack of globle dict, cardinality of string type dimension can not be computed correctly as you said. So this extension can only support long type dimension.
There was a problem hiding this comment.
Maybe this code should be removed if it is invalid?
clintropolis
left a comment
There was a problem hiding this comment.
Very sorry for the delayed review, finally got around to making an initial pass. I think overall concept makes sense, will have another look very soon.
| <inspection_tool class="DuplicateBooleanBranch" enabled="true" level="ERROR" enabled_by_default="true" /> | ||
| <inspection_tool class="DuplicateCondition" enabled="true" level="ERROR" enabled_by_default="true"> | ||
| <option name="ignoreSideEffectConditions" value="true" /> | ||
| <option name="ignoreMethodCalls" value="false" /> |
There was a problem hiding this comment.
These changes seem unrelated to this PR?
There was a problem hiding this comment.
These changes seem unrelated to this PR?
Yes, I'll remove the change.
| <component name="InspectionProjectProfileManager"> | ||
| <settings> | ||
| <option name="PROJECT_PROFILE" value="Druid" /> | ||
| <option name="USE_PROJECT_PROFILE" value="true" /> |
|
|
||
| public class AccurateCardinalityAggregatorFactory extends AggregatorFactory | ||
| { | ||
| private static final AccurateCardinalityAggregatorColumnSelectorStrategyFactory STRATEGY_FACTORY = |
There was a problem hiding this comment.
I think we are trying to get away from using the ColumnSelectorStrategyFactory pattern when possible in aggregators, see #6909 for details and #6397 (comment) for potential ideas of how to refactor.
|
|
||
| import java.nio.ByteBuffer; | ||
|
|
||
| public interface BitmapFactory |
There was a problem hiding this comment.
Suggest using a different name to avoid confusion with org.apache.druid.collections.bitmap.BitmapFactory, maybe LongBitmapFactory to be similar to other Druid class naming conventions.
| * This class is meant to represent a simple wrapper around an immutable bitmap | ||
| * class. | ||
| */ | ||
| public interface ImmutableBitmap |
There was a problem hiding this comment.
Likewise, suggest using LongImmutableBitmap or something like here to distinguish from processing library types.
There was a problem hiding this comment.
Actually, I've pulled a copy of this branch and.had a closer look and some time to think about it a bit. The only reason for all of this abstraction is if this extension needed to support using multiple bitmap implementations for computing the distinct count, but it only seems to be using roaring. I think it might be better to drop all of this extra structure and just using roaring bitmap directly, either into a concrete collector, also without all of the interface infrastructure, or just pull it all the way up into the aggregators themselves. I think this would significantly simplify the implementation as well.
Is there any reason to support multiple bitmap implementations that I am missing?
|
|
||
| import java.nio.ByteBuffer; | ||
|
|
||
| public class RoaringBitmapFactory implements BitmapFactory |
There was a problem hiding this comment.
Same comment about renaming to distinguish between processing module.
|
|
||
| import java.nio.ByteBuffer; | ||
|
|
||
| public interface Collector extends Comparable<Collector> |
There was a problem hiding this comment.
Collector is a bit generic of a name, maybe LongBitmapCollector?
| implements AccurateCardinalityAggregatorColumnSelectorStrategy<DimensionSelector> | ||
| { | ||
| @Override | ||
| public long getUniversalUniqueCode(DimensionSelector dimSelector) |
There was a problem hiding this comment.
Maybe this code should be removed if it is invalid?
| @Override | ||
| public int getMaxIntermediateSize() | ||
| { | ||
| return 1024; |
There was a problem hiding this comment.
Where did this number come from?
| @Override | ||
| public int getMaxIntermediateSize() | ||
| { | ||
| return 1024; |
There was a problem hiding this comment.
Where did this number come from?
| public static final byte LONG_LAST_CACHE_TYPE_ID = 0x18; | ||
| public static final byte TIMESTAMP_CACHE_TYPE_ID = 0x19; | ||
| public static final byte VARIANCE_CACHE_TYPE_ID = 0x1A; | ||
| public static final byte ACCURATE_CARDINALITY_CACHE_TYPE_ID = 0x1B; |
There was a problem hiding this comment.
These cache ids are already in use, they all need to be distinct. Hopefully soon we can create a mechanism to make this a less manual process with #6823, but for now we need to keep this correct by hand. 0x35 is the current highest, but there is another aggregator PR that is using up to 0x37 so you should probably just start with 0x38.
| * This class is meant to represent a simple wrapper around an immutable bitmap | ||
| * class. | ||
| */ | ||
| public interface ImmutableBitmap |
There was a problem hiding this comment.
Actually, I've pulled a copy of this branch and.had a closer look and some time to think about it a bit. The only reason for all of this abstraction is if this extension needed to support using multiple bitmap implementations for computing the distinct count, but it only seems to be using roaring. I think it might be better to drop all of this extra structure and just using roaring bitmap directly, either into a concrete collector, also without all of the interface infrastructure, or just pull it all the way up into the aggregators themselves. I think this would significantly simplify the implementation as well.
Is there any reason to support multiple bitmap implementations that I am missing?
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
reactive |
|
Can I revive the PR? |
Now, Druid offers the ability by nested group by query.
Its logic can be described as follows:
For a sql like:
the exactly query will be like:
For high cardinality case, the size of result transfered from historical node to broker node can be really large and leads to poor performance.
So this extension try to use bitmap(64bit RoaringBitmap) as the container for the result data from the historical.
The performance can be 10 times better the nested group by method.
Welcome suggestions to this MR.