implement special distinctcount#2602
implement special distinctcount#2602fjy merged 1 commit intoapache:masterfrom binlijin:distinctcount
Conversation
|
@binlijin how fast is the exact distinct count? |
|
@binlijin can you add a README to the module? Something to help explain how to do the distinct count and also what to watch out for. If you guys use this in production, can you also note it is used in production in the README? |
|
@fjy, the exact distinct count is faster than thetaSketch ( which use size 16384). |
|
@binlijin there's some merge conflicts |
|
@binlijin we'll need some docs on how to use this |
There was a problem hiding this comment.
let's put this in extensions-contrib for now
|
Performance: Take time: |
|
DistinctCount query: Take time: |
|
DistinctCount query: |
|
DistinctCount query: |
There was a problem hiding this comment.
instead of 2 links above can you please document the process of how this extension could be used.
There was a problem hiding this comment.
also pls indicate the assumption that one value has to be present in only one segment or this might overcount.
There was a problem hiding this comment.
i think this might need to encdoe the name and bitmap type too.
|
Do you mind adding an entry for this new extension under and instead of the README.md, create a new file under ? This will make it much easier to version extensions and list them on the Druid webpage |
|
@binlijin also please squash commits @himanshug any more comments? |
|
Also, wait for #2698 to be merged before making doc changes :) |
There was a problem hiding this comment.
you might want to take care of the case where one of the segment does not have fieldName column
see https://github.com/druid-io/druid/blob/master/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SketchAggregatorFactory.java#L77
|
👍 after #2602 (comment) and #2602 (comment) are resolved and commits are squashed. |
|
@fjy i find the link in docs/content/development/extensions.md is wrong, for example: |
|
@binlijin the html file is generated when teh docs are ported over to the webpage, you shouldn't need to fix anything |
|
@binlijin for example, under docs/content, you can run |
|
@fjy, ok, so i need to minor change this patch. |
|
@fjy @himanshug i think all comments have been resolved. |
|
👍 for me |
| (2) Second use distinctCount to calculate exact distinct count, make sure queryGranularity is divide exactly by segmentGranularity or else the result will be wrong. | ||
| There is some limitations, when use with groupBy, the groupBy keys' numbers should not exceed maxIntermediateRows in every segment, if exceed the result will wrong. And when use with topN, numValuesPerPass should not too big, if too big the distinctCount will use many memory and cause the JVM out of service. | ||
|
|
||
| This has been used in production. |
There was a problem hiding this comment.
I don't think this line is needed.
There was a problem hiding this comment.
i think it is useful to know this
|
Approach looks good. If you are wanting JUST counts, this can be sped up by implementing this functionality at a query level, and simply returning the cardinality of the resulting filter. |
|
Strongly suggest fixing #2602 (comment) but the rest is at the author's discretion. 👍 |
|
@binlijin implementation of |
|
@binlijin When I used distinctCount aggregation type, I find some bugs.For example,when dimensions size is two,the result will be wrong.I don't know if it is a bug. |
Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First use #2570 to partition data by visitor_id.
(2) Second use distinctcount to calculate exact UV.
There is some limitations, when use with groupBy, the groupBy keys' numbers should not exceed maxIntermediateRows in every segment, if exceed the result will wrong. And when use with topN, numValuesPerPass should not too big, if too big the distinctcount will use many memory and cause the JVM out of service.