extension for exactly distinct count for single long type dimension:accurate-cardinality#8185
elloooooo wants to merge 6 commits into apache:master
Conversation
Sorry for the delayed review. I'll look through this PR soon. From the PR description, I'm wondering whether the SQL syntax could be better. I guess it might not be easy to use this feature if users have tons of segments. My question is: can we keep the same SQL syntax and add a new query context flag to enable this feature? For example, it could be something like this. {
"query": "select count(distinct pid) from DATASOURCE where col='val'",
"context": {
"accurateCardinality": true
}
}
Also, please update the PR description in as much detail as possible based on https://github.com/apache/incubator-druid/blob/master/.github/pull_request_template.md. The PR template now has a section for explaining design choices, which can help others understand this PR better.
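The suggested query context flag could be consulted roughly as sketched below. This is illustrative only: the `accurateCardinality` key is the reviewer's proposal, and the helper class is a hypothetical sketch, not part of Druid's actual query API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: how a broker-side component might read the proposed
// "accurateCardinality" query context flag. Not Druid's actual API.
public class ContextFlag
{
  public static boolean useAccurateCardinality(Map<String, Object> context)
  {
    if (context == null) {
      return false;
    }
    // Context values arrive via JSON, so the flag may be a Boolean or a String.
    Object value = context.get("accurateCardinality");
    if (value instanceof Boolean) {
      return (Boolean) value;
    }
    return value != null && Boolean.parseBoolean(value.toString());
  }

  public static void main(String[] args)
  {
    Map<String, Object> ctx = new HashMap<>();
    ctx.put("accurateCardinality", true);
    System.out.println(useAccurateCardinality(ctx)); // prints "true"
  }
}
```

The advantage of a context flag over new SQL syntax is that the same query text works with or without the feature enabled.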
Oh, maybe I was confused. From the documentation, it looks like you added a new
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
@elloooooo Hi, have you run any stress tests for this PR?
Description
Currently, Druid offers exact cardinality for a dimension via a nested group-by query.
Conceptually, it works as follows:
For a SQL query like:
the actual query issued will be like:
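Setting the SQL examples aside, the nested group-by approach can be sketched as follows. This is an illustrative simulation, not Druid code: each segment's inner group-by emits its distinct dimension values, and the broker unions them before the outer query counts the merged set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch (not Druid's implementation) of exact distinct count
// via nested group-by: the inner group-by on each segment yields that
// segment's distinct values, which the broker merges and counts.
public class NestedGroupBy
{
  public static long exactDistinct(List<Set<String>> perSegmentDistinct)
  {
    Set<String> merged = new HashSet<>();
    for (Set<String> segment : perSegmentDistinct) {
      merged.addAll(segment); // broker-side merge of inner group-by results
    }
    return merged.size();     // outer COUNT(*) over the merged groups
  }

  public static void main(String[] args)
  {
    List<Set<String>> segments = Arrays.asList(
        new HashSet<>(Arrays.asList("a", "b")),
        new HashSet<>(Arrays.asList("b", "c"))
    );
    System.out.println(exactDistinct(segments)); // prints "3"
  }
}
```

Note that the merged set holds every distinct value verbatim, so its size, and hence the data transferred to the broker, grows linearly with cardinality.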
For high-cardinality dimensions, the result set transferred from historical nodes to the broker can be very large, leading to poor performance.
So this extension uses a bitmap (a 64-bit RoaringBitmap) as the container for result data on the historicals, reducing the data transferred from historicals to the broker.
Performance can be 10 times better than the nested group-by method.
But since no global dictionary is built during realtime ingestion, this PR only supports a single long-type dimension, whose values can be put into the bitmap directly.
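The bitmap idea can be sketched as below. This is illustrative only: `java.util.BitSet` stands in for the PR's 64-bit RoaringBitmap and handles only small non-negative values, but the principle is the same: a long value maps directly to a bit position, so no dictionary is needed, which is why only long-type dimensions are supported.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch (not the PR's code): each historical encodes the long
// dimension values it sees into a bitmap; the broker ORs the bitmaps and
// takes the cardinality. BitSet stands in for a 64-bit RoaringBitmap.
public class BitmapDistinct
{
  public static BitSet segmentBitmap(long[] values)
  {
    BitSet bits = new BitSet();
    for (long v : values) {
      bits.set((int) v); // the value itself is the bit position
    }
    return bits;
  }

  public static long exactDistinct(List<BitSet> perSegmentBitmaps)
  {
    BitSet merged = new BitSet();
    for (BitSet segment : perSegmentBitmaps) {
      merged.or(segment); // cheap broker-side union of compact bitmaps
    }
    return merged.cardinality();
  }

  public static void main(String[] args)
  {
    List<BitSet> segments = Arrays.asList(
        segmentBitmap(new long[]{1, 2, 7}),
        segmentBitmap(new long[]{2, 7, 9})
    );
    System.out.println(exactDistinct(segments)); // prints "4"
  }
}
```

Compared to the nested group-by sketch, the per-segment payload here is a compressed bitmap rather than the full list of distinct values, which is where the transfer savings come from.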
It also adds an ACCURATE_CARDINALITY() function in SQL, like:
This PR follows up on PR #6768. Sorry for letting the previous PR go stale.
This PR has: