Skip to content

extension for exactly distinct count for single long type dimension:accurate-cardinality#8185

Closed
elloooooo wants to merge 6 commits intoapache:masterfrom
elloooooo:feature_accurate_distinctcount
Closed

extension for exactly distinct count for single long type dimension:accurate-cardinality#8185
elloooooo wants to merge 6 commits intoapache:masterfrom
elloooooo:feature_accurate_distinctcount

Conversation

@elloooooo
Copy link
Copy Markdown
Contributor

@elloooooo elloooooo commented Jul 29, 2019

Description

Now, druid offers the ability of getting accurate cardinality for dimension by nested group by query.
It can be described as follows conceptually:
For a sql like:

select count(distinct pid) from DATASOURCE where col="val"

the exactly query will be like:

select count(*) from (
  select pid from DATASOURCE_segments_in_historical1 where col="val" group by pid
  UNION ALL
  select pid from DATASOURCE_segments_in_historical2 where col="val" group by pid
  UNION ALL
  select pid from DATASOURCE_segments_in_historical3 where col="val" group by pid
  ...
) group by pid

For high cardinality case, the size of result transfered from historical node to broker node can be really large and leads to poor performance.
So this extension try to use bitmap(64bit RoaringBitmap) as the container for the result data from the historical to reduce the data tranfered from historical to broker.
The performance can be 10 times better the nested group by method.

image

But as there isn't globlal dictionary built in realtime ingestion, the PR can just support for single long dimension which can be put into bitmap directly.

And add a ACCURATE_CARDINALITY() method in SQL like:

select ACCURATE_CARDINALITY(pid) from DATASOURCE where col="val"

This PR is after PR #6768. Sorry for making the previous PR staled.


This PR has:

  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.

@jihoonson
Copy link
Copy Markdown
Contributor

Sorry for the delayed review. I'll look through this PR soon.

From the PR description, I'm wondering the SQL syntax could be better. I guess it might not easy to use this feature if they have tons of segments. My question is, can we use the same SQL syntax and add a new query context to enable this feature? For example, it could be something like this.

{
  "query": "select count(distinct pid) from DATASOURCE where col='val'",
  "context": {
    "accurateCardinarlity": true
  }
}

Also, please update the PR description as detailed as possible based on https://github.com/apache/incubator-druid/blob/master/.github/pull_request_template.md. The PR template now has a section to explain design choices which can help others to understand this PR better.

@jihoonson
Copy link
Copy Markdown
Contributor

Oh maybe I was confused. From the documentation, looks like you aded a new ACCURATE_CARDINALITY method for this feature. Then it looks good to me.

@stale
Copy link
Copy Markdown

stale Bot commented Oct 12, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label Oct 12, 2019
@stale
Copy link
Copy Markdown

stale Bot commented Nov 9, 2019

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale Bot closed this Nov 9, 2019
@tomscut
Copy link
Copy Markdown
Contributor

tomscut commented Dec 5, 2019

@elloooooo Hi, have you had any stress tests for this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants