Exact Cardinality Count extension #18021
Pull Request Overview
This pull request adds a new Druid extension (“druid-exact-cardinality”) that provides an exact distinct count aggregator based on RoaringBitmap64. The changes introduce new aggregator factories (build and merge), buffer aggregators, SQL support, JSON serialization/deserialization, and the corresponding module registration.
Reviewed Changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| META-INF/services/org.apache.druid.initialization.DruidModule | Registers the new extension module. |
| Bitmap64ExactCardinalitySqlAggregator.java | Implements the SQL aggregator function for exact cardinality counting. |
| RoaringBitmap64Counter*.java | Adds the RoaringBitmap64Counter and its JSON serializer for precise distinct count handling. |
| Bitmap64ExactCardinalityPostAggregator.java | Provides a post-aggregator to extract the cardinality count. |
| Bitmap64ExactCardinalityObjectStrategy.java | Implements the object strategy for serializing/deserializing the bitmap counter. |
| Bitmap64ExactCardinalityModule.java | Registers subtypes and serdes with the Druid framework. |
| Bitmap64ExactCardinalityAggregator.java & BufferAggregators | Add both build and merge aggregator implementations along with their factories and buffer aggregator variants. |
| pom.xml (extension and distribution) | Defines the Maven project and includes the new extension in the Druid distribution. |
| README.md | Documents how the extension works and provides usage examples. |
FrankChen021 left a comment
The implementation itself generally looks good to me. I think we need more documentation and test cases.
Documentation
- A user doc that explains how to use this extension, including the build and merge functions it provides
- An explanation of the SQL function, with some SQL query examples as well as native query examples
- How to use this extension during the ingestion phase
Test cases
I see that many unit tests (UTs) have been added. I think we also need some integration tests (ITs) that cover the full path from ingestion to query (both native queries and SQL queries).
Revert "Add wikipedia datasource walkthrough"
This reverts commit 83dfef9.
Revert "Add SQL Test"
This reverts commit e81a0fdc2f07b71958bc32734abe16bacbac920d.
FrankChen021 left a comment
The druid-exact-count name is too generic. Can we change it to druid-exact-count-bitmap?
            .buildMMappedIndex();

        return walker.add(
            DataSegment.builder()
Check notice: Code scanning / CodeQL: Deprecated method or constructor invocation (Note, test)
Description
This PR introduces the `druid-exact-count` extension, providing a new aggregation function for computing the exact distinct count of values within a dimension. Unlike approximate estimators like HyperLogLog, this extension guarantees precision, which is crucial for use cases demanding exact figures.
The patch achieves this by leveraging RoaringBitmap, a data structure optimized for storing and manipulating sets of 64-bit integers with good compression and performance. The extension includes the necessary components for integrating this functionality into Druid's query processing engine.
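To illustrate why an exact distinct count has to carry the set of values itself through the build and merge steps, here is a minimal sketch. It is not the code from this PR: a plain `HashSet<Long>` stands in for the 64-bit Roaring bitmap, and the names `ExactCountSketch`, `ExactCounter`, `add`, `fold`, `getCardinality`, and `build` are hypothetical, chosen only for this example.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative stand-in only; NOT the PR's RoaringBitmap64Counter. */
public class ExactCountSketch {

    /** Hypothetical counter: keeps every distinct long, so the count is exact. */
    static final class ExactCounter {
        // Assumption: a HashSet replaces the compressed 64-bit bitmap for simplicity.
        private final Set<Long> values = new HashSet<>();

        /** "Build" step: record one raw value; duplicates are absorbed by the set. */
        void add(long value) {
            values.add(value);
        }

        /** "Merge" step: union another partial result into this one. */
        void fold(ExactCounter other) {
            values.addAll(other.values);
        }

        /** Number of distinct values seen so far. */
        long getCardinality() {
            return values.size();
        }
    }

    /** Builds a counter from raw values, as a single partial aggregation might. */
    static ExactCounter build(long... rawValues) {
        ExactCounter counter = new ExactCounter();
        for (long v : rawValues) {
            counter.add(v);
        }
        return counter;
    }

    public static void main(String[] args) {
        ExactCounter partA = build(1L, 2L, 3L, 3L); // distinct: 1, 2, 3
        ExactCounter partB = build(3L, 4L);         // distinct: 3, 4

        // Adding partial counts double-counts the shared value 3.
        long naiveSum = partA.getCardinality() + partB.getCardinality();

        // Union of the underlying sets gives the exact answer.
        partA.fold(partB);

        System.out.println("sum of partial counts: " + naiveSum);               // 5
        System.out.println("merged exact count: " + partA.getCardinality());    // 4
    }
}
```

The point of the sketch is the merge step: partial results must be combined as sets (a bitmap union in the real extension), because partial counts alone cannot be summed without double-counting values that appear in more than one part.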
Exact Count Aggregation
The core of this PR is the implementation of an exact count aggregator using RoaringBitmap64.
The aggregator is invoked via the `bitmap64ExactCount` type in native queries or a corresponding SQL function. It's designed to ingest values from any dimension as long as they are of type `long`.
Integration Tests
`./it.sh` was changed to also include `extension-contrib` packages when building the Druid image.
Differences with Distinct Count Aggregator
Release note
Introduced a new extension `druid-exact-count` which provides an aggregator `BITMAP64_EXACT_COUNT(columnName)` for computing exact distinct counts on numerical columns.
Key changed/added classes in this PR
- `Bitmap64ExactCountAggregatorFactory`
- `Bitmap64ExactCountBuildAggregatorFactory`
- `Bitmap64ExactCountMergeAggregatorFactory`
- `Bitmap64ExactCountBuildAggregator`
- `Bitmap64ExactCountMergeAggregator`
- `Bitmap64ExactCountBuildBufferAggregator`
- `Bitmap64ExactCountMergeBufferAggregator`
- `RoaringBitmap64Counter`
- `Bitmap64` interface, for extensibility when newer/faster bitmap functions are introduced.
- `Bitmap64ExactCountBuildComplexMetricSerde`
- `Bitmap64ExactCountMergeComplexMetricSerde`
- `Bitmap64ExactCountModule`
- `Bitmap64ExactCountPostAggregator`
- `Bitmap64ExactCountSqlAggregator`
- `it.sh`
- `druid-exact-count.md`
This PR has: