
Exact Cardinality Count extension#18021

Merged
FrankChen021 merged 71 commits into apache:master from GWphua:bitmap64-extension on Jul 23, 2025

@GWphua GWphua commented May 20, 2025

Description

This PR introduces the druid-exact-count extension, which provides a new aggregation function for computing the exact distinct count of values within a dimension. Unlike approximate estimators such as HyperLogLog, this extension returns exact results, which is crucial for use cases that cannot tolerate estimation error.

The patch achieves this by leveraging the 64-bit bitmap support in RoaringBitmap, a compressed bitmap library that stores and manipulates sets of 64-bit integers with good compression and performance. The extension includes the necessary components for integrating this functionality into Druid's query processing engine.
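
To illustrate why the extension needs separate build and merge steps, here is a minimal sketch. It is not the extension's code: a plain `HashSet<Long>` stands in for the 64-bit Roaring bitmap, and the class and method names (`ExactCountSketch`, `build`, `merge`) are hypothetical. The point is only that per-segment distinct counts cannot be summed; the partial sets have to be unioned before taking the cardinality.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only. A HashSet<Long> stands in for the
// 64-bit bitmap; this is not the extension's RoaringBitmap64Counter.
final class ExactCountSketch {

    // "Build" step: fold the raw long values of one segment into a set.
    static Set<Long> build(long... values) {
        Set<Long> seen = new HashSet<>();
        for (long v : values) {
            seen.add(v);
        }
        return seen;
    }

    // "Merge" step: union two partial results without mutating either input.
    static Set<Long> merge(Set<Long> left, Set<Long> right) {
        Set<Long> union = new HashSet<>(left);
        union.addAll(right);
        return union;
    }

    public static void main(String[] args) {
        Set<Long> segmentA = build(1L, 2L, 3L, 3L); // distinct values: 1, 2, 3
        Set<Long> segmentB = build(3L, 4L);         // distinct values: 3, 4

        // Summing the per-segment counts would double count the value 3.
        System.out.println("naive sum = " + (segmentA.size() + segmentB.size()));
        System.out.println("exact     = " + merge(segmentA, segmentB).size());
    }
}
```

A compressed bitmap plays the same role as the set here, but with a much smaller memory footprint when the values are dense or clustered.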

Exact Count Aggregation

The core of this PR is the implementation of an exact count aggregator using RoaringBitmap64.

  • Behavioral aspects:
    The aggregator is invoked via the bitmap64ExactCount type in native queries or the BITMAP64_EXACT_COUNT SQL function. It accepts values from any dimension, as long as they are of type long.
    • Configuration is minimal.
    • Empty inputs or all-null inputs correctly result in a count of 0.
    • String columns are not supported.
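
The empty-input and all-null behavior listed above can be sketched as follows. This is an assumption-labeled illustration, not the extension's aggregator: `NullSafeExactCount` and `exactCount` are hypothetical names, a `HashSet<Long>` again stands in for the bitmap, and the sketch assumes null values are simply skipped rather than counted.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; assumes nulls are skipped, not counted.
final class NullSafeExactCount {

    // Counts the distinct non-null long values in a nullable column.
    static long exactCount(List<Long> columnValues) {
        Set<Long> seen = new HashSet<>();
        for (Long value : columnValues) {
            if (value != null) {
                seen.add(value);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Long> empty = Collections.emptyList();
        List<Long> allNull = Arrays.asList((Long) null, null);
        List<Long> mixed = Arrays.asList(5L, null, 5L, 7L);

        System.out.println(exactCount(empty));   // nothing to count
        System.out.println(exactCount(allNull)); // nulls are skipped
        System.out.println(exactCount(mixed));   // distinct values are 5 and 7
    }
}
```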

Integration Tests

  • Changed ./it.sh to also include extension-contrib packages when building the Druid image.
  • Added an integration test for druid-exact-count.

Differences with Distinct Count Aggregator

| Exact Count | Distinct Count |
| --- | --- |
| No prerequisites such as configuring hash partitioning or segment granularity | Prerequisites needed to perform aggregation |
| Works on 64-bit number columns only (BIGINT) | Works on dimension columns (including strings, complex types, etc.) |

Release note

Introduced a new extension druid-exact-count which provides an aggregator BITMAP64_EXACT_COUNT(columnName) for computing exact distinct counts on numerical columns.


Key changed/added classes in this PR
  • Bitmap64ExactCountAggregatorFactory
  • Bitmap64ExactCountBuildAggregatorFactory
  • Bitmap64ExactCountMergeAggregatorFactory
  • Bitmap64ExactCountBuildAggregator
  • Bitmap64ExactCountMergeAggregator
  • Bitmap64ExactCountBuildBufferAggregator
  • Bitmap64ExactCountMergeBufferAggregator
  • RoaringBitmap64Counter
  • Bitmap64 interface, for extensibility when newer/faster bitmap functions are introduced.
  • Bitmap64ExactCountBuildComplexMetricSerde
  • Bitmap64ExactCountMergeComplexMetricSerde
  • Bitmap64ExactCountModule
  • Bitmap64ExactCountPostAggregator
  • Bitmap64ExactCountSqlAggregator
  • it.sh
  • Docs @ druid-exact-count.md
  • Integration Test files

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@FrankChen021 FrankChen021 requested a review from Copilot May 20, 2025 10:52

Copilot AI left a comment

Pull Request Overview

This pull request adds a new Druid extension (“druid-exact-cardinality”) that provides an exact distinct count aggregator based on RoaringBitmap64. The changes introduce new aggregator factories (build and merge), buffer aggregators, SQL support, JSON serialization/deserialization, and the corresponding module registration.

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| META-INF/services/org.apache.druid.initialization.DruidModule | Registers the new extension module. |
| Bitmap64ExactCardinalitySqlAggregator.java | Implements the SQL aggregator function for exact cardinality counting. |
| RoaringBitmap64Counter*.java | Adds the RoaringBitmap64Counter and its JSON serializer for precise distinct count handling. |
| Bitmap64ExactCardinalityPostAggregator.java | Provides a post-aggregator to extract the cardinality count. |
| Bitmap64ExactCardinalityObjectStrategy.java | Implements the object strategy for serializing/deserializing the bitmap counter. |
| Bitmap64ExactCardinalityModule.java | Registers subtypes and serdes with the Druid framework. |
| Bitmap64ExactCardinalityAggregator.java & BufferAggregators | Add both build and merge aggregator implementations along with their factories and buffer aggregator variants. |
| pom.xml (extension and distribution) | Defines the Maven project and includes the new extension in the Druid distribution. |
| README.md | Documents how the extension works and provides usage examples. |


@FrankChen021 FrankChen021 left a comment

The implementation itself generally looks good to me. I think we need more documentation and test cases.

Document

  1. A user doc that explains how to use this extension, including the build and merge functions it provides.
  2. An explanation of the SQL function, with some SQL query examples as well as native query examples.
  3. How this extension can be used during the ingestion phase.

Test Case

I see that many unit test cases have been added. I think we also need some integration test cases that cover the flow from ingestion to query (both native queries and SQL queries).

GWphua added 7 commits May 22, 2025 18:28
Revert "Add wikipedia datasource walkthrough"

This reverts commit 83dfef9.

Revert "Add SQL Test"

This reverts commit e81a0fdc2f07b71958bc32734abe16bacbac920d.

@FrankChen021 FrankChen021 left a comment

The druid-exact-count name is too generic. Can we change it to druid-exact-count-bitmap?


@FrankChen021 FrankChen021 left a comment

LGTM

.buildMMappedIndex();

return walker.add(
DataSegment.builder()

Check notice: Code scanning / CodeQL

Deprecated method or constructor invocation (Note, test)

Invoking DataSegment.builder should be avoided because it has been deprecated.
@FrankChen021 FrankChen021 merged commit f9575d5 into apache:master Jul 23, 2025
78 checks passed
@GWphua GWphua deleted the bitmap64-extension branch September 7, 2025 13:58
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025
5 participants