Motivation
t-digest (https://github.com/tdunning/t-digest) is a popular data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. It is also designed for parallel use cases such as distributed aggregations or MapReduce jobs, because combining two intermediate t-digests is easy and efficient.
Various other projects, such as Apache Mahout, stream-lib and Elasticsearch, have adopted t-digest. It would be good to add t-digest based aggregators to Druid as well. They would be complementary to the existing approximate sketch algorithms in Druid, such as moments and Yahoo quantile sketches.
Proposed changes
A new module called druid-tdigestsketch will be added in the extension-contrib module. The proposal is to add the following aggregators:
- buildTDigestSketch - this aggregator will generate t-digest based sketches over numeric values. It would generally be used during the indexing phase, where a pre-aggregated sketch over a metric's values is created. It could also be used to generate sketches on the fly at query time.
- mergeTDigestSketch - this aggregator will combine existing t-digest based sketches. It would generally be used at query time to combine sketches generated during the indexing phase by the buildTDigestSketch aggregator.
- quantilesFromTDigestSketch - this post-aggregator will take an array of fractions and compute the corresponding quantiles from the t-digest sketches produced by the above two aggregators.
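To make the build / merge / quantile lifecycle concrete, here is a minimal Python sketch. It is only an illustration: ToyDigest is a hypothetical stand-in and is not the t-digest algorithm or the proposed Druid API. It keeps a bounded list of weighted centroids and merges the closest neighbours when it overflows, whereas the real t-digest uses a scale function to keep small centroids near the tails. The mapping assumed here is add() for buildTDigestSketch, merge() for mergeTDigestSketch, and quantile() for quantilesFromTDigestSketch.

```python
from bisect import insort


class ToyDigest:
    """Toy mergeable quantile sketch (hypothetical, NOT the real t-digest)."""

    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of [mean, weight]

    def add(self, value, weight=1.0):
        """Fold one value into the sketch (the 'build' step)."""
        insort(self.centroids, [float(value), float(weight)])
        if len(self.centroids) > self.max_centroids:
            self._compress()

    def _compress(self):
        # Merge the two adjacent centroids with the closest means until the
        # sketch fits its size bound again.
        while len(self.centroids) > self.max_centroids:
            gaps = [
                self.centroids[i + 1][0] - self.centroids[i][0]
                for i in range(len(self.centroids) - 1)
            ]
            i = gaps.index(min(gaps))
            (m1, w1), (m2, w2) = self.centroids[i], self.centroids[i + 1]
            w = w1 + w2
            self.centroids[i:i + 2] = [[(m1 * w1 + m2 * w2) / w, w]]

    def merge(self, other):
        """Fold another sketch into this one (the 'merge' step)."""
        for mean, weight in other.centroids:
            self.add(mean, weight)
        return self

    def quantile(self, fraction):
        """Return the mean of the first centroid whose cumulative weight
        reaches fraction * total weight (no interpolation)."""
        if not self.centroids:
            raise ValueError("empty sketch")
        target = fraction * sum(w for _, w in self.centroids)
        seen = 0.0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= target:
                return mean
        return self.centroids[-1][0]


def build(values, max_centroids=100):
    """Build one sketch per 'segment', as would happen at indexing time."""
    digest = ToyDigest(max_centroids)
    for v in values:
        digest.add(v)
    return digest


# Two segments sketched independently, then combined at query time.
segment_a = build(range(1, 51))     # values 1..50
segment_b = build(range(51, 101))   # values 51..100
merged = ToyDigest().merge(segment_a).merge(segment_b)

# Post-aggregation: an array of fractions in, an array of quantiles out.
quantiles = [merged.quantile(f) for f in (0.5, 0.99)]
print(quantiles)  # [50.0, 99.0] (exact here because nothing was compressed)
```

The property that matters for Druid is that merge() needs only the two sketches, never the raw rows, which is what lets sketches built at indexing time be combined across segments at query time.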
Rationale
At my work, various data engineering teams have been using t-digest based sketch aggregations both in and outside of Druid. They have found it to be a good fit for their various use cases.
Operational impact
No operational impact.
Test plan (optional)
The performance and correctness of t-digest have already been tested extensively in the existing literature. Beyond unit tests, the plan is to verify on a dev Druid cluster that the results returned by these aggregators are similar to those of t-digest aggregations in other frameworks such as Spark and MapReduce.
Future work (optional)
- Add SQL support
- If a new release of the t-digest library changes the serialization format, it would be tricky to make the old and new versions interoperable. One option would be to write a new module every time the t-digest library is updated. Otherwise, we would need to devise a scheme for versioning aggregators.
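
One possible shape for such a versioning scheme is a version byte in front of every serialized sketch, with the reader dispatching on it. The sketch below is hypothetical: both payload layouts (v1 and v2) are invented for illustration and do not correspond to any actual t-digest serialization format.

```python
import struct


def _decode_v1(payload):
    # Hypothetical v1 layout: big-endian doubles, one mean per centroid,
    # with an implied weight of 1.0.
    n = len(payload) // 8
    return [(mean, 1.0) for mean in struct.unpack(f">{n}d", payload)]


def _decode_v2(payload):
    # Hypothetical v2 layout: big-endian doubles as (mean, weight) pairs.
    n = len(payload) // 8
    flat = struct.unpack(f">{n}d", payload)
    return list(zip(flat[0::2], flat[1::2]))


# Readers for every format version that may still exist in old segments.
DECODERS = {1: _decode_v1, 2: _decode_v2}


def encode_v2(centroids):
    """Always write the newest format, prefixed with its version byte."""
    flat = [x for pair in centroids for x in pair]
    return struct.pack(">B", 2) + struct.pack(f">{len(flat)}d", *flat)


def decode(blob):
    """Read the version byte, then dispatch to the matching decoder."""
    version = blob[0]
    if version not in DECODERS:
        raise ValueError(f"unsupported sketch format version {version}")
    return DECODERS[version](blob[1:])


# A sketch written by an older module remains readable next to new ones.
old_blob = struct.pack(">B", 1) + struct.pack(">2d", 1.5, 2.5)
new_blob = encode_v2([(1.0, 2.0), (3.0, 4.0)])
```

With this layout, decoding old and new blobs yields the same in-memory centroid list, so merging can stay version-agnostic; the cost is that every historical decoder has to be kept around.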