[PROPOSAL] Add support for t-digest backed aggregators #7303

@samarthjain

Motivation

t-digest (https://github.com/tdunning/t-digest) is a popular data structure for accurate online accumulation of rank-based statistics such as quantiles and trimmed means. It is also designed for parallel use cases such as distributed aggregations and MapReduce jobs, because combining two intermediate t-digests is easy and efficient.

Various other projects, such as Apache Mahout, stream-lib and Elasticsearch, have adopted t-digest. It would be good to add t-digest based aggregators to Druid as well. They would be complementary to the existing approximate sketch algorithms in Druid, such as the moments sketch and the Yahoo quantiles sketch.

Proposed changes

A new module called druid-tdigestsketch will be added under the extensions-contrib module. The proposal is to add the following aggregators:

  1. buildTDigestSketch - this aggregator will generate t-digest based sketches over numeric values. It would generally be used during the indexing phase, where a pre-aggregated sketch over a metric's values is created. It could also be used to generate sketches on the fly at query time.
  2. mergeTDigestSketch - this aggregator will combine existing t-digest based sketches. It will generally be used at query time to combine the sketches generated during the indexing phase by the buildTDigestSketch aggregator.
  3. quantilesFromTDigestSketch - this post aggregator will take an array of fractions and compute the corresponding quantiles from the t-digest sketches produced by the above two aggregators.
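
The build / merge / quantile flow described above can be illustrated with a small, self-contained Python sketch. It deliberately does not implement the real t-digest algorithm: `ToySketch` is a hypothetical, simplified centroid sketch that only mimics the mergeable behaviour, and the helper names `build_sketch`, `merge_sketches` and `quantiles_from_sketch` are illustrative stand-ins for the three proposed aggregators, not Druid or t-digest APIs.

```python
import bisect


class ToySketch:
    """Toy mergeable quantile sketch. NOT the real t-digest algorithm."""

    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of (mean, weight) tuples

    def add(self, value, weight=1.0):
        # Insert the point as its own centroid, then shrink if over budget.
        bisect.insort(self.centroids, (float(value), float(weight)))
        self._compress()

    def merge(self, other):
        # Merging is just re-adding the other sketch's weighted centroids.
        for mean, weight in other.centroids:
            self.add(mean, weight)

    def quantile(self, q):
        if not self.centroids:
            raise ValueError("empty sketch")
        total = sum(w for _, w in self.centroids)
        seen = 0.0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= q * total:
                return mean
        return self.centroids[-1][0]

    def _compress(self):
        # Repeatedly fuse the two adjacent centroids whose means are closest.
        while len(self.centroids) > self.max_centroids:
            i = min(
                range(len(self.centroids) - 1),
                key=lambda k: self.centroids[k + 1][0] - self.centroids[k][0],
            )
            (m1, w1), (m2, w2) = self.centroids[i], self.centroids[i + 1]
            w = w1 + w2
            self.centroids[i:i + 2] = [((m1 * w1 + m2 * w2) / w, w)]


def build_sketch(values, max_centroids=100):
    """Stand-in for buildTDigestSketch: raw numeric values -> sketch."""
    sketch = ToySketch(max_centroids)
    for value in values:
        sketch.add(value)
    return sketch


def merge_sketches(sketches, max_centroids=100):
    """Stand-in for mergeTDigestSketch: many sketches -> one sketch."""
    merged = ToySketch(max_centroids)
    for sketch in sketches:
        merged.merge(sketch)
    return merged


def quantiles_from_sketch(sketch, fractions):
    """Stand-in for quantilesFromTDigestSketch: fractions -> quantiles."""
    return [sketch.quantile(q) for q in fractions]


# Two "segments" are sketched independently, then combined at "query time".
segment_a = build_sketch([1, 2, 3, 4, 5])
segment_b = build_sketch([6, 7, 8, 9, 10])
merged = merge_sketches([segment_a, segment_b])
print(quantiles_from_sketch(merged, [0.5, 0.9]))  # [5.0, 9.0]
```

In this example the inputs fit within the centroid budget, so the results are exact. With more values than centroids the toy sketch becomes approximate, and unlike t-digest it makes no attempt to keep extra resolution near the tails.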

Rationale

At my work, various data engineering teams have been using t-digest based sketch aggregations both in and outside of Druid. They have found it to be a good fit for their various use cases.

Operational impact

No operational impact.

Test plan (optional)

There is ample existing literature that evaluates the performance and correctness of t-digest. Beyond unit tests, the plan is to verify on a dev Druid cluster that the results returned by these aggregators are similar to those of t-digest aggregations in other frameworks such as Spark and MapReduce.

Future work (optional)

  1. Add SQL support
  2. When a new version of the t-digest library is released and its serialization format changes, it would be tricky to make the old and new versions interoperable. One option would be to write a new module every time the t-digest library is updated. Alternatively, we would need to devise a scheme for versioning the aggregators.
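
One possible shape for such a versioning scheme, shown purely as an illustration, is to prefix every serialized sketch with a format version byte and dispatch on it when reading. The layout below (a version byte, a centroid count, then mean/weight pairs) and the two format versions are hypothetical; this is not the t-digest library's actual wire format.

```python
import struct

# Hypothetical layout: 1-byte format version, 4-byte centroid count,
# then (mean, weight) pairs encoded according to the version.
_HEADER = ">BI"
_PAIR_FORMATS = {
    1: ">ff",  # assumed "old" format: 32-bit floats
    2: ">dd",  # assumed "new" format: 64-bit floats
}


def serialize(centroids, version=2):
    """Encode (mean, weight) pairs, tagging the bytes with a version."""
    pair_format = _PAIR_FORMATS[version]
    parts = [struct.pack(_HEADER, version, len(centroids))]
    parts += [struct.pack(pair_format, mean, weight) for mean, weight in centroids]
    return b"".join(parts)


def deserialize(data):
    """Decode bytes written by any supported version."""
    version, count = struct.unpack_from(_HEADER, data, 0)
    if version not in _PAIR_FORMATS:
        raise ValueError(f"unsupported sketch format version: {version}")
    pair_format = _PAIR_FORMATS[version]
    offset = struct.calcsize(_HEADER)
    size = struct.calcsize(pair_format)
    return [
        struct.unpack_from(pair_format, data, offset + i * size)
        for i in range(count)
    ]


centroids = [(1.5, 2.0), (10.25, 3.0)]
old_bytes = serialize(centroids, version=1)
new_bytes = serialize(centroids, version=2)
# A reader that knows both versions can mix segments written by either one.
print(deserialize(old_bytes) == deserialize(new_bytes))  # True
```

With a tag like this, a single module could keep reading sketches persisted by an older library version instead of requiring a new module per release, at the cost of maintaining one decoder per format.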
