Merged
32 changes: 18 additions & 14 deletions docs/development/extensions-core/datasketches-hll.md
@@ -23,29 +23,33 @@ title: "DataSketches HLL Sketch module"
-->


This module provides Apache Druid aggregators for distinct counting based on HLL sketch from [Apache DataSketches](https://datasketches.apache.org/) library. At ingestion time, this aggregator creates the HLL sketch objects to be stored in Druid segments. At query time, sketches are read and merged together. In the end, by default, you receive the estimate of the number of distinct values presented to the sketch. Also, you can use post aggregator to produce a union of sketch columns in the same row.
You can use the HLL sketch aggregator on columns of any identifiers. It will return estimated cardinality of the column.
This module provides Apache Druid aggregators for distinct counting based on HLL sketch from [Apache DataSketches](https://datasketches.apache.org/) library. At ingestion time, this aggregator creates the HLL sketch objects to store in Druid segments. By default, Druid reads and merges sketches at query time. The default result is
the estimate of the number of distinct values presented to the sketch. You can also use post aggregators to produce a union of sketch columns in the same row.
You can use the HLL sketch aggregator on any column to estimate its cardinality.

To use this aggregator, make sure you [include](../../development/extensions.md#loading-extensions) the extension in your config file:

```
druid.extensions.loadList=["druid-datasketches"]
```

### Aggregators
For additional sketch types supported in Druid, see [DataSketches extension](datasketches-extension.md).

|property|description|required?|
## Aggregators

|Property|Description|Required?|
|--------|-----------|---------|
|`type`|This String should be [`HLLSketchBuild`](#hllsketchbuild-aggregator) or [`HLLSketchMerge`](#hllsketchmerge-aggregator)|yes|
|`name`|A String for the output (result) name of the calculation.|yes|
|`fieldName`|A String for the name of the input field.|yes|
|`type`|Either [`HLLSketchBuild`](#hllsketchbuild-aggregator) or [`HLLSketchMerge`](#hllsketchmerge-aggregator).|yes|
|`name`|String representing the output column to store sketch values.|yes|
|`fieldName`|The name of the input field.|yes|
|`lgK`|Log base 2 of K, the number of buckets in the sketch. This parameter controls the size and accuracy of the sketch. Must be between 4 and 21, inclusive.|no, defaults to `12`|
|`tgtHllType`|The type of the target HLL sketch. Must be `HLL_4`, `HLL_6` or `HLL_8` |no, defaults to `HLL_4`|
|`round`|Round off values to whole numbers. Only affects query-time behavior and is ignored at ingestion-time.|no, defaults to `false`|
|`shouldFinalize`|Return the final double type representing the estimate rather than the intermediate sketch type itself. In addition to controlling the finalization of this aggregator, you can control whether all aggregators are finalized with the query context parameters [`finalize`](../../querying/query-context.md) and [`sqlFinalizeOuterSketches`](../../querying/sql-query-context.md).|no, defaults to `true`|

> The default `lgK` value is sufficient for most use cases. Under normal circumstances, expect only negligible improvements in accuracy with `lgK` values over `16`.
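
As an illustration of the properties in the table above, a build aggregator might look like the following minimal sketch. The output name `unique_users` and the input field `user_id` are hypothetical placeholders. `lgK` and `tgtHllType` are shown with their documented defaults, and `round` is enabled only to show where it goes.

```json
{
  "type": "HLLSketchBuild",
  "name": "unique_users",
  "fieldName": "user_id",
  "lgK": 12,
  "tgtHllType": "HLL_4",
  "round": true
}
```

Because `round` only affects query-time behavior, setting it in an ingestion spec has no effect.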

#### HLLSketchBuild Aggregator
### HLLSketchBuild aggregator

```
{
@@ -76,7 +80,7 @@ When applied at query time on an existing dimension, you can use the resulting c
> ```
>

#### HLLSketchMerge Aggregator
### HLLSketchMerge aggregator

```
{
@@ -91,9 +95,9 @@ When applied at query time on an existing dimension, you can use the resulting c

You can use the `HLLSketchMerge` aggregator to ingest pre-generated sketches from an input dataset. For example, you can set up a batch processing job to generate the sketches before sending the data to Druid. You must serialize the sketches in the input dataset to Base64-encoded bytes. Then, specify `HLLSketchMerge` for the input column in the native ingestion `metricsSpec`.
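
A minimal sketch of what such a `metricsSpec` entry might look like, assuming the input dataset has a column of Base64-encoded HLL sketches. The column name `user_sketch_base64` and the output name `user_sketch` are hypothetical, and the rest of the ingestion spec is omitted.

```json
{
  "metricsSpec": [
    {
      "type": "HLLSketchMerge",
      "name": "user_sketch",
      "fieldName": "user_sketch_base64"
    }
  ]
}
```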

### Post Aggregators
## Post aggregators

#### Estimate
### Estimate

Returns the distinct count estimate as a double.

@@ -106,7 +110,7 @@ Returns the distinct count estimate as a double.
}
```

#### Estimate with bounds
### Estimate with bounds

Returns a distinct count estimate and error bounds from an HLL sketch.
The result is an array containing three double values: estimate, lower bound, and upper bound.
@@ -122,7 +126,7 @@ This must be an integer value of 1, 2 or 3 corresponding to approximately 68.3%,
}
```

#### Union
### Union

```
{
@@ -134,7 +138,7 @@ This must be an integer value of 1, 2 or 3 corresponding to approximately 68.3%,
}
```

#### Sketch to string
### Sketch to string

Human-readable sketch summary for debugging.

30 changes: 16 additions & 14 deletions docs/development/extensions-core/datasketches-kll.md
@@ -37,7 +37,9 @@ To use this aggregator, make sure you [include](../../development/extensions.md#
druid.extensions.loadList=["druid-datasketches"]
```

### Aggregator
For additional sketch types supported in Druid, see [DataSketches extension](datasketches-extension.md).

## Aggregator

The result of the aggregation is a KllFloatsSketch or KllDoublesSketch that is the union of all sketches either built from raw data or read from the segments.

@@ -50,17 +52,17 @@ The result of the aggregation is a KllFloatsSketch or KllDoublesSketch that is t
}
```

|property|description|required?|
|Property|Description|Required?|
|--------|-----------|---------|
|type|This String should be "KllFloatsSketch" or "KllDoublesSketch"|yes|
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535. See [KLL Sketch Accuracy and Size](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).|no, defaults to 200|
|maxStreamLength|This parameter defines the number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including [TopN](../../querying/topnquery.md) and [GroupBy](../../querying/groupbyquery.md). Ideally, should be set high enough such that most sketches can stay off-heap.|no, defaults to 1000000000|
|`type`|Either "KllFloatsSketch" or "KllDoublesSketch"|yes|
|`name`|A String for the output (result) name of the calculation.|yes|
|`fieldName`|String for the name of the input field, which may contain sketches or raw numeric values.|yes|
|`k`|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535. See [KLL Sketch Accuracy and Size](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).|no, defaults to 200|
|`maxStreamLength`|This parameter defines the number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including [TopN](../../querying/topnquery.md) and [GroupBy](../../querying/groupbyquery.md). Ideally, should be set high enough such that most sketches can stay off-heap.|no, defaults to 1000000000|
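
For illustration, a KLL aggregator using the properties above might look like the following minimal sketch. The output name `latency_sketch` and the input field `latency_ms` are hypothetical placeholders. `k` and `maxStreamLength` are shown with their documented defaults and can be omitted.

```json
{
  "type": "KllDoublesSketch",
  "name": "latency_sketch",
  "fieldName": "latency_ms",
  "k": 200,
  "maxStreamLength": 1000000000
}
```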

### Post Aggregators
## Post aggregators

#### Quantile
### Quantile

This returns an approximation to the value that would be preceded by a given fraction of a hypothetical sorted version of the input stream.

@@ -73,7 +75,7 @@ This returns an approximation to the value that would be preceded by a given fra
}
```

#### Quantiles
### Quantiles

This returns an array of quantiles corresponding to a given array of fractions.

@@ -86,7 +88,7 @@ This returns an array of quantiles corresponding to a given array of fractions
}
```

#### Histogram
### Histogram

This returns an approximation to the histogram given either an array of split points that define the histogram bins or a number of bins (not both). An array of <i>m</i> unique, monotonically increasing split points divides the real number line into <i>m+1</i> consecutive disjoint intervals. Each interval includes its left split point and excludes its right split point. If the number of bins is specified instead of split points, the interval between the minimum and maximum values is divided into the given number of equally spaced bins.

@@ -100,7 +102,7 @@ This returns an approximation to the histogram given an array of split points th
}
```

#### Rank
### Rank

This returns an approximation to the rank of a given value, that is, the fraction of the distribution less than that value.

@@ -112,7 +114,7 @@ This returns an approximation to the rank of a given value that is the fraction
"value" : <value>
}
```
#### CDF
### CDF

This returns an approximation to the Cumulative Distribution Function given an array of split points that define the edges of the bins. An array of <i>m</i> unique, monotonically increasing split points divides the real number line into <i>m+1</i> consecutive disjoint intervals. Each interval includes its left split point and excludes its right split point. The resulting array of fractions can be viewed as the ranks of each split point, with one additional rank that is always 1.

@@ -125,7 +127,7 @@ This returns an approximation to the Cumulative Distribution Function given an a
}
```

#### Sketch Summary
### Sketch Summary

This returns a summary of the sketch that can be used for debugging. This is the result of calling the `toString()` method.

31 changes: 17 additions & 14 deletions docs/development/extensions-core/datasketches-quantiles.md
@@ -37,7 +37,9 @@ To use this aggregator, make sure you [include](../../development/extensions.md#
druid.extensions.loadList=["druid-datasketches"]
```

### Aggregator
For additional sketch types supported in Druid, see [DataSketches extension](datasketches-extension.md).

## Aggregator

The result of the aggregation is a DoublesSketch that is the union of all sketches either built from raw data or read from the segments.

@@ -50,17 +52,18 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
}
```

|property|description|required?|
|Property|Description|Required?|
|--------|-----------|---------|
|type|This String should always be "quantilesDoublesSketch"|yes|
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
|maxStreamLength|This parameter defines the number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including [TopN](../../querying/topnquery.md) and [GroupBy](../../querying/groupbyquery.md). Ideally, should be set high enough such that most sketches can stay off-heap.|no, defaults to 1000000000|
|`type`|This string should always be "quantilesDoublesSketch"|yes|
|`name`|String representing the output column to store sketch values.|yes|
|`fieldName`|A string for the name of the input field (can contain sketches or raw numeric values).|yes|
|`k`|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
|`maxStreamLength`|This parameter defines the number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including [TopN](../../querying/topnquery.md) and [GroupBy](../../querying/groupbyquery.md). Ideally, should be set high enough such that most sketches can stay off-heap.|no, defaults to 1000000000|
|`shouldFinalize`|Return the final double type representing the estimate rather than the intermediate sketch type itself. In addition to controlling the finalization of this aggregator, you can control whether all aggregators are finalized with the query context parameters [`finalize`](../../querying/query-context.md) and [`sqlFinalizeOuterSketches`](../../querying/sql-query-context.md).|no, defaults to `true`|

### Post Aggregators
## Post aggregators

#### Quantile
### Quantile

This returns an approximation to the value that would be preceded by a given fraction of a hypothetical sorted version of the input stream.

@@ -73,7 +76,7 @@ This returns an approximation to the value that would be preceded by a given fra
}
```

#### Quantiles
### Quantiles

This returns an array of quantiles corresponding to a given array of fractions.

@@ -86,7 +89,7 @@ This returns an array of quantiles corresponding to a given array of fractions
}
```

#### Histogram
### Histogram

This returns an approximation to the histogram given either an array of split points that define the histogram bins or a number of bins (not both). An array of <i>m</i> unique, monotonically increasing split points divides the real number line into <i>m+1</i> consecutive disjoint intervals. Each interval includes its left split point and excludes its right split point. If the number of bins is specified instead of split points, the interval between the minimum and maximum values is divided into the given number of equally spaced bins.

@@ -100,7 +103,7 @@ This returns an approximation to the histogram given an array of split points th
}
```

#### Rank
### Rank

This returns an approximation to the rank of a given value, that is, the fraction of the distribution less than that value.

@@ -112,7 +115,7 @@ This returns an approximation to the rank of a given value that is the fraction
"value" : <value>
}
```
#### CDF
### CDF

This returns an approximation to the Cumulative Distribution Function given an array of split points that define the edges of the bins. An array of <i>m</i> unique, monotonically increasing split points divides the real number line into <i>m+1</i> consecutive disjoint intervals. Each interval includes its left split point and excludes its right split point. The resulting array of fractions can be viewed as the ranks of each split point, with one additional rank that is always 1.

@@ -125,7 +128,7 @@ This returns an approximation to the Cumulative Distribution Function given an a
}
```

#### Sketch Summary
### Sketch summary

This returns a summary of the sketch that can be used for debugging. This is the result of calling the `toString()` method.
