43 changes: 38 additions & 5 deletions docs/development/extensions-core/datasketches-tuple.md
@@ -39,19 +39,52 @@ druid.extensions.loadList=["druid-datasketches"]
"name" : <output_name>,
"fieldName" : <metric_name>,
"nominalEntries": <number>,
"metricColumns" : <array of strings>,
"numberOfValues" : <number>
}
```

|property|description|required?|
|--------|-----------|---------|
|type|This String should always be "arrayOfDoublesSketch"|yes|
|name|String representing the output column to store sketch values.|yes|
|fieldName|A String for the name of the input field.|yes|
|nominalEntries|Parameter that determines the accuracy and size of the sketch. A higher number means higher accuracy but more space to store sketches. Must be a power of 2. See [Theta sketch accuracy](https://datasketches.apache.org/docs/Theta/ThetaErrorTable) for details.|no, defaults to 16384|
|metricColumns|When building sketches from raw data, an array of input columns that contain numeric values to associate with each distinct key.|no, if not provided, `fieldName` is assumed to be an `arrayOfDoublesSketch`|
|numberOfValues|Number of values associated with each distinct key.|no, defaults to the length of `metricColumns` if provided, and 1 otherwise|

You can use the `arrayOfDoublesSketch` aggregator to:

- Build a sketch from raw data. In this case, set `metricColumns` to an array of input column names.
- Build a sketch from an existing `ArrayOfDoubles` sketch. In this case, leave `metricColumns` unset and set `fieldName` to an `ArrayOfDoubles` sketch with `numberOfValues` doubles. You must base64 encode `ArrayOfDoubles` sketches at ingestion time.
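To satisfy the base64 requirement in the second case, encode the serialized sketch bytes before ingestion. A minimal Python sketch of that step, assuming you already have the serialized `ArrayOfDoubles` bytes (the byte literal below is a stand-in, not a real sketch):

```python
import base64

# Stand-in for the serialized bytes of an ArrayOfDoubles sketch;
# in practice these come from the DataSketches library.
sketch_bytes = b"\x01\x03\x03\x00\x00\x1e\x00\x00"

# Druid expects precomputed sketch values to be base64 encoded at ingestion time.
encoded = base64.b64encode(sketch_bytes).decode("ascii")
print(encoded)
```

The encoded string is what you place in the input column (`user_sketches` in the example below) before ingestion.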

#### Example on top of raw data

Compute a theta sketch of unique users. For each user, store the `added` and `deleted` scores. The new sketch column is called `users_theta`.

```json
{
  "type": "arrayOfDoublesSketch",
  "name": "users_theta",
  "fieldName": "user",
  "nominalEntries": 16384,
  "metricColumns": ["added", "deleted"]
}
```
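For orientation, an aggregator like the one above typically sits inside the `metricsSpec` of an ingestion spec. A minimal, hypothetical fragment (the surrounding spec is elided):

```json
"metricsSpec": [
  {
    "type": "arrayOfDoublesSketch",
    "name": "users_theta",
    "fieldName": "user",
    "nominalEntries": 16384,
    "metricColumns": ["added", "deleted"]
  }
]
```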

#### Example ingesting a precomputed sketch column

Ingest a precomputed sketch column called `user_sketches`, whose base64 encoded values each contain two doubles, and store it in a column called `users_theta`.

```json
{
  "type": "arrayOfDoublesSketch",
  "name": "users_theta",
  "fieldName": "user_sketches",
  "nominalEntries": 16384,
  "numberOfValues": 2
}
```

### Post Aggregators

3 changes: 2 additions & 1 deletion docs/multi-stage-query/concepts.md
@@ -233,7 +233,8 @@ happens:
The [`maxNumTasks`](./reference.md#context-parameters) query parameter determines the maximum number of tasks your
query will use, including the one `query_controller` task. Generally, queries perform better with more workers. The
lowest possible value of `maxNumTasks` is two (one worker and one controller). Do not set this higher than the number of
free slots available in your cluster; doing so will result in a [TaskStartTimeout](reference.md#error_TaskStartTimeout)
error.
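As a concrete illustration, a hypothetical SQL task payload that caps the query at three tasks (one controller plus two workers); the query text and table names are illustrative:

```json
{
  "query": "INSERT INTO target_table SELECT * FROM source_table PARTITIONED BY DAY",
  "context": {
    "maxNumTasks": 3
  }
}
```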

When [reading external data](#extern), EXTERN can read multiple files in parallel across
different worker tasks. However, EXTERN does not split individual files across multiple worker tasks. If you have a
8 changes: 5 additions & 3 deletions docs/multi-stage-query/known-issues.md
@@ -33,16 +33,18 @@ sidebar_label: Known issues

- Worker task stage outputs are stored in the working directory given by `druid.indexer.task.baseDir`. Stages that
generate a large amount of output data may exhaust all available disk space. In this case, the query fails with
an [UnknownError](./reference.md#error_UnknownError) with a message including "No space left on device".

## SELECT

- SELECT from a Druid datasource does not include unpublished real-time data.

- GROUPING SETS and UNION ALL are not implemented. Queries using these features return a
[QueryNotSupported](reference.md#error_QueryNotSupported) error.

- For some COUNT DISTINCT queries, you'll encounter a [QueryNotSupported](reference.md#error_QueryNotSupported) error
that includes `Must not have 'subtotalsSpec'` as one of its causes. This is caused by the planner attempting to use
GROUPING SETS, which are not implemented.
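One query shape that can plan this way is exact COUNT DISTINCT over more than one column, since the planner may expand multiple distinct aggregates using GROUPING SETS. A hypothetical example (column and table names are illustrative):

```sql
SELECT
  COUNT(DISTINCT col_a),
  COUNT(DISTINCT col_b)
FROM my_table
```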

- The numeric varieties of the EARLIEST and LATEST aggregators do not work properly. Attempting to use the numeric
varieties of these aggregators leads to an error like