Merged
Changes from all commits
Original file line number Diff line number Diff line change
@@ -37,6 +37,7 @@ The Approximate Histogram aggregator is deprecated. Please use <a href="../exten
This aggregator is based on
[http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf](http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf)
to compute approximate histograms, with the following modifications:

- some tradeoffs in accuracy were made in the interest of speed (see below)
- the sketch maintains the exact original data as long as the number of
distinct data points is fewer than the resolution (number of centroids),
1 change: 1 addition & 0 deletions docs/content/development/extensions-core/bloom-filter.md
@@ -33,6 +33,7 @@ to use with Druid for cases where an explicit filter is impossible, e.g. filteri
values.

Following are some characteristics of BloomFilters:

- BloomFilters are highly space efficient when compared to using a HashSet.
- Because of the probabilistic nature of bloom filters, false positive results are possible (element was not actually
inserted into a bloom filter during construction, but `test()` says true)
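The false-positive behavior can be illustrated with a toy filter. This sketch is not the extension's implementation; the bit count and hash scheme are arbitrary assumptions chosen only to show the asymmetry between `add()` and `test()`:

```python
import hashlib

class TinyBloomFilter:
    """Toy bloom filter for illustration; not Druid's implementation."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # integer used as a bitset

    def _positions(self, value):
        # Derive k bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def test(self, value):
        # May return True for values never added (false positive),
        # but never False for values that were added.
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = TinyBloomFilter()
for v in ["a", "b", "c"]:
    bf.add(v)
assert bf.test("a")  # inserted values are always found
```

Because `test()` only checks bit positions, an uninserted value whose positions happen to all be set will also report `True`.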
@@ -25,6 +25,7 @@ title: "Basic Security"
# Druid Basic Security

This Apache Druid (incubating) extension adds:

- an Authenticator which supports [HTTP Basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
- an Authorizer which implements basic role-based access control
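For example, a client authenticating against the Basic Authenticator sends a standard `Authorization: Basic <base64(user:password)>` header. A minimal sketch of constructing that header; the credential values are placeholders:

```python
import base64

def basic_auth_header(user: str, password: str) -> dict:
    """Build the HTTP Basic authentication header a client would send."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

# Placeholder credentials for illustration only.
headers = basic_auth_header("druid_user", "pass")
```

The same header shape works with any HTTP client when calling Druid APIs protected by this authenticator.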

@@ -342,6 +343,7 @@ Unassign role {roleName} from user {userName}
Set the permissions of {roleName}. This replaces the previous set of permissions on the role.

Content: List of JSON Resource-Action objects, e.g.:

```
[
{
```
1 change: 1 addition & 0 deletions docs/content/ingestion/native_tasks.md
@@ -55,6 +55,7 @@ the implementation of splittable firehoses. Please note that multiple tasks can
if one of them fails.

You may want to consider the following points:

- Since this task doesn't shuffle intermediate data, it isn't available for [perfect rollup](../ingestion/index.html#roll-up-modes).
- The number of tasks for parallel ingestion is decided by `maxNumSubTasks` in the tuningConfig.
Since the supervisor task creates up to `maxNumSubTasks` worker tasks regardless of the available task slots,
2 changes: 2 additions & 0 deletions docs/content/operations/basic-cluster-tuning.md
@@ -37,6 +37,7 @@ If you have questions on tuning Druid for specific use cases, or questions on co
#### Heap sizing

The biggest contributions to heap usage on Historicals are:

- Partial unmerged query results from segments
- The stored maps for [lookups](../querying/lookups.html).

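As an illustration only, a back-of-the-envelope heap estimate combining these two contributions might look like the following. The 0.5GB-per-core figure for partial query results is an assumed placeholder, not a value given in this document; only the doubled lookup term comes from the guideline quoted in this section:

```python
GB = 1024 ** 3

def historical_heap_estimate(num_cores: int, total_lookup_size_bytes: int) -> int:
    """Rough Historical heap estimate (illustrative assumptions, see above)."""
    # Assumed rule of thumb for partial unmerged query results per core.
    query_result_heap = int(0.5 * GB) * num_cores
    # The section advises adding (2 * total size of all loaded lookups).
    lookup_heap = 2 * total_lookup_size_bytes
    return query_result_heap + lookup_heap
```

For an 8-core server with 1GB of loaded lookups, this sketch would suggest roughly 6GB of heap under the stated assumptions.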
@@ -63,6 +64,7 @@ Be sure to add `(2 * total size of all loaded lookups)` to your heap size in add
Please see the [General Guidelines for Processing Threads and Buffers](#general-guidelines-for-processing-threads-and-buffers) section for an overview of processing thread/buffer configuration.

On Historicals:

- `druid.processing.numThreads` should generally be set to `(number of cores - 1)`: a smaller value can result in CPU underutilization, while going over the number of cores can result in unnecessary CPU contention.
- `druid.processing.buffer.sizeBytes` can be set to 500MB.
- `druid.processing.numMergeBuffers`: a 1:4 ratio of merge buffers to processing threads is a reasonable choice for general use.
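As a rough sketch, the bullets above can be expressed as a helper that derives the Historical processing settings from a core count. The `max(2, ...)` floor on merge buffers for small machines is an assumption, not stated in the text:

```python
def historical_processing_config(num_cores: int) -> dict:
    """Derive Historical processing settings per the guidelines above."""
    num_threads = max(1, num_cores - 1)  # (number of cores - 1)
    return {
        "druid.processing.numThreads": num_threads,
        "druid.processing.buffer.sizeBytes": 500_000_000,  # 500MB
        # 1:4 merge-buffer-to-processing-thread ratio; floor of 2 is assumed.
        "druid.processing.numMergeBuffers": max(2, num_threads // 4),
    }

cfg = historical_processing_config(16)  # e.g. a 16-core Historical
```

For a 16-core server this yields 15 processing threads and 3 merge buffers under these assumptions.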
1 change: 1 addition & 0 deletions docs/content/operations/deep-storage-migration.md
@@ -28,6 +28,7 @@ If you have been running an evaluation Druid cluster using local deep storage an
more production-capable deep storage system such as S3 or HDFS, this document describes the necessary steps.

Migration of deep storage involves the following steps at a high level:

- Copying segments from local deep storage to the new deep storage
- Exporting Druid's segments table from metadata
- Rewriting the load specs in the exported segment data to reflect the new deep storage location
22 changes: 12 additions & 10 deletions docs/content/operations/export-metadata.md
@@ -27,6 +27,7 @@ title: "Export Metadata Tool"
Druid includes an `export-metadata` tool for assisting with migration of cluster metadata and deep storage.

This tool exports the contents of the following Druid metadata tables:

- segments
- rules
- config
@@ -37,6 +38,7 @@ Additionally, the tool can rewrite the local deep storage location descriptors i
to point to new deep storage locations (S3, HDFS, and local rewrite paths are supported).

The tool has the following limitations:

- Only exporting from Derby metadata is currently supported
- If rewriting load specs for deep storage migration, only migrating from local deep storage is currently supported.

@@ -46,20 +48,19 @@ The `export-metadata` tool provides the following options:

### Connection Properties

`--connectURI`: The URI of the Derby database, e.g. `jdbc:derby://localhost:1527/var/druid/metadata.db;create=true`
`--user`: Username
`--password`: Password
`--base`: corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
- `--connectURI`: The URI of the Derby database, e.g. `jdbc:derby://localhost:1527/var/druid/metadata.db;create=true`
- `--user`: Username
- `--password`: Password
- `--base`: corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.

### Output Path

`--output-path`, `-o`: The output directory of the tool. CSV files for the Druid segments, rules, config, datasource, and supervisors tables will be written to this directory.
- `--output-path`, `-o`: The output directory of the tool. CSV files for the Druid segments, rules, config, datasource, and supervisors tables will be written to this directory.

### Export Format Options

`--use-hex-blobs`, `-x`: If set, export BLOB payload columns as hexadecimal strings. This needs to be set if importing back into Derby. Default is false.

`--booleans-as-strings`, `-t`: If set, write boolean values as "true" or "false" instead of "1" and "0". This needs to be set if importing back into Derby. Default is false.
- `--use-hex-blobs`, `-x`: If set, export BLOB payload columns as hexadecimal strings. This needs to be set if importing back into Derby. Default is false.
- `--booleans-as-strings`, `-t`: If set, write boolean values as "true" or "false" instead of "1" and "0". This needs to be set if importing back into Derby. Default is false.

### Deep Storage Migration

@@ -69,8 +70,8 @@ By setting the options below, the tool will rewrite the segment load specs to po

This helps users migrate segments stored in local deep storage to S3.

`--s3bucket`, `-b`: The S3 bucket that will hold the migrated segments
`--s3baseKey`, `-k`: The base S3 key where the migrated segments will be stored
- `--s3bucket`, `-b`: The S3 bucket that will hold the migrated segments
- `--s3baseKey`, `-k`: The base S3 key where the migrated segments will be stored

When copying the local deep storage segments to S3, the rewrite performed by this tool requires that the directory structure of the segments be unchanged.
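For illustration, a sketch of the kind of load spec rewrite described here, assuming a `local` load spec with a `path` field and an S3 spec of type `s3_zip`. The exact field names and key layout are assumptions for this sketch, not taken from the tool's output:

```python
def rewrite_local_load_spec_to_s3(load_spec: dict, bucket: str, base_key: str) -> dict:
    """Rewrite a local load spec to S3, preserving the directory structure."""
    # Keep the segment's relative path under the new base key, since the
    # rewrite requires the directory structure to be unchanged.
    relative_path = load_spec["path"].lstrip("/")
    return {
        "type": "s3_zip",   # assumed S3 load spec type
        "bucket": bucket,
        "key": f"{base_key}/{relative_path}",
    }

# Hypothetical local segment load spec.
spec = {"type": "local", "path": "/druid/segments/wiki/index.zip"}
rewritten = rewrite_local_load_spec_to_s3(spec, "my-bucket", "druid/segments")
```

The key point is that only the location prefix changes; the per-segment path layout carries over unmodified.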

@@ -142,6 +143,7 @@ java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log
```

In the example command above:

- `lib` is the Druid lib directory
- `extensions` is the Druid extensions directory
- `/tmp/csv` is the output directory. Please make sure that this directory exists.
1 change: 1 addition & 0 deletions docs/content/operations/metadata-migration.md
@@ -61,6 +61,7 @@ Update your Druid runtime properties with the new metadata configuration.
Druid provides a `metadata-init` tool for creating Druid's metadata tables. After initializing the Druid database, you can run the commands shown below from the root of the Druid package to initialize the tables.

In the example commands below:

- `lib` is the Druid lib directory
- `extensions` is the Druid extensions directory
- `base` corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
1 change: 1 addition & 0 deletions docs/content/operations/recommendations.md
@@ -59,6 +59,7 @@ JVM Flags:
Please note that the above flags are general guidelines only. Be cautious and adjust them as necessary for your specific deployment.

Additionally, for large JVM heaps, here are a few garbage collection efficiency guidelines that have been known to help in some cases.

- Mount /tmp on tmpfs (see http://www.evanjones.ca/jvm-mmap-pause.html)
- On disk-I/O-intensive processes (e.g. Historical and MiddleManager), GC and Druid logs should be written to a different disk than where data is written.
- Disable Transparent Huge Pages (see https://blogs.oracle.com/linux/performance-issues-with-transparent-huge-pages-thp)
1 change: 1 addition & 0 deletions docs/content/querying/aggregations.md
@@ -337,6 +337,7 @@ The [Approximate Histogram](../development/extensions-core/approximate-histogram
The algorithm used by this deprecated aggregator is highly distribution-dependent and its output is subject to serious distortions when the input does not fit within the algorithm's limitations.

A [study published by the DataSketches team](https://datasketches.github.io/docs/Quantiles/DruidApproxHistogramStudy.html) demonstrates some of the known failure modes of this algorithm:

- The algorithm's quantile calculations can fail to provide results for a large range of rank values (all ranks less than 0.89 in the example used in the study), returning all zeroes instead.
- The algorithm can completely fail to record spikes in the tail ends of the distribution.
- In general, the histogram produced by the algorithm can deviate significantly from the true histogram, with no bounds on the errors.
9 changes: 9 additions & 0 deletions docs/content/tutorials/cluster.md
@@ -30,6 +30,7 @@ In this document, we'll set up a simple cluster and discuss how it can be furthe
your needs.

This simple cluster will feature:

- A Master server to host the Coordinator and Overlord processes
- Two scalable, fault-tolerant Data servers running Historical and MiddleManager processes
- A query server, hosting the Druid Broker and Router processes
@@ -49,6 +50,7 @@ The Coordinator and Overlord processes are responsible for handling the metadata
In this example, we will be deploying the equivalent of one AWS [m5.2xlarge](https://aws.amazon.com/ec2/instance-types/m5/) instance.

This hardware offers:

- 8 vCPUs
- 31 GB RAM

@@ -77,6 +79,7 @@ in-memory query cache. These servers benefit greatly from CPU and RAM.
In this example, we will be deploying the equivalent of one AWS [m5.2xlarge](https://aws.amazon.com/ec2/instance-types/m5/) instance.

This hardware offers:

- 8 vCPUs
- 31 GB RAM

@@ -323,13 +326,15 @@ You can copy your existing `coordinator-overlord` configs from the single-server
Suppose we are migrating from a single-server deployment that had 32 CPU and 256GB RAM. In the old deployment, the following configurations for Historicals and MiddleManagers were applied:

Historical (Single-server)

```
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=8
druid.processing.numThreads=31
```

MiddleManager (Single-server)

```
druid.worker.capacity=8
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
@@ -340,11 +345,13 @@ druid.indexer.fork.property.druid.processing.numThreads=1
In the clustered deployment, we can choose a split factor (2 in this example), and deploy 2 Data servers with 16CPU and 128GB RAM each. The areas to scale are the following:

Historical

- `druid.processing.numThreads`: Set to `(num_cores - 1)` based on the new hardware
- `druid.processing.numMergeBuffers`: Divide the old value from the single-server deployment by the split factor
- `druid.processing.buffer.sizeBytes`: Keep this unchanged

MiddleManager:

- `druid.worker.capacity`: Divide the old value from the single-server deployment by the split factor
- `druid.indexer.fork.property.druid.processing.numMergeBuffers`: Keep this unchanged
- `druid.indexer.fork.property.druid.processing.buffer.sizeBytes`: Keep this unchanged
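The split rules above can be sketched as a small helper. The property names match the Druid settings quoted in this section; the helper itself is illustrative and assumes the split factor divides the old values evenly:

```python
def split_data_server_configs(historical: dict, middle_manager: dict,
                              split_factor: int) -> tuple:
    """Apply the scaling rules above for a single-server-to-cluster split."""
    # Divide merge buffers by the split factor; buffer size stays unchanged.
    # numThreads is recomputed from the new per-server core count separately.
    new_historical = dict(historical)
    new_historical["druid.processing.numMergeBuffers"] //= split_factor
    # Divide worker capacity by the split factor; fork properties unchanged.
    new_mm = dict(middle_manager)
    new_mm["druid.worker.capacity"] //= split_factor
    return new_historical, new_mm

old_hist = {"druid.processing.numMergeBuffers": 8,
            "druid.processing.buffer.sizeBytes": 500_000_000}
old_mm = {"druid.worker.capacity": 8,
          "druid.indexer.fork.property.druid.processing.numMergeBuffers": 2}
new_hist, new_mm = split_data_server_configs(old_hist, old_mm, split_factor=2)
```

With a split factor of 2, the 8 merge buffers and worker capacity of 8 from the single server become 4 each on the two new Data servers.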
@@ -353,13 +360,15 @@ MiddleManager:
The resulting configs after the split:

New Historical (on 2 Data servers)

```
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=4
druid.processing.numThreads=15
```

New MiddleManager (on 2 Data servers)

```
druid.worker.capacity=4
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
```