Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/content/comparisons/druid-vs-kudu.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,4 +37,4 @@ fast in Druid, whereas updates of older data is higher latency. This is by desig
and does not need to be updated too frequently. Kudu supports arbitrary primary keys with uniqueness constraints, and
efficient lookup by ranges of those keys. Kudu chooses not to include the execution engine, but supports sufficient
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (eg MR, Spark, and SQL).
Druid includes its own query layer that allows it to push down aggregations and computations directly to data nodes for faster query processing.
Druid includes its own query layer that allows it to push down aggregations and computations directly to data processes for faster query processing.
4 changes: 2 additions & 2 deletions docs/content/comparisons/druid-vs-redshift.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ Druid’s write semantics are not as fluid and does not support full joins (we s

### Data distribution model

Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of Historical nodes does not result in data loss because new Historical nodes can always be brought up by reading data from "deep" storage.
Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of Historical processes does not result in data loss because new Historical processes can always be brought up by reading data from "deep" storage.

To contrast, ParAccel’s data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon’s Redshift works around this issue with a multi-step process:

Expand All @@ -52,7 +52,7 @@ To contrast, ParAccel’s data distribution model is hash-based. Expanding the c

### Replication strategy

Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance.
Druid employs segment-level data distribution meaning that more processes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance.

ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.

Expand Down
164 changes: 82 additions & 82 deletions docs/content/configuration/index.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/configuration/logging.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ title: "Logging"

# Logging

Druid nodes will emit logs that are useful for debugging to the console. Druid nodes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.
Druid processes will emit logs that are useful for debugging to the console. Druid processes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.

Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (e.g. the _common/ dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir.

Expand Down
18 changes: 9 additions & 9 deletions docs/content/configuration/realtime.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
layout: doc_page
title: "Realtime Node Configuration"
title: "Realtime Process Configuration"
---

<!--
Expand All @@ -22,20 +22,20 @@ title: "Realtime Node Configuration"
~ under the License.
-->

# Realtime Node Configuration
# Realtime Process Configuration

For general Realtime Node information, see [here](../design/realtime.html).
For general Realtime Process information, see [here](../design/realtime.html).

Runtime Configuration
---------------------

The realtime node uses several of the global configs in [Configuration](../configuration/index.html) and has the following set of configurations as well:
The realtime process uses several of the global configs in [Configuration](../configuration/index.html) and has the following set of configurations as well:

### Node Config
### Process Config

|Property|Description|Default|
|--------|-----------|-------|
|`druid.host`|The host for the current node. This is used to advertise the current processes location as reachable from another node and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.host`|The host for the current process. This is used to advertise the current processes location as reachable from another process and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.plaintextPort`|This is the port to actually listen on; unless port mapping is used, this will be the same port as is on `druid.host`|8084|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.html) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8284|
|`druid.service`|The name of the service. This is used as a dimension when emitting metrics and alerts to differentiate between the various services|druid/realtime|
Expand All @@ -60,8 +60,8 @@ The realtime node uses several of the global configs in [Configuration](../confi

|Property|Description|Default|
|--------|-----------|-------|
|`druid.processing.buffer.sizeBytes`|This specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime nodes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data while smaller values can require more passes depending on the query that is being executed.|auto (max 1GB)|
|`druid.processing.formatString`|Realtime and Historical nodes use this format string to name their processing threads.|processing-%s|
|`druid.processing.buffer.sizeBytes`|This specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime processes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data while smaller values can require more passes depending on the query that is being executed.|auto (max 1GB)|
|`druid.processing.formatString`|Realtime and Historical processes use this format string to name their processing threads.|processing-%s|
|`druid.processing.numMergeBuffers`|The number of direct memory buffers available for merging query results. The buffers are sized by `druid.processing.buffer.sizeBytes`. This property is effectively a concurrency limit for queries that require merging buffers. If you are using any queries that require merge buffers (currently, just groupBy v2) then you should have at least two of these.|`max(2, druid.processing.numThreads / 4)`|
|`druid.processing.numThreads`|The number of processing threads to have available for parallel processing of segments. Our rule of thumb is `num_cores - 1`, which means that even under heavy load there will still be one core available to do background tasks like talking with ZooKeeper and pulling down segments. If only one core is available, this property defaults to the value `1`.|Number of cores - 1 (or 1)|
|`druid.processing.columnCache.sizeBytes`|Maximum size in bytes for the dimension value lookup cache. Any value greater than `0` enables the cache. It is currently disabled by default. Enabling the lookup cache can significantly improve the performance of aggregators operating on dimension values, such as the JavaScript aggregator, or cardinality aggregator, but can slow things down if the cache hit rate is low (i.e. dimensions with few repeating values). Enabling it may also require additional garbage collection tuning to avoid long GC pauses.|`0` (disabled)|
Expand All @@ -86,7 +86,7 @@ See [groupBy server configuration](../querying/groupbyquery.html#server-configur

### Caching

You can optionally configure caching to be enabled on the realtime node by setting caching configs here.
You can optionally configure caching to be enabled on the realtime process by setting caching configs here.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
Expand Down
4 changes: 2 additions & 2 deletions docs/content/dependencies/cassandra-deep-storage.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,8 @@ title: "Cassandra Deep Storage"
Druid can use Cassandra as a deep storage mechanism. Segments and their metadata are stored in Cassandra in two tables:
`index_storage` and `descriptor_storage`. Underneath the hood, the Cassandra integration leverages Astyanax. The
index storage table is a [Chunked Object](https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store) repository. It contains
compressed segments for distribution to Historical nodes. Since segments can be large, the Chunked Object storage allows the integration to multi-thread
the write to Cassandra, and spreads the data across all the nodes in a cluster. The descriptor storage table is a normal C* table that
compressed segments for distribution to Historical processes. Since segments can be large, the Chunked Object storage allows the integration to multi-thread
the write to Cassandra, and spreads the data across all the processes in a cluster. The descriptor storage table is a normal C* table that
stores the segment metadatak.

## Schema
Expand Down
2 changes: 1 addition & 1 deletion docs/content/dependencies/deep-storage.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ title: "Deep Storage"

# Deep Storage

Deep storage is where segments are stored. It is a storage mechanism that Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid nodes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
Deep storage is where segments are stored. It is a storage mechanism that Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.

## Local Mount

Expand Down
6 changes: 3 additions & 3 deletions docs/content/dependencies/metadata-storage.md
Original file line number Diff line number Diff line change
Expand Up @@ -134,8 +134,8 @@ config changes.

The Metadata Storage is accessed only by:

1. Indexing Service Nodes (if any)
2. Realtime Nodes (if any)
3. Coordinator Nodes
1. Indexing Service Processes (if any)
2. Realtime Processes (if any)
3. Coordinator Processes

Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
10 changes: 5 additions & 5 deletions docs/content/dependencies/zookeeper.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ ${druid.zk.paths.coordinatorPath}/_COORDINATOR

The `announcementsPath` and `servedSegmentsPath` are used for this.

All [Historical](../design/historical.html) and [Realtime](../design/realtime.html) nodes publish themselves on the `announcementsPath`, specifically, they will create an ephemeral znode at
All [Historical](../design/historical.html) and [Realtime](../design/realtime.html) processes publish themselves on the `announcementsPath`, specifically, they will create an ephemeral znode at

```
${druid.zk.paths.announcementsPath}/${druid.host}
Expand All @@ -62,16 +62,16 @@ And as they load up segments, they will attach ephemeral znodes that look like
${druid.zk.paths.servedSegmentsPath}/${druid.host}/_segment_identifier_
```

Nodes like the [Coordinator](../design/coordinator.html) and [Broker](../design/broker.html) can then watch these paths to see which nodes are currently serving which segments.
Processes like the [Coordinator](../design/coordinator.html) and [Broker](../design/broker.html) can then watch these paths to see which processes are currently serving which segments.

### Segment load/drop protocol between Coordinator and Historical

The `loadQueuePath` is used for this.

When the [Coordinator](../design/coordinator.html) decides that a [Historical](../design/historical.html) node should load or drop a segment, it writes an ephemeral znode to
When the [Coordinator](../design/coordinator.html) decides that a [Historical](../design/historical.html) process should load or drop a segment, it writes an ephemeral znode to

```
${druid.zk.paths.loadQueuePath}/_host_of_historical_node/_segment_identifier
${druid.zk.paths.loadQueuePath}/_host_of_historical_process/_segment_identifier
```

This node will contain a payload that indicates to the Historical node what it should do with the given segment. When the Historical node is done with the work, it will delete the znode in order to signify to the Coordinator that it is complete.
This znode will contain a payload that indicates to the Historical process what it should do with the given segment. When the Historical process is done with the work, it will delete the znode in order to signify to the Coordinator that it is complete.
2 changes: 1 addition & 1 deletion docs/content/design/auth.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,7 +123,7 @@ An Authenticator implementation should provide some means through configuration

## Internal System User

Internal requests between Druid nodes (non-user initiated communications) need to have authentication credentials attached.
Internal requests between Druid processes (non-user initiated communications) need to have authentication credentials attached.

These requests should be run as an "internal system user", an identity that represents the Druid cluster itself, with full access permissions.

Expand Down
14 changes: 7 additions & 7 deletions docs/content/design/broker.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,16 +26,16 @@ title: "Broker"

### Configuration

For Broker Node Configuration, see [Broker Configuration](../configuration/index.html#broker).
For Broker Process Configuration, see [Broker Configuration](../configuration/index.html#broker).

### HTTP endpoints

For a list of API endpoints supported by the Broker, see [Broker API](../operations/api-reference.html#broker).

### Overview

The Broker is the node to route queries to if you want to run a distributed cluster. It understands the metadata published to ZooKeeper about what segments exist on what nodes and routes queries such that they hit the right nodes. This node also merges the result sets from all of the individual nodes together.
On start up, Realtime nodes announce themselves and the segments they are serving in Zookeeper.
The Broker is the process to route queries to if you want to run a distributed cluster. It understands the metadata published to ZooKeeper about what segments exist on what processes and routes queries such that they hit the right processes. This process also merges the result sets from all of the individual processes together.
On start up, Historical processes announce themselves and the segments they are serving in Zookeeper.

### Running

Expand All @@ -45,11 +45,11 @@ org.apache.druid.cli.Main server broker

### Forwarding Queries

Most druid queries contain an interval object that indicates a span of time for which data is requested. Likewise, Druid [Segments](../design/segments.html) are partitioned to contain data for some interval of time and segments are distributed across a cluster. Consider a simple datasource with 7 segments where each segment contains data for a given day of the week. Any query issued to the datasource for more than one day of data will hit more than one segment. These segments will likely be distributed across multiple nodes, and hence, the query will likely hit multiple nodes.
Most druid queries contain an interval object that indicates a span of time for which data is requested. Likewise, Druid [Segments](../design/segments.html) are partitioned to contain data for some interval of time and segments are distributed across a cluster. Consider a simple datasource with 7 segments where each segment contains data for a given day of the week. Any query issued to the datasource for more than one day of data will hit more than one segment. These segments will likely be distributed across multiple processes, and hence, the query will likely hit multiple processes.

To determine which nodes to forward queries to, the Broker node first builds a view of the world from information in Zookeeper. Zookeeper maintains information about [Historical](../design/historical.html) and [Realtime](../design/realtime.html) nodes and the segments they are serving. For every datasource in Zookeeper, the Broker node builds a timeline of segments and the nodes that serve them. When queries are received for a specific datasource and interval, the Broker node performs a lookup into the timeline associated with the query datasource for the query interval and retrieves the nodes that contain data for the query. The Broker process then forwards down the query to the selected nodes.
To determine which processes to forward queries to, the Broker process first builds a view of the world from information in Zookeeper. Zookeeper maintains information about [Historical](../design/historical.html) and streaming ingestion [Peon](../design/peon.html) processes and the segments they are serving. For every datasource in Zookeeper, the Broker process builds a timeline of segments and the processes that serve them. When queries are received for a specific datasource and interval, the Broker process performs a lookup into the timeline associated with the query datasource for the query interval and retrieves the processes that contain data for the query. The Broker process then forwards down the query to the selected processes.

### Caching

Broker nodes employ a cache with a LRU cache invalidation strategy. The Broker cache stores per-segment results. The cache can be local to each Broker process or shared across multiple processes using an external distributed cache such as [memcached](http://memcached.org/). Each time a broker node receives a query, it first maps the query to a set of segments. A subset of these segment results may already exist in the cache and the results can be directly pulled from the cache. For any segment results that do not exist in the cache, the broker node will forward the query to the
Historical nodes. Once the Historical processes return their results, the Broker will store those results in the cache. Real-time segments are never cached and hence requests for real-time data will always be forwarded to real-time nodes. Real-time data is perpetually changing and caching the results would be unreliable.
Broker processes employ a cache with a LRU cache invalidation strategy. The Broker cache stores per-segment results. The cache can be local to each Broker process or shared across multiple processes using an external distributed cache such as [memcached](http://memcached.org/). Each time a broker process receives a query, it first maps the query to a set of segments. A subset of these segment results may already exist in the cache and the results can be directly pulled from the cache. For any segment results that do not exist in the cache, the broker process will forward the query to the
Historical processes. Once the Historical processes return their results, the Broker will store those results in the cache. Real-time segments are never cached and hence requests for real-time data will always be forwarded to real-time processes. Real-time data is perpetually changing and caching the results would be unreliable.
Loading