7 changes: 5 additions & 2 deletions docs/Aggregations.md
@@ -1,3 +1,6 @@
+---
+layout: default
+---
Aggregations are specifications of processing over metrics available in Druid.
Available aggregations are:

@@ -13,7 +16,7 @@ computes the sum of values as a 64-bit, signed integer
"fieldName" : <metric_name>
}</code>

-`name` – output name for the summed value
+`name` – output name for the summed value
`fieldName` – name of the metric column to sum over

#### `doubleSum` aggregator
@@ -84,4 +87,4 @@ All JavaScript functions must return numerical values.
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }"
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }"
"fnReset" : "function() { return 10; }"
-}</code>
+}</code>
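
For readers skimming this hunk, the three `fn*` fragments above belong to a single JavaScript aggregator spec. A minimal sketch of the complete object, with commas added for valid JSON and with `name` and `fieldNames` values that are illustrative assumptions rather than content shown in this diff:

<code>{
  "type" : "javascript",
  "name" : "sum_log_scaled",
  "fieldNames" : ["a", "b"],
  "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
  "fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
  "fnReset" : "function() { return 10; }"
}</code>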
11 changes: 7 additions & 4 deletions docs/Batch-ingestion.md
@@ -1,14 +1,17 @@
+---
+layout: default
+---
Batch Data Ingestion
====================

-There are two choices for batch data ingestion to your Druid cluster, you can use the [[Indexing service]] or you can use the `HadoopDruidIndexerMain`. This page describes how to use the `HadoopDruidIndexerMain`.
+There are two choices for batch data ingestion to your Druid cluster: you can use the [Indexing service](Indexing-service.html) or you can use the `HadoopDruidIndexerMain`. This page describes how to use the `HadoopDruidIndexerMain`.

Which should I use?
-------------------

-The [[Indexing service]] is a node that can run as part of your Druid cluster and can accomplish a number of different types of indexing tasks. Even if all you care about is batch indexing, it provides for the encapsulation of things like the Database that is used for segment metadata and other things, so that your indexing tasks do not need to include such information. Long-term, the indexing service is going to be the preferred method of ingesting data.
+The [Indexing service](Indexing-service.html) is a node that can run as part of your Druid cluster and can accomplish a number of different types of indexing tasks. Even if all you care about is batch indexing, it encapsulates details such as the database used for segment metadata, so that your indexing tasks do not need to include such information. Long-term, the indexing service is going to be the preferred method of ingesting data.

-The `HadoopDruidIndexerMain` runs hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [[Indexing service]] just yet.
+The `HadoopDruidIndexerMain` runs Hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [Indexing service](Indexing-service.html) just yet.

HadoopDruidIndexer
------------------
@@ -135,4 +138,4 @@ This is a specification of the properties that tell the job how to update metadata
|password|password for db|yes|
|segmentTable|table to use in DB|yes|

-These properties should parrot what you have configured for your [[Master]].
+These properties should parrot what you have configured for your [Master](Master.html).
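
The table above ends with the database-update properties. A sketch of how they might sit in the indexer's JSON spec, where the `updaterJobSpec` key, the `type` value, and the `connectURI` and `user` fields are assumptions about the collapsed portion of the file rather than content shown in this diff:

<code>"updaterJobSpec" : {
  "type" : "db",
  "connectURI" : "jdbc:mysql://localhost:3306/druid",
  "user" : "druid",
  "password" : "diurd",
  "segmentTable" : "prod_segments"
}</code>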
5 changes: 4 additions & 1 deletion docs/Booting-a-production-cluster.md
@@ -1,6 +1,9 @@
+---
+layout: default
+---
# Booting a Single Node Cluster #

-[[Loading Your Data]] and [[Querying Your Data]] contain recipes to boot a small druid cluster on localhost. Here we will boot a small cluster on EC2. You can checkout the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.5.51-SNAPSHOT-bin.tar.gz).
+[Loading Your Data](Loading-Your-Data.html) and [Querying Your Data](Querying-Your-Data.html) contain recipes to boot a small Druid cluster on localhost. Here we will boot a small cluster on EC2. You can check out the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.5.51-SNAPSHOT-bin.tar.gz).

The [ec2 run script](https://github.com/metamx/druid/blob/master/examples/bin/run_ec2.sh), run_ec2.sh, is located at 'examples/bin' if you have checked out the code, or at the root of the project if you've downloaded a tarball. The scripts rely on the [Amazon EC2 API Tools](http://aws.amazon.com/developertools/351), and you will need to set three environment variables:

9 changes: 6 additions & 3 deletions docs/Broker.md
@@ -1,3 +1,6 @@
+---
+layout: default
+---
Broker
======

@@ -6,9 +9,9 @@ The Broker is the node to route queries to if you want to run a distributed clus
Forwarding Queries
------------------

-Most druid queries contain an interval object that indicates a span of time for which data is requested. Likewise, Druid [[Segments]] are partitioned to contain data for some interval of time and segments are distributed across a cluster. Consider a simple datasource with 7 segments where each segment contains data for a given day of the week. Any query issued to the datasource for more than one day of data will hit more than one segment. These segments will likely be distributed across multiple nodes, and hence, the query will likely hit multiple nodes.
+Most Druid queries contain an interval object that indicates a span of time for which data is requested. Likewise, Druid [Segments](Segments.html) are partitioned to contain data for some interval of time and segments are distributed across a cluster. Consider a simple datasource with 7 segments where each segment contains data for a given day of the week. Any query issued to the datasource for more than one day of data will hit more than one segment. These segments will likely be distributed across multiple nodes, and hence, the query will likely hit multiple nodes.

-To determine which nodes to forward queries to, the Broker node first builds a view of the world from information in Zookeeper. Zookeeper maintains information about [[Compute]] and [[Realtime]] nodes and the segments they are serving. For every datasource in Zookeeper, the Broker node builds a timeline of segments and the nodes that serve them. When queries are received for a specific datasource and interval, the Broker node performs a lookup into the timeline associated with the query datasource for the query interval and retrieves the nodes that contain data for the query. The Broker node then forwards down the query to the selected nodes.
+To determine which nodes to forward queries to, the Broker node first builds a view of the world from information in Zookeeper. Zookeeper maintains information about [Compute](Compute.html) and [Realtime](Realtime.html) nodes and the segments they are serving. For every datasource in Zookeeper, the Broker node builds a timeline of segments and the nodes that serve them. When queries are received for a specific datasource and interval, the Broker node performs a lookup into the timeline associated with the query datasource for the query interval and retrieves the nodes that contain data for the query. The Broker node then forwards down the query to the selected nodes.

Caching
-------
@@ -24,4 +27,4 @@ Broker nodes can be run using the `com.metamx.druid.http.BrokerMain` class.
Configuration
-------------

-See [[Configuration]].
+See [Configuration](Configuration.html).
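
To make the forwarding description in this file concrete: a query's `intervals` field is what the Broker matches against its segment timeline. A minimal sketch of such a query, with a hypothetical datasource and aggregator that are not part of this diff:

<code>{
  "queryType" : "timeseries",
  "dataSource" : "sample_datasource",
  "granularity" : "day",
  "aggregations" : [{ "type" : "longSum", "name" : "total", "fieldName" : "value" }],
  "intervals" : ["2013-01-01/2013-01-08"]
}</code>

A week-long interval like this one would span all 7 of the day-sized segments in the example above, so the Broker would fan the query out to every node serving one of them.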
3 changes: 3 additions & 0 deletions docs/Build-from-source.md
@@ -1,3 +1,6 @@
+---
+layout: default
+---
### Clone and Build from Source

The other way to setup Druid is from source via git. To do so, run these commands:
43 changes: 23 additions & 20 deletions docs/Cluster-setup.md
@@ -1,4 +1,7 @@
-A Druid cluster consists of various node types that need to be set up depending on your use case. See our [[Design]] docs for a description of the different node types.
+---
+layout: default
+---
+A Druid cluster consists of various node types that need to be set up depending on your use case. See our [Design](Design.html) docs for a description of the different node types.

Setup Scripts
-------------
@@ -8,14 +11,14 @@ One of our community members, [housejester](https://github.com/housejester/), co
Minimum Physical Layout: Absolute Minimum
-----------------------------------------

-As a special case, the absolute minimum setup is one of the standalone examples for realtime ingestion and querying; see [[Examples]] that can easily run on one machine with one core and 1GB RAM. This layout can be set up to try some basic queries with Druid.
+As a special case, the absolute minimum setup is one of the standalone examples for realtime ingestion and querying; see [Examples](Examples.html), which can easily run on one machine with one core and 1GB RAM. This layout can be set up to try some basic queries with Druid.

Minimum Physical Layout: Experimental Testing with 4GB of RAM
-------------------------------------------------------------

This layout can be used to load some data from deep storage onto a Druid compute node for the first time. A minimal physical layout for a 1 or 2 core machine with 4GB of RAM is:

-1. node1: [[Master]] + metadata service + zookeeper + [[Compute]]
+1. node1: [Master](Master.html) + metadata service + zookeeper + [Compute](Compute.html)
2. transient nodes: indexer

This setup is only reasonable to prove that a configuration works. It would not be worthwhile to use this layout for performance measurement.
@@ -27,13 +30,13 @@ Comfortable Physical Layout: Pilot Project with Multiple Machines

A minimal physical layout not constrained by cores that demonstrates parallel querying and realtime, using AWS-EC2 “small”/m1.small (one core, with 1.7GB of RAM) or larger, no realtime, is:

-1. node1: [[Master]] (m1.small)
+1. node1: [Master](Master.html) (m1.small)
2. node2: metadata service (m1.small)
3. node3: zookeeper (m1.small)
-4. node4: [[Broker]] (m1.small or m1.medium or m1.large)
-5. node5: [[Compute]] (m1.small or m1.medium or m1.large)
-6. node6: [[Compute]] (m1.small or m1.medium or m1.large)
-7. node7: [[Realtime]] (m1.small or m1.medium or m1.large)
+4. node4: [Broker](Broker.html) (m1.small or m1.medium or m1.large)
+5. node5: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
+6. node6: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
+7. node7: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large)
8. transient nodes: indexer

This layout naturally lends itself to adding more RAM and core to Compute nodes, and to adding many more Compute nodes. Depending on the actual load, the Master, metadata server, and Zookeeper might need to use larger machines.
@@ -45,18 +48,18 @@ High Availability Physical Layout

An HA layout allows full rolling restarts and heavy volume:

-1. node1: [[Master]] (m1.small or m1.medium or m1.large)
-2. node2: [[Master]] (m1.small or m1.medium or m1.large) (backup)
+1. node1: [Master](Master.html) (m1.small or m1.medium or m1.large)
+2. node2: [Master](Master.html) (m1.small or m1.medium or m1.large) (backup)
3. node3: metadata service (c1.medium or m1.large)
4. node4: metadata service (c1.medium or m1.large) (backup)
5. node5: zookeeper (c1.medium)
6. node6: zookeeper (c1.medium)
7. node7: zookeeper (c1.medium)
-8. node8: [[Broker]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
-9. node9: [[Broker]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge) (backup)
-10. node10: [[Compute]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
-11. node11: [[Compute]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
-12. node12: [[Realtime]] (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
+8. node8: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
+9. node9: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge) (backup)
+10. node10: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
+11. node11: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
+12. node12: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
13. transient nodes: indexer

Sizing for Cores and RAM
@@ -76,7 +79,7 @@ Local disk (“ephemeral” on AWS EC2) for caching is recommended over network
Setup
-----

-Setting up a cluster is essentially just firing up all of the nodes you want with the proper [[configuration]]. One thing to be aware of is that there are a few properties in the configuration that potentially need to be set individually for each process:
+Setting up a cluster is essentially just firing up all of the nodes you want with the proper [configuration](Configuration.html). One thing to be aware of is that there are a few properties in the configuration that potentially need to be set individually for each process:

<code>
druid.server.type=historical|realtime
@@ -104,8 +107,8 @@ The following table shows the possible services and fully qualified class for ma

|service|main class|
|-------|----------|
-|[[ Realtime ]]|com.metamx.druid.realtime.RealtimeMain|
-|[[ Master ]]|com.metamx.druid.http.MasterMain|
-|[[ Broker ]]|com.metamx.druid.http.BrokerMain|
-|[[ Compute ]]|com.metamx.druid.http.ComputeMain|
+|[Realtime](Realtime.html)|com.metamx.druid.realtime.RealtimeMain|
+|[Master](Master.html)|com.metamx.druid.http.MasterMain|
+|[Broker](Broker.html)|com.metamx.druid.http.BrokerMain|
+|[Compute](Compute.html)|com.metamx.druid.http.ComputeMain|
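
The collapsed configuration block above opens with `druid.server.type`; the other per-process properties it refers to are not visible in this diff. A sketch of what one node's individual overrides might look like, where every key except `druid.server.type` is an assumed name for illustration:

<code>druid.server.type=historical
druid.host=10.0.0.5:8080
druid.port=8080
</code>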

11 changes: 7 additions & 4 deletions docs/Compute.md
@@ -1,3 +1,6 @@
+---
+layout: default
+---
Compute
=======

@@ -8,9 +11,9 @@ Loading and Serving Segments

Each compute node maintains a constant connection to Zookeeper and watches a configurable set of Zookeeper paths for new segment information. Compute nodes do not communicate directly with each other or with the master nodes but instead rely on Zookeeper for coordination.

-The [[Master]] node is responsible for assigning new segments to compute nodes. Assignment is done by creating an ephemeral Zookeeper entry under a load queue path associated with a compute node. For more information on how the master assigns segments to compute nodes, please see [[Master]].
+The [Master](Master.html) node is responsible for assigning new segments to compute nodes. Assignment is done by creating an ephemeral Zookeeper entry under a load queue path associated with a compute node. For more information on how the master assigns segments to compute nodes, please see [Master](Master.html).

-When a compute node notices a new load queue entry in its load queue path, it will first check a local disk directory (cache) for the information about segment. If no information about the segment exists in the cache, the compute node will download metadata about the new segment to serve from Zookeeper. This metadata includes specifications about where the segment is located in deep storage and about how to decompress and process the segment. For more information about segment metadata and Druid segments in general, please see [[Segments]]. Once a compute node completes processing a segment, the segment is announced in Zookeeper under a served segments path associated with the node. At this point, the segment is available for querying.
+When a compute node notices a new load queue entry in its load queue path, it will first check a local disk directory (cache) for the information about the segment. If no information about the segment exists in the cache, the compute node will download metadata about the new segment to serve from Zookeeper. This metadata includes specifications about where the segment is located in deep storage and about how to decompress and process the segment. For more information about segment metadata and Druid segments in general, please see [Segments](Segments.html). Once a compute node completes processing a segment, the segment is announced in Zookeeper under a served segments path associated with the node. At this point, the segment is available for querying.

Loading and Serving Segments From Cache
---------------------------------------
@@ -22,7 +25,7 @@ The segment cache is also leveraged when a compute node is first started. On sta
Querying Segments
-----------------

-Please see [[Querying]] for more information on querying compute nodes.
+Please see [Querying](Querying.html) for more information on querying compute nodes.

For every query that a compute node services, it will log the query and report metrics on the time taken to run the query.

@@ -34,4 +37,4 @@ Compute nodes can be run using the `com.metamx.druid.http.ComputeMain` class.
Configuration
-------------

-See [[Configuration]].
+See [Configuration](Configuration.html).
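
As a concrete illustration of the segment metadata described above (where a segment lives in deep storage, plus how to process it), a hypothetical descriptor might look like the following; every field name and value here is an assumption for illustration, not drawn from this diff:

<code>{
  "dataSource" : "sample_datasource",
  "interval" : "2013-01-01/2013-01-02",
  "version" : "2013-01-03T00:00:00.000Z",
  "loadSpec" : { "type" : "s3_zip", "bucket" : "example-bucket", "key" : "sample/index.zip" },
  "size" : 300000
}</code>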
5 changes: 4 additions & 1 deletion docs/Concepts-and-Terminology.md
@@ -1,3 +1,6 @@
+---
+layout: default
+---
Concepts and Terminology
========================

@@ -9,4 +12,4 @@ Concepts and Terminology

- **Segment:** A collection of (internal) records that are stored and processed together.
- **Shard:** A unit of partitioning data across machines. TODO: clarify; by time or other dimensions?
-- **specFile** is specification for services in JSON format; see [[Realtime]] and [[Batch-ingestion]]
+- **specFile** is a specification for services in JSON format; see [Realtime](Realtime.html) and [Batch-ingestion](Batch-ingestion.html)