From 5a93996a17998d4d561961e6703a779a392be247 Mon Sep 17 00:00:00 2001
From: Jihoon Son
Date: Wed, 13 Feb 2019 11:23:08 -0800
Subject: [PATCH] Add doc for Hadoop-based ingestion vs Native batch ingestion (#7044)

* Add doc for Hadoop-based ingestion vs Native batch ingestion

* add links

* add links
---
 .../ingestion/hadoop-vs-native-batch.md       | 43 +++++++++++++++++++
 docs/content/ingestion/hadoop.md              |  4 +-
 docs/content/ingestion/native_tasks.md        |  2 +
 docs/content/ingestion/tasks.md               |  4 ++
 4 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100644 docs/content/ingestion/hadoop-vs-native-batch.md

diff --git a/docs/content/ingestion/hadoop-vs-native-batch.md b/docs/content/ingestion/hadoop-vs-native-batch.md
new file mode 100644
index 000000000000..ce2c97e603b0
--- /dev/null
+++ b/docs/content/ingestion/hadoop-vs-native-batch.md
@@ -0,0 +1,43 @@
+---
+layout: doc_page
+title: "Hadoop-based Batch Ingestion VS Native Batch Ingestion"
+---
+
+
+
+# Comparison of Batch Ingestion Methods
+
+Druid basically supports three types of batch ingestion: Hadoop-based
+batch ingestion, native parallel batch ingestion, and native local batch
+ingestion. The below table shows what features are supported by each
+ingestion method.
+
+
+| |Hadoop-based ingestion|Native parallel ingestion|Native local ingestion|
+|---|----------------------|-------------------------|----------------------|
+| Parallel indexing | Always parallel | Parallel if firehose is splittable | Always sequential |
+| Supported indexing modes | Replacing mode | Both appending and replacing modes | Both appending and replacing modes |
+| External dependency | Hadoop (it internally submits Hadoop jobs) | No dependency | No dependency |
+| Supported [rollup modes](http://druid.io/docs/latest/ingestion/index.html#roll-up-modes) | Perfect rollup | Best-effort rollup | Both perfect and best-effort rollup |
+| Supported partitioning methods | [Both Hash-based and range partitioning](http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification) | N/A | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
+| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
+| Supported file formats | All implemented Hadoop InputFormats | Currently only text file format (CSV, TSV, JSON) | Currently only text file format (CSV, TSV, JSON) |
+| Saving parse exceptions in ingestion report | Currently not supported | Currently not supported | Supported |
+| Custom segment version | Supported, but this is NOT recommended | N/A | N/A |
diff --git a/docs/content/ingestion/hadoop.md b/docs/content/ingestion/hadoop.md
index 4f8174c40a95..c824fd0809ca 100644
--- a/docs/content/ingestion/hadoop.md
+++ b/docs/content/ingestion/hadoop.md
@@ -25,7 +25,9 @@ title: "Hadoop-based Batch Ingestion"
 # Hadoop-based Batch Ingestion
 
 Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running
-instance of a Druid [Overlord](../design/overlord.html). 
+instance of a Druid [Overlord](../design/overlord.html).
+
+Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.
 
 ## Command Line Hadoop Indexer
 
diff --git a/docs/content/ingestion/native_tasks.md b/docs/content/ingestion/native_tasks.md
index e5b2e7d28710..b9657d15c8d4 100644
--- a/docs/content/ingestion/native_tasks.md
+++ b/docs/content/ingestion/native_tasks.md
@@ -28,6 +28,8 @@ Druid currently has two types of native batch indexing tasks, `index_parallel` w
 in parallel on multiple MiddleManager nodes, and `index` which will run a single indexing task locally on a single
 MiddleManager.
 
+Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.
+
 Parallel Index Task
 --------------------------------
 
diff --git a/docs/content/ingestion/tasks.md b/docs/content/ingestion/tasks.md
index 41f7b52444b1..4653d6ba2ed7 100644
--- a/docs/content/ingestion/tasks.md
+++ b/docs/content/ingestion/tasks.md
@@ -41,6 +41,10 @@ See [batch ingestion](../ingestion/hadoop.html).
 
 Druid provides a native index task which doesn't need any dependencies on other systems.
 See [native index tasks](./native_tasks.html) for more details.
+
+Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.
+
+
 ### Kafka Indexing Tasks
 
 Kafka Indexing tasks are automatically created by a Kafka Supervisor and are responsible for pulling data from Kafka streams. These tasks are not meant to be created/submitted directly by users. See [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html) for more details.