Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
43 changes: 43 additions & 0 deletions docs/content/ingestion/hadoop-vs-native-batch.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
---
layout: doc_page
title: "Hadoop-based Batch Ingestion VS Native Batch Ingestion"
---

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

# Comparison of Batch Ingestion Methods

Druid basically supports three types of batch ingestion: Hadoop-based
batch ingestion, native parallel batch ingestion, and native local batch
ingestion. The below table shows what features are supported by each
ingestion method.


| |Hadoop-based ingestion|Native parallel ingestion|Native local ingestion|
|---|----------------------|-------------------------|----------------------|
| Parallel indexing | Always parallel | Parallel if firehose is splittable | Always sequential |
| Supported indexing modes | Replacing mode | Both appending and replacing modes | Both appending and replacing modes |
| External dependency | Hadoop (it internally submits Hadoop jobs) | No dependency | No dependency |
| Supported [rollup modes](http://druid.io/docs/latest/ingestion/index.html#roll-up-modes) | Perfect rollup | Best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | [Both Hash-based and range partitioning](http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification) | N/A | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
| Supported file formats | All implemented Hadoop InputFormats | Currently only text file format (CSV, TSV, JSON) | Currently only text file format (CSV, TSV, JSON) |
| Saving parse exceptions in ingestion report | Currently not supported | Currently not supported | Supported |
| Custom segment version | Supported, but this is NOT recommended | N/A | N/A |
4 changes: 3 additions & 1 deletion docs/content/ingestion/hadoop.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,9 @@ title: "Hadoop-based Batch Ingestion"
# Hadoop-based Batch Ingestion

Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running
instance of a Druid [Overlord](../design/overlord.html).
instance of a Druid [Overlord](../design/overlord.html).

Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.

## Command Line Hadoop Indexer

Expand Down
2 changes: 2 additions & 0 deletions docs/content/ingestion/native_tasks.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@ Druid currently has two types of native batch indexing tasks, `index_parallel` w
in parallel on multiple MiddleManager nodes, and `index` which will run a single indexing task locally on a single
MiddleManager.

Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.

Parallel Index Task
--------------------------------

Expand Down
4 changes: 4 additions & 0 deletions docs/content/ingestion/tasks.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,10 @@ See [batch ingestion](../ingestion/hadoop.html).
Druid provides a native index task which doesn't need any dependencies on other systems.
See [native index tasks](./native_tasks.html) for more details.

<div class="note info">
Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.
</div>

### Kafka Indexing Tasks

Kafka Indexing tasks are automatically created by a Kafka Supervisor and are responsible for pulling data from Kafka streams. These tasks are not meant to be created/submitted directly by users. See [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html) for more details.
Expand Down