Merged
43 changes: 43 additions & 0 deletions docs/content/ingestion/hadoop-vs-native-batch.md
@@ -0,0 +1,43 @@
---
layout: doc_page
title: "Hadoop-based Batch Ingestion VS Native Batch Ingestion"
---

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

# Comparison of Batch Ingestion Methods

Druid supports three types of batch ingestion: Hadoop-based
batch ingestion, native parallel batch ingestion, and native local batch
ingestion. The table below shows which features each ingestion method
supports.


| |Hadoop-based ingestion|Native parallel ingestion|Native local ingestion|
|---|----------------------|-------------------------|----------------------|
| Parallel indexing | Always parallel | Parallel if firehose is splittable | Always sequential |
| Supported indexing modes | Replacing mode | Both appending and replacing modes | Both appending and replacing modes |
| External dependency | Hadoop (it internally submits Hadoop jobs) | No dependency | No dependency |
| Supported [rollup modes](http://druid.io/docs/latest/ingestion/index.html#roll-up-modes) | Perfect rollup | Best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | [Both Hash-based and range partitioning](http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification) | N/A | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
| Supported file formats | All implemented Hadoop InputFormats | Currently only text file format (CSV, TSV, JSON) | Currently only text file format (CSV, TSV, JSON) |
Contributor:
This is not 100% true (as I'm sure you know). The native tasks support arbitrary InputRowParsers if you are willing to write your own FirehoseFactory, and it does look like there are even a couple firehose factories in this repo that don't require StringInputRowParser (eg the Rabbit and Rocket MQ firehoses).

Contributor Author:

Ah, that's true. In the current implementation, the statement above holds only when the firehoseFactory is a FiniteFirehoseFactory. But I guess no one is using native tasks with an infinite FirehoseFactory? Maybe it would be better to restrict native tasks to supporting only FiniteFirehoseFactory.

Contributor Author:

Anyway, it looks worth updating the statement to something like "Currently text file formats (CSV, TSV, JSON) and any custom implementation".

Contributor:

Nah, it's just AbstractTextFilesFirehoseFactory where the dependency on text shows up. I've got a

```kotlin
class GCSLengthDelimitedByteArrayFirehoseFactory<T : InputRowParser<*>>(
    @JacksonInject private val gcsClient: GCSClient,
    @JsonProperty("bucket") val bucket: String,
    @JsonProperty("prefix") val prefix: String,
    @JsonProperty("interval") val interval: Interval
) : FiniteFirehoseFactory<T, String> {
    // ...
}
```

(Kotlin) working just fine right here :)

Contributor Author:

Oh yes, probably there was a misunderstanding. I meant that every existing implementation of FiniteFirehoseFactory extends AbstractTextFilesFirehoseFactory. A brief history of FirehoseFactory and FiniteFirehoseFactory: FirehoseFactory was originally designed for stream ingestion and was later extended for use in indexTask. FiniteFirehoseFactory was then added along with the parallel index task, but it is intended for any type of batch indexing, not only parallel indexing.

I guess GCSLengthDelimitedByteArrayFirehoseFactory is a custom implementation? Then yes, it would work as long as it's designed for batch ingestion.
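The distinction under discussion, that a task can only be parallelized when its firehose can enumerate a finite set of splits, can be illustrated with a generic sketch. The interface and class names below are illustrative stand-ins for this pattern, not Druid's actual FiniteFirehoseFactory API:

```java
import java.util.List;
import java.util.stream.Stream;

// Generic illustration of the "splittable factory" pattern discussed above.
// SplittableFactory and FileListFactory are illustrative stand-ins, not
// Druid's actual FiniteFirehoseFactory interface.
interface SplittableFactory<S> {
    // Enumerate the independent units of work (e.g. input files).
    Stream<S> getSplits();

    // Return a factory restricted to a single split, suitable for one sub-task.
    SplittableFactory<S> withSplit(S split);
}

class FileListFactory implements SplittableFactory<String> {
    private final List<String> files;

    FileListFactory(List<String> files) {
        this.files = files;
    }

    @Override
    public Stream<String> getSplits() {
        return files.stream();
    }

    @Override
    public SplittableFactory<String> withSplit(String file) {
        return new FileListFactory(List.of(file));
    }
}

public class SplitDemo {
    public static void main(String[] args) {
        SplittableFactory<String> factory =
            new FileListFactory(List.of("a.json", "b.json", "c.json"));
        // A supervisor task would assign each split to its own sub-task;
        // an infinite (stream-style) factory has no such enumeration, which is
        // why only finite factories fit the batch-parallel model.
        System.out.println(factory.getSplits().count()); // prints 3
    }
}
```

The key point is that `getSplits` makes the total amount of work enumerable up front; a stream-oriented factory cannot offer that, which is why restricting native batch tasks to finite factories was suggested above.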

Contributor:

Yep. Our main ingestion pipeline ingests protobufs from Kafka, using custom InputRowParsers (a different IRP for each data source; multiple data sources parse the same Kafka topics with different IRPs, and some data sources generate many rows from a single protobuf/Kafka message). We back up Kafka to GCS in a simple packed binary format using Secor and batch-ingest from that with the custom firehose. Or at least we're trying to; it's almost working in prod :)

Contributor Author:

Cool, that sounds nice.

BTW, I raised #7071. Please take a look when you're available.

| Saving parse exceptions in ingestion report | Currently not supported | Currently not supported | Supported |
| Custom segment version | Supported, but this is NOT recommended | N/A | N/A |
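To make the "appending and replacing modes" row above concrete, here is a minimal sketch of a native parallel task's `ioConfig` requesting append mode. Field names follow the native-task docs of this era and may differ in later versions; the path and filter are placeholders:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "*.json"
      },
      "appendToExisting": true
    }
  }
}
```

Leaving `appendToExisting` at its default of `false` gives replacing behavior instead, matching the "Supported indexing modes" row for native tasks.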
4 changes: 3 additions & 1 deletion docs/content/ingestion/hadoop.md
@@ -25,7 +25,9 @@ title: "Hadoop-based Batch Ingestion"
# Hadoop-based Batch Ingestion

Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running
instance of a Druid [Overlord](../design/overlord.html).

Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.

## Command Line Hadoop Indexer

2 changes: 2 additions & 0 deletions docs/content/ingestion/native_tasks.md
@@ -28,6 +28,8 @@
Druid currently has two types of native batch indexing tasks, `index_parallel` which will run
in parallel on multiple MiddleManager nodes, and `index` which will run a single indexing task locally on a single
MiddleManager.

Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.

Parallel Index Task
--------------------------------

4 changes: 4 additions & 0 deletions docs/content/ingestion/tasks.md
@@ -41,6 +41,10 @@
See [batch ingestion](../ingestion/hadoop.html).
Druid provides a native index task which doesn't need any dependencies on other systems.
See [native index tasks](./native_tasks.html) for more details.

<div class="note info">
Please check [Hadoop-based Batch Ingestion VS Native Batch Ingestion](./hadoop-vs-native-batch.html) for differences between native batch ingestion and Hadoop-based ingestion.
</div>

### Kafka Indexing Tasks

Kafka Indexing tasks are automatically created by a Kafka Supervisor and are responsible for pulling data from Kafka streams. These tasks are not meant to be created/submitted directly by users. See [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html) for more details.