Add doc for Hadoop-based ingestion vs Native batch ingestion #7044

Merged

jihoonson merged 3 commits into apache:master from jihoonson:hadoop-vs-native on Feb 13, 2019

Add doc for Hadoop-based ingestion vs Native batch ingestion#7044
jihoonson merged 3 commits intoapache:masterfrom
jihoonson:hadoop-vs-native

Conversation

@jihoonson (Contributor)

Fixes #5918.

@jihoonson (Contributor, Author)

Thanks @Dylan1312 for the review. I added more links.

@jihoonson merged commit 9703084 into apache:master on Feb 13, 2019

| | Hadoop-based ingestion | Parallel index task | Simple index task |
|---|---|---|---|
| Supported [rollup modes](http://druid.io/docs/latest/ingestion/index.html#roll-up-modes) | Perfect rollup | Best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | [Both hash-based and range partitioning](http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification) | N/A | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
| Supported file formats | All implemented Hadoop InputFormats | Currently only text file formats (CSV, TSV, JSON) | Currently only text file formats (CSV, TSV, JSON) |
Contributor

This is not 100% true (as I'm sure you know). The native tasks support arbitrary InputRowParsers if you are willing to write your own FirehoseFactory, and it looks like there are even a couple of firehose factories in this repo that don't require StringInputRowParser (e.g., the RabbitMQ and RocketMQ firehoses).

Contributor (Author)

Ah, that's true. In the current implementation, the above statement holds only when the firehoseFactory is a FiniteFirehoseFactory. But I guess no one is using native tasks with an infinite FirehoseFactory? Maybe it's better to restrict native tasks to supporting only FiniteFirehoseFactory.
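
To make the finite-vs-infinite distinction concrete: a batch task can only finish if its firehose eventually reports that no more rows are coming, while a streaming firehose may block forever waiting for new data. A minimal sketch using simplified stand-in interfaces (not Druid's actual Firehose API, which carries additional methods such as parsers and commit hooks):

```kotlin
import java.util.concurrent.BlockingQueue

// Simplified stand-in; illustration only, not Druid's real interface.
interface Firehose : AutoCloseable {
    fun hasMore(): Boolean
    fun nextRow(): Map<String, Any>
}

// Finite: iterates a fixed input set and then reports exhaustion,
// which is what lets a batch task know it is done.
class ListFirehose(rows: List<Map<String, Any>>) : Firehose {
    private val iter = rows.iterator()
    override fun hasMore() = iter.hasNext()
    override fun nextRow() = iter.next()
    override fun close() {}
}

// Infinite: hasMore() never returns false, so a batch task driving this
// would never terminate -- hence the idea of restricting native batch
// tasks to finite factories.
class QueueFirehose(private val queue: BlockingQueue<Map<String, Any>>) : Firehose {
    override fun hasMore() = true
    override fun nextRow(): Map<String, Any> = queue.take()
    override fun close() {}
}
```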

Contributor (Author)

Anyway, it looks worth updating the statement to something like "Currently text file formats (CSV, TSV, JSON) and any custom implementation".

Contributor

Nah, it's just AbstractTextFilesFirehoseFactory where the dependency on text shows up. I've got a

```kotlin
class GCSLengthDelimitedByteArrayFirehoseFactory<T : InputRowParser<*>>(
    @JacksonInject private val gcsClient: GCSClient,
    @JsonProperty("bucket") val bucket: String,
    @JsonProperty("prefix") val prefix: String,
    @JsonProperty("interval") val interval: Interval
) : FiniteFirehoseFactory<T, String> {
    // ...
}
```

(kotlin) working just fine right here :)

Contributor (Author)

Oh yes, probably there was a misunderstanding. I meant that every existing implementation of FiniteFirehoseFactory extends AbstractTextFilesFirehoseFactory. A bit of history behind FirehoseFactory and FiniteFirehoseFactory: FirehoseFactory was first designed for stream ingestion and was later extended for use in the index task. FiniteFirehoseFactory was then added along with the parallel index task. However, FiniteFirehoseFactory is meant for any type of batch indexing, not only for parallel indexing.

I guess GCSLengthDelimitedByteArrayFirehoseFactory is a custom implementation? Then yes, it would work if it's designed for batch ingestion.
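
For context, what FiniteFirehoseFactory adds on top of a plain FirehoseFactory is, roughly, a way to enumerate independent input splits so the parallel index task can fan them out to subtasks. A hand-wavy sketch with hypothetical, simplified types (the real interface differs in signatures and generics):

```kotlin
// Hypothetical, simplified version of the split contract, illustration only.
data class InputSplit<S>(val value: S)

interface FiniteFirehoseFactory<S> {
    // Enumerate the independent units of work (e.g. one object/file each).
    fun getSplits(): List<InputSplit<S>>

    // Return a copy of this factory bound to a single split, so each
    // subtask of the parallel index task reads only its own share.
    fun withSplit(split: InputSplit<S>): FiniteFirehoseFactory<S>
}

// Example: a factory whose splits are object paths under a prefix.
class ObjectStoreFirehoseFactory(
    private val paths: List<String>
) : FiniteFirehoseFactory<String> {
    override fun getSplits() = paths.map { InputSplit(it) }
    override fun withSplit(split: InputSplit<String>) =
        ObjectStoreFirehoseFactory(listOf(split.value))
}
```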

Contributor

Yep. Our main ingestion pipeline ingests protobufs from Kafka using custom InputRowParsers (with different IRPs for each data source; multiple data sources parse the same Kafka topics with different IRPs, and some data sources generate many rows from a single protobuf/Kafka message). We back up Kafka to GCS in a simple packed binary format using Secor and are batch ingesting from that with the custom firehose. Or at least we're trying to; it's almost working in prod :)
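
The "many rows from a single protobuf/Kafka message" pattern works because a row parser can return a batch of rows per input message. A simplified, hypothetical sketch of that fan-out shape (not Druid's actual InputRowParser API):

```kotlin
// Simplified stand-in types; Druid's real InputRowParser and InputRow differ.
data class Row(val timestampMillis: Long, val fields: Map<String, Any>)

interface RowParser<T> {
    // One input message may expand into several rows.
    fun parseBatch(message: T): List<Row>
}

// Example input: a message with one timestamp and a repeated payload,
// standing in for a protobuf with repeated fields.
data class Message(val timestampMillis: Long, val events: List<Map<String, Any>>)

// Fan-out parser: each repeated event becomes its own row, so a single
// Kafka message can yield many datasource rows.
class FanOutParser : RowParser<Message> {
    override fun parseBatch(message: Message): List<Row> =
        message.events.map { Row(message.timestampMillis, it) }
}
```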

Contributor (Author)

Cool, that sounds nice.

BTW, I raised #7071. Please take a look when you're available.

@jon-wei added this to the 0.14.0 milestone on Feb 13, 2019
jihoonson added a commit to jihoonson/druid that referenced this pull request Feb 20, 2019
…7044)

* Add doc for Hadoop-based ingestion vs Native batch ingestion

* add links

* add links
fjy pushed a commit that referenced this pull request Feb 20, 2019
…7103)

* Add doc for Hadoop-based ingestion vs Native batch ingestion

* add links

* add links