Add doc for Hadoop-based ingestion vs Native batch ingestion #7044

Merged

jihoonson merged 3 commits into apache:master from jihoonson:hadoop-vs-native on Feb 13, 2019

Add doc for Hadoop-based ingestion vs Native batch ingestion#7044
jihoonson merged 3 commits intoapache:masterfrom
jihoonson:hadoop-vs-native

Conversation

@jihoonson (Contributor)

Fixes #5918.

@jihoonson (Contributor, Author)

Thanks @Dylan1312 for the review. I added more links.

@jihoonson merged commit 9703084 into apache:master on Feb 13, 2019

| | Hadoop-based ingestion | Parallel index task | Simple index task |
|---|---|---|---|
| Supported [rollup modes](http://druid.io/docs/latest/ingestion/index.html#roll-up-modes) | Perfect rollup | Best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | [Both hash-based and range partitioning](http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification) | N/A | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
| Supported file formats | All implemented Hadoop InputFormats | Currently only text file formats (CSV, TSV, JSON) | Currently only text file formats (CSV, TSV, JSON) |
Contributor

This is not 100% true (as I'm sure you know). The native tasks support arbitrary InputRowParsers if you are willing to write your own FirehoseFactory, and it looks like there are even a couple of firehose factories in this repo that don't require StringInputRowParser (e.g., the RabbitMQ and RocketMQ firehoses).

Contributor (Author)

Ah, that's true. In the current implementation, the above statement holds only when the firehoseFactory is a FiniteFirehoseFactory. But I guess no one is using native tasks with an infinite FirehoseFactory? Maybe it's better to restrict native tasks to supporting only FiniteFirehoseFactory.
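
To make the finite-vs-infinite distinction concrete: a batch task can only finish if its firehose eventually reports that no more rows are coming, while a streaming firehose may block forever waiting for new data. A minimal sketch using simplified stand-in interfaces (not Druid's actual Firehose API, which carries additional methods such as parsers and commit hooks):

```kotlin
import java.util.concurrent.BlockingQueue

// Simplified stand-in; illustration only, not Druid's real interface.
interface Firehose : AutoCloseable {
    fun hasMore(): Boolean
    fun nextRow(): Map<String, Any>
}

// Finite: iterates a fixed input set and then reports exhaustion,
// which is what lets a batch task know it is done.
class ListFirehose(rows: List<Map<String, Any>>) : Firehose {
    private val iter = rows.iterator()
    override fun hasMore() = iter.hasNext()
    override fun nextRow() = iter.next()
    override fun close() {}
}

// Infinite: hasMore() never returns false, so a batch task driving this
// would never terminate -- hence the idea of restricting native batch
// tasks to finite factories.
class QueueFirehose(private val queue: BlockingQueue<Map<String, Any>>) : Firehose {
    override fun hasMore() = true
    override fun nextRow(): Map<String, Any> = queue.take()
    override fun close() {}
}
```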

Contributor (Author)

Anyway, it looks worth updating the statement to something like "Currently text file formats (CSV, TSV, JSON) and any custom implementation".

Contributor

Nah, it's just AbstractTextFilesFirehoseFactory where the dependency on text shows up. I've got a

```kotlin
class GCSLengthDelimitedByteArrayFirehoseFactory<T : InputRowParser<*>>(
    @JacksonInject private val gcsClient: GCSClient,
    @JsonProperty("bucket") val bucket: String,
    @JsonProperty("prefix") val prefix: String,
    @JsonProperty("interval") val interval: Interval
) : FiniteFirehoseFactory<T, String> {
    // ...
}
```

(kotlin) working just fine right here :)

Contributor (Author)

Oh yes, probably there was a misunderstanding. I meant that every existing implementation of FiniteFirehoseFactory extends AbstractTextFilesFirehoseFactory. A bit of history behind FirehoseFactory and FiniteFirehoseFactory: FirehoseFactory was first designed for stream ingestion and was later extended for use in the index task. FiniteFirehoseFactory was then added along with the parallel index task. However, FiniteFirehoseFactory is meant for any type of batch indexing, not only for parallel indexing.

I guess GCSLengthDelimitedByteArrayFirehoseFactory is a custom implementation? Then yes, it would work if it's designed for batch ingestion.
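
For context, what FiniteFirehoseFactory adds on top of a plain FirehoseFactory is, roughly, a way to enumerate independent input splits so the parallel index task can fan them out to subtasks. A hand-wavy sketch with hypothetical, simplified types (the real interface differs in signatures and generics):

```kotlin
// Hypothetical, simplified version of the split contract, illustration only.
data class InputSplit<S>(val value: S)

interface FiniteFirehoseFactory<S> {
    // Enumerate the independent units of work (e.g. one object/file each).
    fun getSplits(): List<InputSplit<S>>

    // Return a copy of this factory bound to a single split, so each
    // subtask of the parallel index task reads only its own share.
    fun withSplit(split: InputSplit<S>): FiniteFirehoseFactory<S>
}

// Example: a factory whose splits are object paths under a prefix.
class ObjectStoreFirehoseFactory(
    private val paths: List<String>
) : FiniteFirehoseFactory<String> {
    override fun getSplits() = paths.map { InputSplit(it) }
    override fun withSplit(split: InputSplit<String>) =
        ObjectStoreFirehoseFactory(listOf(split.value))
}
```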

Contributor

Yep. Our main ingestion pipeline ingests protobufs from Kafka using custom InputRowParsers (with different IRPs for each data source; multiple data sources parse the same Kafka topics with different IRPs, and some data sources generate many rows from a single protobuf/Kafka message). We back up Kafka to GCS in a simple packed binary format using Secor and are batch ingesting from that with the custom firehose. Or at least we're trying to; it's almost working in prod :)
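
The "many rows from a single protobuf/Kafka message" pattern works because a row parser can return a batch of rows per input message. A simplified, hypothetical sketch of that fan-out shape (not Druid's actual InputRowParser API):

```kotlin
// Simplified stand-in types; Druid's real InputRowParser and InputRow differ.
data class Row(val timestampMillis: Long, val fields: Map<String, Any>)

interface RowParser<T> {
    // One input message may expand into several rows.
    fun parseBatch(message: T): List<Row>
}

// Example input: a message with one timestamp and a repeated payload,
// standing in for a protobuf with repeated fields.
data class Message(val timestampMillis: Long, val events: List<Map<String, Any>>)

// Fan-out parser: each repeated event becomes its own row, so a single
// Kafka message can yield many datasource rows.
class FanOutParser : RowParser<Message> {
    override fun parseBatch(message: Message): List<Row> =
        message.events.map { Row(message.timestampMillis, it) }
}
```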

Contributor (Author)

Cool, that sounds nice.

BTW, I raised #7071. Please take a look when you're available.

@jon-wei added this to the 0.14.0 milestone on Feb 13, 2019
jihoonson added a commit to jihoonson/druid that referenced this pull request Feb 20, 2019
…7044)

* Add doc for Hadoop-based ingestion vs Native batch ingestion

* add links

* add links
fjy pushed a commit that referenced this pull request Feb 20, 2019
…7103)

* Add doc for Hadoop-based ingestion vs Native batch ingestion

* add links

* add links