Add doc for Hadoop-based ingestion vs Native batch ingestion #7044

jihoonson merged 3 commits into apache:master from
Conversation
Thanks @Dylan1312 for the review. I added more links.
| Supported [rollup modes](http://druid.io/docs/latest/ingestion/index.html#roll-up-modes) | Perfect rollup | Best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | [Both hash-based and range partitioning](http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification) | N/A | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
| Supported file formats | All implemented Hadoop InputFormats | Currently only text file formats (CSV, TSV, JSON) | Currently only text file formats (CSV, TSV, JSON) |
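For context, the `forceGuaranteedRollup` flag in the table sits in the native index task's `tuningConfig`. A trimmed spec fragment as a sketch (only the relevant fields shown; `numShards` is one way the hash-partition count gets fixed up front, and field names should be checked against the docs for the Druid version you run):

```json
{
  "type": "index",
  "spec": {
    "tuningConfig": {
      "type": "index",
      "forceGuaranteedRollup": true,
      "numShards": 4
    }
  }
}
```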
This is not 100% true (as I'm sure you know). The native tasks support arbitrary InputRowParsers if you are willing to write your own FirehoseFactory, and it does look like there are even a couple of firehose factories in this repo that don't require StringInputRowParser (e.g. the RabbitMQ and RocketMQ firehoses).
Ah, that's true. In the current implementation, the above statement is true only when the firehoseFactory is a FiniteFirehoseFactory. But I guess no one is using native tasks with an infinite FirehoseFactory? Maybe it's better to restrict native tasks to support only FiniteFirehoseFactory.
Anyway, it looks worth updating the statement to something like "Currently text file formats (CSV, TSV, JSON) and any custom implementation".
Nah, it's just AbstractTextFilesFirehoseFactory where the dependency on text shows up. I've got a

```kotlin
class GCSLengthDelimitedByteArrayFirehoseFactory<T : InputRowParser<*>>(
    @JacksonInject private val gcsClient: GCSClient,
    @JsonProperty("bucket") val bucket: String,
    @JsonProperty("prefix") val prefix: String,
    @JsonProperty("interval") val interval: Interval
) : FiniteFirehoseFactory<T, String> {
```

(Kotlin) working just fine right here :)
Oh yes. Probably there was a misunderstanding. I mean, every implementation of FiniteFirehoseFactory in the Druid repo extends AbstractTextFilesFirehoseFactory. A quick history of FirehoseFactory and FiniteFirehoseFactory: FirehoseFactory was first designed for stream ingestion and was later extended to be used in indexTask. Finally, FiniteFirehoseFactory was added along with the parallel index task. However, FiniteFirehoseFactory is meant for any type of batch indexing, not only parallel indexing.
I guess GCSLengthDelimitedByteArrayFirehoseFactory is a custom implementation? Then yes, it would work if it's designed for batch ingestion.
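Since the thread leans on the FiniteFirehoseFactory split model, here's a tiny self-contained sketch of the idea: a bounded source enumerates its splits, and each split yields a sub-source that a single task can read. The names (`FiniteSource`, `InMemorySource`) are made up for illustration and are not Druid's actual API.

```java
import java.util.List;

public class SplitDemo {
    // Illustrative analog of the FiniteFirehoseFactory contract: the input is
    // finite, so it can be enumerated as splits up front.
    interface FiniteSource<S> {
        List<S> getSplits();              // enumerate a bounded set of input splits
        FiniteSource<S> withSplit(S s);   // a source bound to exactly one split
    }

    // A toy implementation where each "split" is one in-memory batch of lines.
    static class InMemorySource implements FiniteSource<List<String>> {
        private final List<List<String>> batches;

        InMemorySource(List<List<String>> batches) { this.batches = batches; }

        @Override public List<List<String>> getSplits() { return batches; }

        @Override public FiniteSource<List<String>> withSplit(List<String> split) {
            return new InMemorySource(List.of(split));
        }
    }

    public static void main(String[] args) {
        FiniteSource<List<String>> src =
            new InMemorySource(List.of(List.of("a", "b"), List.of("c")));
        // A supervisor could hand each split to one subtask.
        for (List<String> split : src.getSplits()) {
            System.out.println(src.withSplit(split).getSplits());
        }
    }
}
```

This is what makes an infinite (streaming) firehose awkward for batch tasks: there is no finite split enumeration to parallelize over.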
Yep. Our main ingestion pipeline ingests protobufs from Kafka, using custom InputRowParsers (different IRPs per data source; multiple data sources parse the same Kafka topics with different IRPs, and some data sources generate many rows from a single protobuf/Kafka message). We back up Kafka to GCS in a simple packed binary format using Secor and are batch ingesting from that with the custom firehose. Or at least we're trying to; it's almost working in prod :)
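For readers following along, a length-delimited packed byte stream of the kind a Kafka-to-GCS backup might use could be read like this: each record is a 4-byte big-endian length prefix followed by that many payload bytes. The framing here is an assumption for illustration, not Secor's actual on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LengthDelimitedReader {
    // Decode a packed buffer of [int32 length][payload] records, the kind of
    // stream a byte-array firehose would hand to an InputRowParser.
    public static List<byte[]> readRecords(byte[] packed) throws IOException {
        List<byte[]> records = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
        while (in.available() >= 4) {
            int len = in.readInt();        // 4-byte big-endian length prefix
            byte[] payload = new byte[len];
            in.readFully(payload);         // the record body (e.g. one protobuf)
            records.add(payload);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Two packed records: "ab" (length 2) and "cde" (length 3).
        byte[] packed = {0, 0, 0, 2, 'a', 'b', 0, 0, 0, 3, 'c', 'd', 'e'};
        for (byte[] record : readRecords(packed)) {
            System.out.println(new String(record));
        }
    }
}
```

Each decoded record would then be passed to the data source's parser, which may emit one row or many.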
Cool, it sounds nice.
BTW, I raised #7071. Please take a look when you're available.
…7044)

* Add doc for Hadoop-based ingestion vs Native batch ingestion
* add links
* add links
Fixes #5918.