Doc update for the new input source and the new input format #9171

jon-wei merged 8 commits into apache:master
Conversation
- The input source and input format are promoted in all docs under docs/ingestion - All input sources including core extension ones are located in docs/ingestion/native-batch.md - All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md - New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md
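For illustration, a minimal `index_parallel` spec using the promoted `inputSource` and `inputFormat` looks roughly like this (the data source name, paths, and columns below are hypothetical, not taken from the docs):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "language"] },
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "none" }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/wikipedia", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "index_parallel", "partitionsSpec": { "type": "dynamic" } }
  }
}
```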
@maytasm3 thanks for taking a look. I added the FS impl for GS. BTW, you could leave your comments on the PR, not on the commit, next time. Technically, reviewing a commit in a repository other than the Druid repo is not part of our code review process. Also, GitHub will properly show your comments at the correct lines in the "Files changed" tab, which makes the review easier.
| You can also use AWS S3 or Google Cloud Storage as the deep storage via HDFS. |
| #### Configuration for AWS S3 |
Do I need to add the s3 extension for this support or is it bundled with the hdfs extension somehow?
Sorry, please ignore; I see that the hadoop-aws module needs to be added, as mentioned below.
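For anyone following along, a rough sketch of what this setup looks like (the bucket name and path are hypothetical, and this assumes the `hadoop-aws` module is on the classpath with `fs.s3a.access.key` and `fs.s3a.secret.key` set in `core-site.xml`):

```properties
# Deep storage via the druid-hdfs-storage extension, pointed at S3
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://my-bucket/druid/segments
```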
| |prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2| |
| |fetchTimeout|Timeout for fetching each file.|60000| |
| |maxFetchRetry|Maximum number of retries for fetching each file.|3| |
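For reference, these properties sit alongside the other firehose fields, roughly like this (the URI is hypothetical; 536870912 bytes is half of the default 1 GiB `maxFetchCapacityBytes`):

```json
"firehose": {
  "type": "static-s3",
  "uris": ["s3://my-bucket/data/file1.json"],
  "prefetchTriggerBytes": 536870912,
  "fetchTimeout": 60000,
  "maxFetchRetry": 3
}
```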
| Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2. |
Is this still accurate? Have we done more recent tests?
It was tested before I started working on Druid, and I don't know what the test coverage was. There are no more recent tests that I'm aware of.
| ### Native batch ingestion |
| The [HDFS input source](../../ingestion/native-batch.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task) |
| to read files directly from HDFS storage. However, we highly recommend using a proper |
What type of input source should I use instead of the hdfs input source? Why is this beneficial?
It depends on the type of your cloud storage. The benefit is that it's simpler to use, without the extra setup needed to read from cloud storage via the HDFS library, which is basically the same as the steps described above. We currently support only S3 and Google Cloud Storage input sources, so if you want to read from something else, such as Azure, you may want to use the HDFS input source. But I don't think we have tested this functionality very well, and I also don't know how to set it up properly.
Added this to the doc.
| [Input Source](../../ingestion/native-batch.md#input-sources) instead to read objects from Cloud storage. |
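To make the comparison concrete, reading the same hypothetical dataset with each input source looks roughly like this via HDFS:

```json
"inputSource": { "type": "hdfs", "paths": "hdfs://namenode:8020/datasets/wikipedia/*" }
```

versus the native cloud input source:

```json
"inputSource": { "type": "s3", "prefixes": ["s3://my-bucket/datasets/wikipedia/"] }
```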
| ### Hadoop-based ingestion |
Is one of these ingestion methods recommended over the other? How do I decide which one to use?
You mean between the native batch ingestion and the Hadoop-based one? It's explained at https://github.com/apache/druid/pull/9171/files#diff-3ae520a063215c87a2a6c144eeb0bfc0R74-R80.
suneet-s left a comment
LGTM - thanks so much for re-writing so much of the docs! I have a follow-up change coming, so I can address the comments if you want to merge as is.
| "type": "kafka", |
| "dataSchema": { |
| "dataSource": "metrics-kafka", |
| "parser": { |
I didn't update the kafka tutorial to use this spec. I can follow up in a separate patch
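For reference, a supervisor spec using the new `inputFormat` instead of the parser would look roughly like this (the topic, columns, and broker address are hypothetical):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-kafka",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["host", "service"] }
  },
  "ioConfig": {
    "topic": "metrics",
    "inputFormat": { "type": "json" },
    "consumerProperties": { "bootstrap.servers": "localhost:9092" }
  }
}
```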
| "ansi-regex": { |
| "version": "2.1.1", |
| "bundled": true, |
| "dev": true, |
just curious why all of these were marked as optional before, but not needed any more
Oops, this is not supposed to be added. Reverted all changes in this file.
| ``` |
| { |
| "type": "index", |
| "type": "parallel_index", |
this should be `index_parallel` - same comment on lines 299, 332, 349. I have a doc change coming up, so I can fix it in the next patch as well.
Oops, thanks. Fixed.
| * [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful for testing `path`-type expressions. |
| * jackson-jq supports a subset of the full [jq](https://stedolan.github.io/jq/) syntax. Please refer to the [jackson-jq documentation](https://github.com/eiiches/jackson-jq) for details. |
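As a quick illustration of the two expression types side by side, a `flattenSpec` might look like this (the field names are hypothetical):

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "userId", "expr": "$.user.id" },
    { "type": "jq", "name": "firstTag", "expr": ".tags[0]" }
  ]
}
```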
| ## Parser (Deprecated) |
Would it be more accurate to say the string parser is deprecated since we still need the parser for hadoop ingestion?
Good point. I changed it as below:

> The Parser is deprecated for [native batch tasks](./native-batch.md), [Kafka indexing service](../development/extensions-core/kafka-ingestion.md),
> and [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md).
> Consider using the [input format](#input-format) instead for these types of ingestion.
| This firehose ingests events from a predefined list of files from a Hadoop filesystem. |
| This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task). |
| Since each split represents an HDFS file, each worker task of `index_parallel` will read an object. |
| #### Configuration for Google Cloud Storage |
Is there authentication configuration needed for accessing GCS? Could add that in a follow-on PR if so.
I added the `google.cloud.auth.service.account.enable` property. I haven't checked how it works, but just copied it from https://github.com/GoogleCloudDataproc/bigdata-interop/blob/master/gcs/INSTALL.md.
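For anyone curious, the relevant `core-site.xml` entries would look roughly like this, based on that INSTALL.md (the keyfile path is hypothetical, and I haven't verified this setup end to end):

```xml
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile.json</value>
</property>
```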
| #### Configuration for Google Cloud Storage |
| Sample spec: |
| To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. |
Google cloud Storage -> Google Cloud Storage
| This is registering the FirehoseFactory with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"firehose": { "type": "static-s3", ... }` in your realtime config, then the system will load this FirehoseFactory for your firehose. |
| This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"inputSource": { "type": "s3", ... }` in your IO config, then the system will load this InputSource for your `InputSource` implementation. | ||
| Note that inside of Druid, we have made the @JacksonInject annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a @JacksonInject annotation on a setter and it will get set on instantiation. |
suggest putting backticks around @JacksonInject
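For illustration, the registration pattern described above looks roughly like this in an extension module (the class and type names here are made up):

```java
import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.inject.Binder;
import org.apache.druid.initialization.DruidModule;

import java.util.Collections;
import java.util.List;

public class MyInputSourceModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    // Maps "inputSource": { "type": "my-storage", ... } in the IO config
    // to the hypothetical MyInputSource class
    return Collections.singletonList(
        new SimpleModule("MyInputSourceModule")
            .registerSubtypes(new NamedType(MyInputSource.class, "my-storage"))
    );
  }

  @Override
  public void configure(Binder binder)
  {
    // no Guice bindings needed for this sketch
  }
}
```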
| ### Adding support for a new data format |
| Adding support for a new data format requires to implement two interfaces, i.e., `InputFormat` and `InputEntityReader`. |
Suggest the following:
"requires to implement two interfaces, i.e.," -> "requires implementing two interfaces: "
| ``` |
| You can also read from cloud storage such as AWS S3 or Google Cloud Storage. |
| To do so, you need to install the necessary library under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_. |
Noting here that `${DRUID_HOME}/hadoop-dependencies` doesn't work for this, since the HDFS extension needs these libraries at peon startup.
Hmm, yeah. Good point. Updated the docs.
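For the record, the workaround is roughly the following (the jar name and paths are hypothetical):

```bash
# Place the connector jar where it is on the classpath at peon startup,
# e.g. alongside the hdfs-storage extension, instead of hadoop-dependencies
cp gcs-connector-hadoop2-2.0.0.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```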
| #### Configuration for Google Cloud Storage |
| To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. |
| To use the Google Cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. |
For the installation section below, I think we could point to https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md and say the following, and remove the parts where we duplicate their setup instructions:

> Please follow the instructions at https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md for configuring your `core-site.xml` with the filesystem and authentication properties needed for GCS.

We can also add the following (it took me a while to find a download link for the connector):

> The GCS connector library is available at https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters

The line below:

> Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.

can be updated to:

> Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2.
Thanks, I made changes based on the suggestions. But I would still like to keep the example properties for GCS, since they are pretty much mandatory. A similar pattern is applied to the S3 configuration.
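For reference, the example properties I kept look roughly like this (the bucket and path are hypothetical):

```properties
# GCS as deep storage via the druid-hdfs-storage extension
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://my-bucket/druid/segments
```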
TC failure doesn't seem legit.
…9171)

* Doc update for new input source and input format.
  - The input source and input format are promoted in all docs under docs/ingestion
  - All input sources including core extension ones are located in docs/ingestion/native-batch.md
  - All input formats and parsers including core extension ones are located in docs/ingestion/data-formats.md
  - New behavior of the parallel task with different partitionsSpecs is documented in docs/ingestion/native-batch.md
* parquet
* add warning for range partitioning with sequential mode
* hdfs + s3, gs
* add fs impl for gs
* address comments
* address comments
* gcs
Description

This PR contains:

- The input source and input format are promoted in all docs under docs/ingestion
- All input sources including core extension ones are located in docs/ingestion/native-batch.md
- All input formats and parsers including core extension ones are located in docs/ingestion/data-formats.md
- New behavior of the parallel task with different partitionsSpecs is documented in docs/ingestion/native-batch.md

The docs under docs/tutorial are not updated in this PR.