Doc update for the new input source and the new input format #9171

jon-wei merged 8 commits into apache:master
Conversation
- The input source and input format are promoted in all docs under docs/ingestion - All input sources including core extension ones are located in docs/ingestion/native-batch.md - All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md - New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md
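For illustration, a minimal `index_parallel` spec using the promoted `inputSource` and `inputFormat` looks roughly like this (the data source name, paths, and columns below are hypothetical, not taken from the docs):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "language"] },
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "none" }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/wikipedia", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "index_parallel", "partitionsSpec": { "type": "dynamic" } }
  }
}
```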
@maytasm3 thanks for taking a look. I added the FS impl for GS. BTW, you could leave your comments on the PR, not on the commit, next time. Technically, reviewing a commit in a repository other than the Druid repo is not part of our code review process. Also, GitHub will properly show your comments at the correct lines in the "Files changed" tab, which makes the review easier.
| You can also use AWS S3 or Google Cloud Storage as the deep storage via HDFS. |
| #### Configuration for AWS S3 |
Do I need to add the s3 extension for this support or is it bundled with the hdfs extension somehow?
Sorry, please ignore; I see that the hadoop-aws module needs to be added, as mentioned below.
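For anyone following along, a rough sketch of what this setup looks like (the bucket name and path are hypothetical, and this assumes the `hadoop-aws` module is on the classpath with `fs.s3a.access.key` and `fs.s3a.secret.key` set in `core-site.xml`):

```properties
# Deep storage via the druid-hdfs-storage extension, pointed at S3
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://my-bucket/druid/segments
```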
| |prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2| |
| |fetchTimeout|Timeout for fetching each file.|60000| |
| |maxFetchRetry|Maximum number of retries for fetching each file.|3| |
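For reference, these properties sit alongside the other firehose fields, roughly like this (the URI is hypothetical; 536870912 bytes is half of the default 1 GiB `maxFetchCapacityBytes`):

```json
"firehose": {
  "type": "static-s3",
  "uris": ["s3://my-bucket/data/file1.json"],
  "prefetchTriggerBytes": 536870912,
  "fetchTimeout": 60000,
  "maxFetchRetry": 3
}
```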
| Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2. |
Is this still accurate? Have we done more recent tests?
It was tested before I started working on Druid, and I don't know what the test coverage was. There are no more recent tests that I'm aware of.
| ### Native batch ingestion |
| The [HDFS input source](../../ingestion/native-batch.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task) |
| to read files directly from HDFS storage. However, we highly recommend using a proper |
What type of input source should I use instead of the hdfs input source? Why is this beneficial?
It depends on the type of your cloud storage. The benefit is that it's simpler to use, without the extra setup needed to read from cloud storage via the HDFS library, which is basically the same as the steps described above. We currently support only S3 and Google Cloud Storage input sources, so if you want to read from something else, such as Azure, you may want to use the HDFS input source. But I don't think we have tested this functionality very well, and I also don't know how to set it up properly.
Added this to the doc.
| [Input Source](../../ingestion/native-batch.md#input-sources) instead to read objects from Cloud storage. |
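To make the comparison concrete, reading the same hypothetical dataset with each input source looks roughly like this via HDFS:

```json
"inputSource": { "type": "hdfs", "paths": "hdfs://namenode:8020/datasets/wikipedia/*" }
```

versus the native cloud input source:

```json
"inputSource": { "type": "s3", "prefixes": ["s3://my-bucket/datasets/wikipedia/"] }
```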
| ### Hadoop-based ingestion |
Is one of these ingestion methods recommended over the other? How do I decide which one to use?
You mean between the native batch ingestion and the Hadoop-based one? It's explained at https://github.com/apache/druid/pull/9171/files#diff-3ae520a063215c87a2a6c144eeb0bfc0R74-R80.
suneet-s left a comment
LGTM - thanks so much for re-writing so much of the docs! I have a follow-up change coming, so I can address the comments if you want to merge as is.
| "type": "kafka", |
| "dataSchema": { |
| "dataSource": "metrics-kafka", |
| "parser": { |
I didn't update the kafka tutorial to use this spec. I can follow up in a separate patch
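For reference, a supervisor spec using the new `inputFormat` instead of the parser would look roughly like this (the topic, columns, and broker address are hypothetical):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-kafka",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["host", "service"] }
  },
  "ioConfig": {
    "topic": "metrics",
    "inputFormat": { "type": "json" },
    "consumerProperties": { "bootstrap.servers": "localhost:9092" }
  }
}
```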
| "ansi-regex": { |
| "version": "2.1.1", |
| "bundled": true, |
| "dev": true, |
just curious why all of these were marked as optional before, but not needed any more
Oops, this is not supposed to be added. Reverted all changes in this file.
| ``` |
| { |
| "type": "index", |
| "type": "parallel_index", |
this should be `index_parallel` - same comment on lines 299, 332, 349. I have a doc change coming up, so I can fix it in the next patch as well.
Oops, thanks. Fixed.
| * [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful for testing `path`-type expressions. |
| * jackson-jq supports a subset of the full [jq](https://stedolan.github.io/jq/) syntax. Please refer to the [jackson-jq documentation](https://github.com/eiiches/jackson-jq) for details. |
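As a quick illustration of the two expression types side by side, a `flattenSpec` might look like this (the field names are hypothetical):

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "userId", "expr": "$.user.id" },
    { "type": "jq", "name": "firstTag", "expr": ".tags[0]" }
  ]
}
```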
| ## Parser (Deprecated) |
Would it be more accurate to say the string parser is deprecated since we still need the parser for hadoop ingestion?
Good point. I changed it as below:

> The Parser is deprecated for [native batch tasks](./native-batch.md), [Kafka indexing service](../development/extensions-core/kafka-ingestion.md),
> and [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md).
> Consider using the [input format](#input-format) instead for these types of ingestion.
| This firehose ingests events from a predefined list of files from a Hadoop filesystem. |
| This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task). |
| Since each split represents an HDFS file, each worker task of `index_parallel` will read an object. |
| #### Configuration for Google Cloud Storage |
Is there authentication configuration needed for accessing GCS? Could add that in a follow-on PR if so.
I added the `google.cloud.auth.service.account.enable` property. I haven't checked how it works, but just copied it from https://github.com/GoogleCloudDataproc/bigdata-interop/blob/master/gcs/INSTALL.md.
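For anyone curious, the relevant `core-site.xml` entries would look roughly like this, based on that INSTALL.md (the keyfile path is hypothetical, and I haven't verified this setup end to end):

```xml
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile.json</value>
</property>
```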
| #### Configuration for Google Cloud Storage |
| Sample spec: |
| To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. |
Google cloud Storage -> Google Cloud Storage
| This is registering the FirehoseFactory with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"firehose": { "type": "static-s3", ... }` in your realtime config, then the system will load this FirehoseFactory for your firehose. |
| This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"inputSource": { "type": "s3", ... }` in your IO config, then the system will load this InputSource for your `InputSource` implementation. | ||
| Note that inside of Druid, we have made the @JacksonInject annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a @JacksonInject annotation on a setter and it will get set on instantiation. |
suggest putting backticks around @JacksonInject
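For illustration, the registration pattern described above looks roughly like this in an extension module (the class and type names here are made up):

```java
import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.inject.Binder;
import org.apache.druid.initialization.DruidModule;

import java.util.Collections;
import java.util.List;

public class MyInputSourceModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    // Maps "inputSource": { "type": "my-storage", ... } in the IO config
    // to the hypothetical MyInputSource class
    return Collections.singletonList(
        new SimpleModule("MyInputSourceModule")
            .registerSubtypes(new NamedType(MyInputSource.class, "my-storage"))
    );
  }

  @Override
  public void configure(Binder binder)
  {
    // no Guice bindings needed for this sketch
  }
}
```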
| ### Adding support for a new data format |
| Adding support for a new data format requires to implement two interfaces, i.e., `InputFormat` and `InputEntityReader`. |
Suggest the following:
"requires to implement two interfaces, i.e.," -> "requires implementing two interfaces: "
| ``` |
| You can also read from cloud storage such as AWS S3 or Google Cloud Storage. |
| To do so, you need to install the necessary library under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_. |
Noting here that `${DRUID_HOME}/hadoop-dependencies` doesn't work for this, since the HDFS extension needs these libraries at peon startup.
Hmm, yeah. Good point. Updated the docs.
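For the record, the workaround is roughly the following (the jar name and paths are hypothetical):

```bash
# Place the connector jar where it is on the classpath at peon startup,
# e.g. alongside the hdfs-storage extension, instead of hadoop-dependencies
cp gcs-connector-hadoop2-2.0.0.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```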
| #### Configuration for Google Cloud Storage |
| To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. |
| To use the Google Cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. |
For the installation section below, I think we could point to https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md and say the following, and remove the parts where we duplicate their setup instructions:

> Please follow the instructions at https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md for configuring your `core-site.xml` with the filesystem and authentication properties needed for GCS.

We can also add the following (it took me a while to find a download link for the connector):

> The GCS connector library is available at https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters

The line below:

> Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.

can be updated to:

> Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2.
Thanks, I made changes based on the suggestions. But I would still like to keep the example properties for GCS, since they are pretty much mandatory. A similar pattern is applied to the S3 configuration.
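For reference, the example properties I kept look roughly like this (the bucket and path are hypothetical):

```properties
# GCS as deep storage via the druid-hdfs-storage extension
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://my-bucket/druid/segments
```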
TC failure doesn't seem legit.
…9171)

* Doc update for new input source and input format.
  - The input source and input format are promoted in all docs under docs/ingestion
  - All input sources including core extension ones are located in docs/ingestion/native-batch.md
  - All input formats and parsers including core extension ones are located in docs/ingestion/data-formats.md
  - New behavior of the parallel task with different partitionsSpecs is documented in docs/ingestion/native-batch.md
* parquet
* add warning for range partitioning with sequential mode
* hdfs + s3, gs
* add fs impl for gs
* address comments
* address comments
* gcs
Description

This PR contains:

- The input source and input format are promoted in all docs under docs/ingestion
- All input sources including core extension ones are located in docs/ingestion/native-batch.md
- All input formats and parsers including core extension ones are located in docs/ingestion/data-formats.md
- New behavior of the parallel task with different partitionsSpecs is documented in docs/ingestion/native-batch.md

The docs under docs/tutorial are not updated in this PR.