From 1beb18638278f230acc5fe8c78e8164f2efc85f8 Mon Sep 17 00:00:00 2001 From: Suneet Saldanha Date: Fri, 17 Jan 2020 17:59:41 -0800 Subject: [PATCH 1/2] Update docs for s3 and avro extensions --- docs/development/extensions-core/avro.md | 11 +-- docs/development/extensions-core/s3.md | 91 ++++++++++++++---------- 2 files changed, 61 insertions(+), 41 deletions(-) diff --git a/docs/development/extensions-core/avro.md b/docs/development/extensions-core/avro.md index 8befbbe22432..87a301b26a39 100644 --- a/docs/development/extensions-core/avro.md +++ b/docs/development/extensions-core/avro.md @@ -22,9 +22,10 @@ title: "Apache Avro" ~ under the License. --> -This Apache Druid extension enables Druid to ingest and understand the Apache Avro data format. Make sure to [include](../../development/extensions.md#loading-extensions) `druid-avro-extensions` as an extension. +## Druid Avro extensions +This Apache Druid extension enables Druid to ingest and understand the Apache Avro data format. This extension provides +two Avro Parsers for stream ingestion and Hadoop batch ingestion. +See [Avro Hadoop Parser](../../ingestion/data-formats.md#avro-hadoop-parser) and [Avro Stream Parser](../../ingestion/data-formats.md#avro-stream-parser) +for more details about how to use these in an ingestion spec. -The `druid-avro-extensions` provides two Avro Parsers for stream ingestion and Hadoop batch ingestion. -See [Avro Hadoop Parser](../../ingestion/data-formats.md#avro-hadoop-parser) -and [Avro Stream Parser](../../ingestion/data-formats.md#avro-stream-parser) -for details. +Make sure to [include](../../development/extensions.md#loading-extensions) `druid-avro-extensions` as an extension. \ No newline at end of file diff --git a/docs/development/extensions-core/s3.md b/docs/development/extensions-core/s3.md index ef3539b4f147..03c7d9781ac0 100644 --- a/docs/development/extensions-core/s3.md +++ b/docs/development/extensions-core/s3.md @@ -22,17 +22,58 @@ title: "S3-compatible" ~ under the License. --> +## S3 extension +This extension allows you to do 2 things +* Write segmenets for [deep storage](#deep-storage) in S3 +* [Ingest data](#reading-data-from-s3) from files stored in S3 To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-s3-extensions` as an extension. -## Deep Storage +### Deep Storage S3-compatible deep storage means either AWS S3 or a compatible service like Google Storage which exposes the same API as S3. -### Configuration - S3 deep storage needs to be explicitly enabled by setting `druid.storage.type=s3`. **Only after setting the storage type to S3 will any of the settings below take effect.** +To correctly configure this extension for deep storage in S3, update the [configuration](#configuration) to set up connectivity to AWS. +In addition to this you need to set additional configuration, specific for [deep storage](#deep-storage-specific-configuration) + +### Reading data from S3 + +The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task) +to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md), +you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec). 
+
+To configure the extension to read an S3 file, see [these properties](#connecting-to-s3-configuration).
+
+## Configuration
+
+### S3 authentication methods
+
+To connect to your S3 bucket (whether deep storage bucket or source bucket), Druid uses the following credentials provider chain:
+
+|order|type|details|
+|--------|-----------|-------|
+|1|Druid config file|Based on your runtime.properties, if it contains values for `druid.s3.accessKey` and `druid.s3.secretKey`|
+|2|Custom properties file|Based on a custom properties file where you can supply `sessionToken`, `accessKey` and `secretKey` values. This file is provided to Druid through the `druid.s3.fileSessionCredentials` property|
+|3|Environment variables|Based on the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`|
+|4|Java system properties|Based on the JVM properties `aws.accessKeyId` and `aws.secretKey`|
+|5|Profile information|Based on credentials you may have on your Druid instance (generally in `~/.aws/credentials`)|
+|6|ECS container credentials|Based on environment variables available on AWS ECS (`AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` or `AWS_CONTAINER_CREDENTIALS_FULL_URI`) as described in the [EC2ContainerCredentialsProviderWrapper documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)|
+|7|Instance profile information|Based on the instance profile you may have attached to your Druid instance|
+
+You can find more information about authentication methods [here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials.html).
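+
+As a minimal sketch of the first method in the chain, supplying credentials through the Druid config file means adding the two properties below to your `runtime.properties`. The values shown are placeholders for illustration only, not real credentials:
+
+```properties
+# Hypothetical placeholder values -- replace with your own AWS key pair.
+druid.s3.accessKey=YOUR_AWS_ACCESS_KEY_ID
+druid.s3.secretKey=YOUR_AWS_SECRET_ACCESS_KEY
+```
+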
**Note:** *Order is important here, as it indicates the precedence of authentication methods.
+So if you are trying to use Instance profile information, you **must not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties* + + +### S3 permissions settings + +`s3:GetObject` and `s3:PutObject` are basically required for pushing/loading segments to/from S3. +If `druid.storage.disableAcl` is set to `false`, then `s3:GetBucketAcl` and `s3:PutObjectAcl` are additionally required to set ACL for objects. + +### AWS region + The AWS SDK requires that the target region be specified. Two ways of doing this are by using the JVM system property `aws.region` or the environment variable `AWS_REGION`. As an example, to set the region to 'us-east-1' through system properties: @@ -40,6 +81,8 @@ As an example, to set the region to 'us-east-1' through system properties: - Add `-Daws.region=us-east-1` to the jvm.config file for all Druid services. - Add `-Daws.region=us-east-1` to `druid.indexer.runner.javaOpts` in [Middle Manager configuration](../../configuration/index.md#middlemanager-configuration) so that the property will be passed to Peon (worker) processes. +### Connecting to S3 configuration + |Property|Description|Default| |--------|-----------|-------| |`druid.s3.accessKey`|S3 access key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.| @@ -55,39 +98,21 @@ As an example, to set the region to 'us-east-1' through system properties: |`druid.s3.proxy.port`|Port on the proxy host to connect through.|None| |`druid.s3.proxy.username`|User name to use when connecting through a proxy.|None| |`druid.s3.proxy.password`|Password to use when connecting through a proxy.|None| -|`druid.storage.bucket`|Bucket to store in.|Must be set.| -|`druid.storage.baseKey`|Base key prefix to use, i.e. what directory.|Must be set.| -|`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none| -|`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none| -|`druid.storage.disableAcl`|Boolean flag to disable ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. See [S3 permissions settings](#s3-permissions-settings).|false| |`druid.storage.sse.type`|Server-side encryption type. Should be one of `s3`, `kms`, and `custom`. See the below [Server-side encryption section](#server-side-encryption) for more details.|None| |`druid.storage.sse.kms.keyId`|AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.|None| |`druid.storage.sse.custom.base64EncodedKey`|Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.|None| -|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).| -|`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false| - -### S3 permissions settings - -`s3:GetObject` and `s3:PutObject` are basically required for pushing/loading segments to/from S3. -If `druid.storage.disableAcl` is set to `false`, then `s3:GetBucketAcl` and `s3:PutObjectAcl` are additionally required to set ACL for objects. 
-### S3 authentication methods - -To connect to your S3 bucket (whether deep storage bucket or source bucket), Druid use the following credentials providers chain +#### Deep storage specific configuration -|order|type|details| +|Property|Description|Default| |--------|-----------|-------| -|1|Druid config file|Based on your runtime.properties if it contains values `druid.s3.accessKey` and `druid.s3.secretKey` | -|2|Custom properties file| Based on custom properties file where you can supply `sessionToken`, `accessKey` and `secretKey` values. This file is provided to Druid through `druid.s3.fileSessionCredentials` properties| -|3|Environment variables|Based on environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`| -|4|Java system properties|Based on JVM properties `aws.accessKeyId` and `aws.secretKey` | -|5|Profile information|Based on credentials you may have on your druid instance (generally in `~/.aws/credentials`)| -|6|ECS container credentials|Based on environment variables available on AWS ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the [EC2ContainerCredentialsProviderWrapper documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)| -|7|Instance profile information|Based on the instance profile you may have attached to your druid instance| - -You can find more information about authentication method [here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials.html)
-**Note :** *Order is important here as it indicates the precedence of authentication methods.
-So if you are trying to use Instance profile information, you **must not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties* +|`druid.storage.bucket`|Bucket to store in.|Must be set.| +|`druid.storage.baseKey`|Base key prefix to use, i.e. what directory.|Must be set.| +|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).| +|`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none| +|`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none| +|`druid.storage.disableAcl`|Boolean flag to disable ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. See [S3 permissions settings](#s3-permissions-settings).|false| +|`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false| ## Server-side encryption @@ -97,9 +122,3 @@ You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/lat - s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) - kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html) - -## Reading data from S3 - -The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task) -to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md), -you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec). From 99af592a31e973688972dad57b1c3e79ddc079ba Mon Sep 17 00:00:00 2001 From: Suneet Saldanha Date: Fri, 17 Jan 2020 22:33:30 -0800 Subject: [PATCH 2/2] More doc updates - google + cleanup --- docs/development/extensions-core/avro.md | 3 +- docs/development/extensions-core/google.md | 31 ++++++++++----- docs/development/extensions-core/orc.md | 10 ++--- docs/development/extensions-core/s3.md | 45 +++++++++++----------- 4 files changed, 52 insertions(+), 37 deletions(-) diff --git a/docs/development/extensions-core/avro.md b/docs/development/extensions-core/avro.md index 87a301b26a39..3ec4c70d9600 100644 --- a/docs/development/extensions-core/avro.md +++ b/docs/development/extensions-core/avro.md @@ -22,7 +22,8 @@ title: "Apache Avro" ~ under the License. --> -## Druid Avro extensions +## Avro extension + This Apache Druid extension enables Druid to ingest and understand the Apache Avro data format. This extension provides two Avro Parsers for stream ingestion and Hadoop batch ingestion. See [Avro Hadoop Parser](../../ingestion/data-formats.md#avro-hadoop-parser) and [Avro Stream Parser](../../ingestion/data-formats.md#avro-stream-parser) diff --git a/docs/development/extensions-core/google.md b/docs/development/extensions-core/google.md index 49a4c4cb775c..582550735980 100644 --- a/docs/development/extensions-core/google.md +++ b/docs/development/extensions-core/google.md @@ -22,23 +22,36 @@ title: "Google Cloud Storage" ~ under the License. 
-->
+## Google Cloud Storage Extension
-
-To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-google-extensions` extension and run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
+This extension allows you to do 2 things:
+* [Ingest data](#reading-data-from-google-cloud-storage) from files stored in Google Cloud Storage.
+* Write segments to [deep storage](#deep-storage) in Google Cloud Storage.
-## Deep Storage
+To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) the `druid-google-extensions` extension.
+
+### Required Configuration
+
+To configure connectivity to Google Cloud, run Druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
+
+### Reading data from Google Cloud Storage
+
+The [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
+to read objects directly from Google Cloud Storage. If you use the [Hadoop task](../../ingestion/hadoop.md),
+you can read data from Google Cloud Storage by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
+
+Objects can also be read directly from Google Cloud Storage via the [StaticGoogleBlobStoreFirehose](../../ingestion/native-batch.md#staticgoogleblobstorefirehose).
+
+### Deep Storage
Deep storage can be written to Google Cloud Storage either via this extension or the [druid-hdfs-storage extension](../extensions-core/hdfs.md).
-### Configuration
+#### Configuration
+
+To configure connectivity to Google Cloud, run Druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|google||Must be set.|
|`druid.google.bucket`||GCS bucket name.|Must be set.|
|`druid.google.prefix`||GCS prefix.|No-prefix|
-
-## Reading data from Google Cloud Storage
-
-The [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
-to read objects directly from Google Cloud Storage. If you use the [Hadoop task](../../ingestion/hadoop.md),
-you can read data from Google Cloud Storage by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
diff --git a/docs/development/extensions-core/orc.md b/docs/development/extensions-core/orc.md
index 26e79104cdbf..6856baf78cbf 100644
--- a/docs/development/extensions-core/orc.md
+++ b/docs/development/extensions-core/orc.md
@@ -22,16 +22,16 @@ title: "ORC Extension"
~ under the License.
-->
+## ORC extension
-
-This Apache Druid module extends [Druid Hadoop based indexing](../../ingestion/hadoop.md) to ingest data directly from offline
-Apache ORC files.
+This Apache Druid extension enables Druid to ingest and understand the Apache ORC data format.
-To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`.
- -The `druid-orc-extensions` provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser) +The extension provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser) for [native batch ingestion](../../ingestion/native-batch.md) and [Hadoop batch ingestion](../../ingestion/hadoop.md), respectively. Please see corresponding docs for details. +To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`. + ### Migration from 'contrib' extension This extension, first available in version 0.15.0, replaces the previous 'contrib' extension which was available until 0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the JSON spec for the diff --git a/docs/development/extensions-core/s3.md b/docs/development/extensions-core/s3.md index 03c7d9781ac0..509b97a52af5 100644 --- a/docs/development/extensions-core/s3.md +++ b/docs/development/extensions-core/s3.md @@ -23,28 +23,41 @@ title: "S3-compatible" --> ## S3 extension -This extension allows you to do 2 things -* Write segmenets for [deep storage](#deep-storage) in S3 -* [Ingest data](#reading-data-from-s3) from files stored in S3 + +This extension allows you to do 2 things: +* [Ingest data](#reading-data-from-s3) from files stored in S3. +* Write segments to [deep storage](#deep-storage) in S3. To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-s3-extensions` as an extension. +### Reading data from S3 + +The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task) +to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md), +you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec). + +To configure the extension to read objects from S3 you need to configure how to [connect to S3](#configuration). + ### Deep Storage S3-compatible deep storage means either AWS S3 or a compatible service like Google Storage which exposes the same API as S3. S3 deep storage needs to be explicitly enabled by setting `druid.storage.type=s3`. **Only after setting the storage type to S3 will any of the settings below take effect.** -To correctly configure this extension for deep storage in S3, update the [configuration](#configuration) to set up connectivity to AWS. +To correctly configure this extension for deep storage in S3, first configure how to [connect to S3](#configuration). In addition to this you need to set additional configuration, specific for [deep storage](#deep-storage-specific-configuration) -### Reading data from S3 - -The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task) -to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md), -you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec). 
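+
+As an illustrative sketch only (the bucket name and object path below are placeholders), the S3 input source portion of a Parallel task `ioConfig` might look like this; see the [S3 input source](../../ingestion/native-batch.md#s3-input-source) docs for the full set of options:
+
+```json
+{
+  "ioConfig": {
+    "type": "index_parallel",
+    "inputSource": {
+      "type": "s3",
+      "uris": ["s3://your-bucket/path/to/sample-file.json"]
+    },
+    "inputFormat": {
+      "type": "json"
+    }
+  }
+}
+```
+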
+#### Deep storage specific configuration -To configure the extension to read an S3 file you see [these properties](#connecting-to-s3-configuration) +|Property|Description|Default| +|--------|-----------|-------| +|`druid.storage.bucket`|Bucket to store in.|Must be set.| +|`druid.storage.baseKey`|Base key prefix to use, i.e. what directory.|Must be set.| +|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).| +|`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none| +|`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none| +|`druid.storage.disableAcl`|Boolean flag to disable ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. See [S3 permissions settings](#s3-permissions-settings).|false| +|`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false| ## Configuration @@ -102,18 +115,6 @@ As an example, to set the region to 'us-east-1' through system properties: |`druid.storage.sse.kms.keyId`|AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.|None| |`druid.storage.sse.custom.base64EncodedKey`|Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.|None| -#### Deep storage specific configuration - -|Property|Description|Default| -|--------|-----------|-------| -|`druid.storage.bucket`|Bucket to store in.|Must be set.| -|`druid.storage.baseKey`|Base key prefix to use, i.e. what directory.|Must be set.| -|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).| -|`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none| -|`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none| -|`druid.storage.disableAcl`|Boolean flag to disable ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. See [S3 permissions settings](#s3-permissions-settings).|false| -|`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false| - ## Server-side encryption You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html) by setting