diff --git a/docs/development/extensions-core/avro.md b/docs/development/extensions-core/avro.md
index 8befbbe22432..3ec4c70d9600 100644
--- a/docs/development/extensions-core/avro.md
+++ b/docs/development/extensions-core/avro.md
@@ -22,9 +22,11 @@ title: "Apache Avro"
   ~ under the License.
   -->
 
-This Apache Druid extension enables Druid to ingest and understand the Apache Avro data format. Make sure to [include](../../development/extensions.md#loading-extensions) `druid-avro-extensions` as an extension.
+## Avro extension
 
-The `druid-avro-extensions` provides two Avro Parsers for stream ingestion and Hadoop batch ingestion.
-See [Avro Hadoop Parser](../../ingestion/data-formats.md#avro-hadoop-parser)
-and [Avro Stream Parser](../../ingestion/data-formats.md#avro-stream-parser)
-for details.
+This Apache Druid extension enables Druid to ingest and understand the Apache Avro data format. This extension provides
+two Avro parsers for stream ingestion and Hadoop batch ingestion.
+See [Avro Hadoop Parser](../../ingestion/data-formats.md#avro-hadoop-parser) and [Avro Stream Parser](../../ingestion/data-formats.md#avro-stream-parser)
+for more details about how to use these in an ingestion spec.
+
+Make sure to [include](../../development/extensions.md#loading-extensions) `druid-avro-extensions` as an extension.
\ No newline at end of file
diff --git a/docs/development/extensions-core/google.md b/docs/development/extensions-core/google.md
index 49a4c4cb775c..582550735980 100644
--- a/docs/development/extensions-core/google.md
+++ b/docs/development/extensions-core/google.md
@@ -22,23 +22,36 @@ title: "Google Cloud Storage"
   ~ under the License.
   -->
 
+## Google Cloud Storage extension
 
-To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-google-extensions` extension and run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
+This extension allows you to do two things:
+* [Ingest data](#reading-data-from-google-cloud-storage) from files stored in Google Cloud Storage.
+* Write segments to [deep storage](#deep-storage) in Google Cloud Storage.
 
-## Deep Storage
+To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) the `druid-google-extensions` extension.
+
+### Required Configuration
+
+To configure connectivity to Google Cloud, run Druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
+
+### Reading data from Google Cloud Storage
+
+The [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
+to read objects directly from Google Cloud Storage. If you use the [Hadoop task](../../ingestion/hadoop.md),
+you can read data from Google Cloud Storage by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
+
+Objects can also be read directly from Google Cloud Storage via the [StaticGoogleBlobStoreFirehose](../../ingestion/native-batch.md#staticgoogleblobstorefirehose).
+
+### Deep Storage
 
 Deep storage can be written to Google Cloud Storage either via this extension or the [druid-hdfs-storage extension](../extensions-core/hdfs.md).
 
-### Configuration
+#### Configuration
+
+To configure connectivity to Google Cloud, run Druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
 
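+For example, one way to satisfy this is to export the variable in the environment that launches each Druid service (the key file path is illustrative):
+
+```bash
+export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile
+```
+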
 |Property|Possible Values|Description|Default|
 |--------|---------------|-----------|-------|
 |`druid.storage.type`|google||Must be set.|
 |`druid.google.bucket`||GCS bucket name.|Must be set.|
 |`druid.google.prefix`||GCS prefix.|No-prefix|
-
-## Reading data from Google Cloud Storage
-
-The [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
-to read objects directly from Google Cloud Storage. If you use the [Hadoop task](../../ingestion/hadoop.md),
-you can read data from Google Cloud Storage by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
diff --git a/docs/development/extensions-core/orc.md b/docs/development/extensions-core/orc.md
index 26e79104cdbf..6856baf78cbf 100644
--- a/docs/development/extensions-core/orc.md
+++ b/docs/development/extensions-core/orc.md
@@ -22,16 +22,16 @@ title: "ORC Extension"
   ~ under the License.
   -->
 
+## ORC extension
 
-This Apache Druid module extends [Druid Hadoop based indexing](../../ingestion/hadoop.md) to ingest data directly from offline
-Apache ORC files.
+This Apache Druid extension enables Druid to ingest and understand the Apache ORC data format.
 
-To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`.
-
-The `druid-orc-extensions` provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser)
+The extension provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser)
 for [native batch ingestion](../../ingestion/native-batch.md) and [Hadoop batch ingestion](../../ingestion/hadoop.md), respectively.
 Please see corresponding docs for details.
 
+To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`.
+
 ### Migration from 'contrib' extension
 
 This extension, first available in version 0.15.0, replaces the previous 'contrib' extension which was available until 0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the JSON spec for the
diff --git a/docs/development/extensions-core/s3.md b/docs/development/extensions-core/s3.md
index ef3539b4f147..509b97a52af5 100644
--- a/docs/development/extensions-core/s3.md
+++ b/docs/development/extensions-core/s3.md
@@ -22,54 +22,44 @@ title: "S3-compatible"
   ~ under the License.
   -->
 
+## S3 extension
+
+This extension allows you to do two things:
+* [Ingest data](#reading-data-from-s3) from files stored in S3.
+* Write segments to [deep storage](#deep-storage) in S3.
 
 To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-s3-extensions` as an extension.
 
-## Deep Storage
+### Reading data from S3
 
-S3-compatible deep storage means either AWS S3 or a compatible service like Google Storage which exposes the same API as S3.
+The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
+to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md),
+you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
+
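+For example, the `ioConfig` portion of a [Parallel task](../../ingestion/native-batch.md#parallel-task) spec might reference S3 like this (bucket and object paths are illustrative):
+
+```json
+"ioConfig": {
+  "type": "index_parallel",
+  "inputSource": {
+    "type": "s3",
+    "uris": ["s3://your-bucket/path/to/file.json"]
+  },
+  "inputFormat": {
+    "type": "json"
+  }
+}
+```
+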
-### Configuration
+To configure the extension to read objects from S3, you need to configure how to [connect to S3](#configuration).
 
-S3 deep storage needs to be explicitly enabled by setting `druid.storage.type=s3`. **Only after setting the storage type to S3 will any of the settings below take effect.**
+### Deep Storage
 
-The AWS SDK requires that the target region be specified. Two ways of doing this are by using the JVM system property `aws.region` or the environment variable `AWS_REGION`.
+S3-compatible deep storage means either AWS S3 or a compatible service like Google Storage which exposes the same API as S3.
 
-As an example, to set the region to 'us-east-1' through system properties:
+S3 deep storage needs to be explicitly enabled by setting `druid.storage.type=s3`. **Only after setting the storage type to S3 will any of the settings below take effect.**
 
-- Add `-Daws.region=us-east-1` to the jvm.config file for all Druid services.
-- Add `-Daws.region=us-east-1` to `druid.indexer.runner.javaOpts` in [Middle Manager configuration](../../configuration/index.md#middlemanager-configuration) so that the property will be passed to Peon (worker) processes.
+To correctly configure this extension for deep storage in S3, first configure how to [connect to S3](#configuration).
+In addition to this, you need to set additional configuration specific to [deep storage](#deep-storage-specific-configuration).
+
+#### Deep storage specific configuration
 
 |Property|Description|Default|
 |--------|-----------|-------|
-|`druid.s3.accessKey`|S3 access key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.|
-|`druid.s3.secretKey`|S3 secret key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.|
-|`druid.s3.fileSessionCredentials`|Path to properties file containing `sessionToken`, `accessKey` and `secretKey` value. One key/value pair per line (format `key=value`). See [S3 authentication methods](#s3-authentication-methods) for more details |Can be omitted according to authentication methods chosen.|
-|`druid.s3.protocol`|Communication protocol type to use when sending requests to AWS. `http` or `https` can be used. This configuration would be ignored if `druid.s3.endpoint.url` is filled with a URL with a different protocol.|`https`|
-|`druid.s3.disableChunkedEncoding`|Disables chunked encoding. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#disableChunkedEncoding--) for details.|false|
-|`druid.s3.enablePathStyleAccess`|Enables path style access. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#enablePathStyleAccess--) for details.|false|
-|`druid.s3.forceGlobalBucketAccessEnabled`|Enables global bucket access. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#setForceGlobalBucketAccessEnabled-java.lang.Boolean-) for details.|false|
-|`druid.s3.endpoint.url`|Service endpoint either with or without the protocol.|None|
-|`druid.s3.endpoint.signingRegion`|Region to use for SigV4 signing of requests (e.g. us-west-1).|None|
-|`druid.s3.proxy.host`|Proxy host to connect through.|None|
-|`druid.s3.proxy.port`|Port on the proxy host to connect through.|None|
-|`druid.s3.proxy.username`|User name to use when connecting through a proxy.|None|
-|`druid.s3.proxy.password`|Password to use when connecting through a proxy.|None|
 |`druid.storage.bucket`|Bucket to store in.|Must be set.|
 |`druid.storage.baseKey`|Base key prefix to use, i.e. what directory.|Must be set.|
+|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).|
 |`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none|
 |`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none|
 |`druid.storage.disableAcl`|Boolean flag to disable ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. See [S3 permissions settings](#s3-permissions-settings).|false|
-|`druid.storage.sse.type`|Server-side encryption type. Should be one of `s3`, `kms`, and `custom`. See the below [Server-side encryption section](#server-side-encryption) for more details.|None|
-|`druid.storage.sse.kms.keyId`|AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.|None|
-|`druid.storage.sse.custom.base64EncodedKey`|Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.|None|
-|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).|
 |`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false|
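+
+For example, a minimal S3 deep storage configuration (e.g. in `common.runtime.properties`) might look like the following, with an illustrative bucket and prefix:
+
+```properties
+druid.storage.type=s3
+druid.storage.bucket=your-druid-bucket
+druid.storage.baseKey=druid/segments
+```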
 
-### S3 permissions settings
-
-`s3:GetObject` and `s3:PutObject` are basically required for pushing/loading segments to/from S3.
-If `druid.storage.disableAcl` is set to `false`, then `s3:GetBucketAcl` and `s3:PutObjectAcl` are additionally required to set ACL for objects.
+## Configuration
 
 ### S3 authentication methods
 
@@ -89,6 +79,42 @@ You can find more information about authentication method [here](https://docs.aw
 
 **Note :** *Order is important here as it indicates the precedence of authentication methods.
 So if you are trying to use Instance profile information, you **must not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties*
+
+### S3 permissions settings
+
+`s3:GetObject` and `s3:PutObject` are basically required for pushing/loading segments to/from S3.
+If `druid.storage.disableAcl` is set to `false`, then `s3:GetBucketAcl` and `s3:PutObjectAcl` are additionally required to set ACL for objects.
+
+### AWS region
+
+The AWS SDK requires that the target region be specified. Two ways of doing this are by using the JVM system property `aws.region` or the environment variable `AWS_REGION`.
+
+As an example, to set the region to 'us-east-1' through system properties:
+
+- Add `-Daws.region=us-east-1` to the jvm.config file for all Druid services.
+- Add `-Daws.region=us-east-1` to `druid.indexer.runner.javaOpts` in [Middle Manager configuration](../../configuration/index.md#middlemanager-configuration) so that the property will be passed to Peon (worker) processes.
+
+### Connecting to S3 configuration
+
+|Property|Description|Default|
+|--------|-----------|-------|
+|`druid.s3.accessKey`|S3 access key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.|
+|`druid.s3.secretKey`|S3 secret key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.|
+|`druid.s3.fileSessionCredentials`|Path to properties file containing `sessionToken`, `accessKey` and `secretKey` value. One key/value pair per line (format `key=value`). See [S3 authentication methods](#s3-authentication-methods) for more details |Can be omitted according to authentication methods chosen.|
+|`druid.s3.protocol`|Communication protocol type to use when sending requests to AWS. `http` or `https` can be used. This configuration would be ignored if `druid.s3.endpoint.url` is filled with a URL with a different protocol.|`https`|
+|`druid.s3.disableChunkedEncoding`|Disables chunked encoding. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#disableChunkedEncoding--) for details.|false|
+|`druid.s3.enablePathStyleAccess`|Enables path style access. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#enablePathStyleAccess--) for details.|false|
+|`druid.s3.forceGlobalBucketAccessEnabled`|Enables global bucket access. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#setForceGlobalBucketAccessEnabled-java.lang.Boolean-) for details.|false|
+|`druid.s3.endpoint.url`|Service endpoint either with or without the protocol.|None|
+|`druid.s3.endpoint.signingRegion`|Region to use for SigV4 signing of requests (e.g. us-west-1).|None|
+|`druid.s3.proxy.host`|Proxy host to connect through.|None|
+|`druid.s3.proxy.port`|Port on the proxy host to connect through.|None|
+|`druid.s3.proxy.username`|User name to use when connecting through a proxy.|None|
+|`druid.s3.proxy.password`|Password to use when connecting through a proxy.|None|
+|`druid.storage.sse.type`|Server-side encryption type. Should be one of `s3`, `kms`, and `custom`. See the below [Server-side encryption section](#server-side-encryption) for more details.|None|
+|`druid.storage.sse.kms.keyId`|AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.|None|
+|`druid.storage.sse.custom.base64EncodedKey`|Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.|None|
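+
+For example, to point the extension at an S3-compatible service other than AWS, the connection properties might look like this (endpoint and region are illustrative):
+
+```properties
+druid.s3.endpoint.url=storage.example.com
+druid.s3.endpoint.signingRegion=us-west-1
+druid.s3.protocol=https
+```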
+
 ## Server-side encryption
 
 You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html) by setting
@@ -97,9 +123,3 @@ You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/lat
 - s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html)
 - kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html)
 - custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html)
-
-## Reading data from S3
-
-The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
-to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md),
-you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
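+
+For example, to have segments written with server-side encryption using KMS-managed keys, the deep storage configuration might additionally include the following (the key ID is illustrative and can be left empty to use the default key):
+
+```properties
+druid.storage.sse.type=kms
+druid.storage.sse.kms.keyId=your-kms-key-id
+```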