From 26b2938678aae93ace81d44ee54b82b96a6650f7 Mon Sep 17 00:00:00 2001 From: Victoria Lim Date: Mon, 9 Sep 2024 18:50:28 -0700 Subject: [PATCH 1/8] add prerequisite info --- docs/development/extensions-core/mysql.md | 3 +- docs/ingestion/input-sources.md | 46 +++++++++++++---------- 2 files changed, 27 insertions(+), 22 deletions(-) diff --git a/docs/development/extensions-core/mysql.md b/docs/development/extensions-core/mysql.md index bc6012dbb5a3..b7f296781027 100644 --- a/docs/development/extensions-core/mysql.md +++ b/docs/development/extensions-core/mysql.md @@ -40,8 +40,7 @@ install it separately. There are a few ways to obtain this library: - It may be available through your package manager, e.g. as `libmysql-java` on APT for a Debian-based OS This fetches the MySQL connector JAR file with a name like `mysql-connector-j-8.2.0.jar`. - -Copy or symlink this file inside the folder `extensions/mysql-metadata-storage` under the distribution root directory. +Copy or symlink this file inside the folder `lib` under the distribution root directory. ## Alternative: Installing the MariaDB connector library diff --git a/docs/ingestion/input-sources.md b/docs/ingestion/input-sources.md index 71340abc2c0b..5179cf173cad 100644 --- a/docs/ingestion/input-sources.md +++ b/docs/ingestion/input-sources.md @@ -29,10 +29,8 @@ For general information on native batch indexing and parallel task indexing, see ## S3 input source -:::info - -You need to include the [`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension to use the S3 input source. - +:::info Required extension +To use the S3 input source, load the extension [`druid-s3-extensions`](../development/extensions-core/s3.md) in your `common.runtime.properties` file. ::: The S3 input source reads objects directly from S3. You can specify either: @@ -41,7 +39,7 @@ The S3 input source reads objects directly from S3. 
You can specify either: * a list of S3 location prefixes that attempts to list the contents and ingest all objects contained within the locations. -The S3 input source is splittable. Therefore, you can use it with the [Parallel task](./native-batch.md). Each worker task of `index_parallel` reads one or multiple objects. +The S3 input source is splittable. Therefore, you can use it with the [parallel task](./native-batch.md). Each worker task of `index_parallel` reads one or multiple objects. Sample specs: @@ -228,7 +226,7 @@ You need to include the [`druid-google-extensions`](../development/extensions-co The Google Cloud Storage input source is to support reading objects directly from Google Cloud Storage. Objects can be specified as list of Google Cloud Storage URI strings. The Google Cloud Storage input source is splittable -and can be used by the [Parallel task](./native-batch.md), where each worker task of `index_parallel` will read +and can be used by the [parallel task](./native-batch.md), where each worker task of `index_parallel` will read one or multiple objects. Sample specs: @@ -314,7 +312,7 @@ You need to include the [`druid-azure-extensions`](../development/extensions-cor ::: The Azure input source (that uses the type `azureStorage`) reads objects directly from Azure Blob store or Azure Data Lake sources. You can -specify objects as a list of file URI strings or prefixes. You can split the Azure input source for use with [Parallel task](./native-batch.md) indexing and each worker task reads one chunk of the split data. +specify objects as a list of file URI strings or prefixes. You can split the Azure input source for use with [parallel task](./native-batch.md) indexing and each worker task reads one chunk of the split data. The `azureStorage` input source is a new schema for Azure input sources that allows you to specify which storage account files should be ingested from. 
We recommend that you update any specs that use the old `azure` schema to use the new `azureStorage` schema. The new schema provides more functionality than the older `azure` schema. @@ -499,7 +497,7 @@ You need to include the [`druid-hdfs-storage`](../development/extensions-core/hd The HDFS input source is to support reading files directly from HDFS storage. File paths can be specified as an HDFS URI string or a list -of HDFS URI strings. The HDFS input source is splittable and can be used by the [Parallel task](./native-batch.md), +of HDFS URI strings. The HDFS input source is splittable and can be used by the [parallel task](./native-batch.md), where each worker task of `index_parallel` will read one or multiple files. Sample specs: @@ -593,7 +591,7 @@ The `http` input source is not limited to the HTTP or HTTPS protocols. It uses t For more information about security best practices, see [Security overview](../operations/security-overview.md#best-practices). -The HTTP input source is _splittable_ and can be used by the [Parallel task](./native-batch.md), +The HTTP input source is _splittable_ and can be used by the [parallel task](./native-batch.md), where each worker task of `index_parallel` will read only one file. This input source does not support Split Hint Spec. Sample specs: @@ -701,7 +699,7 @@ Sample spec: The Local input source is to support reading files directly from local storage, and is mainly intended for proof-of-concept testing. -The Local input source is _splittable_ and can be used by the [Parallel task](./native-batch.md), +The Local input source is _splittable_ and can be used by the [parallel task](./native-batch.md), where each worker task of `index_parallel` will read one or multiple files. Sample spec: @@ -736,7 +734,7 @@ Sample spec: The Druid input source is to support reading data directly from existing Druid segments, potentially using a new schema and changing the name, dimensions, metrics, rollup, etc. of the segment. 
-The Druid input source is _splittable_ and can be used by the [Parallel task](./native-batch.md).
+The Druid input source is _splittable_ and can be used by the [parallel task](./native-batch.md).

This input source has a fixed input format for reading from Druid segments; no `inputFormat` field needs to be specified in the ingestion spec when using this input source.

@@ -833,16 +831,28 @@ For more information on the `maxNumConcurrentSubTasks` field, see [Implementatio

## SQL input source

+:::info Required extension
+To use the SQL input source, you must load the appropriate extension in your `common.runtime.properties` file.
+* To connect to MySQL, load the extension [`mysql-metadata-storage`](../development/extensions-core/mysql.md).
+* To connect to PostgreSQL, load the extension [`postgresql-metadata-storage`](../development/extensions-core/postgresql.md).
+
+The MySQL extension requires a JDBC driver.
+For more information, see [Installing the MySQL connector library](../development/extensions-core/mysql.md).
+:::
+
The SQL input source is used to read data directly from RDBMS.
-The SQL input source is _splittable_ and can be used by the [Parallel task](./native-batch.md), where each worker task will read from one SQL query from the list of queries.
+The SQL input source is _splittable_ and can be used by the [parallel task](./native-batch.md), where each worker task reads from one SQL query in the list of queries.
This input source does not support Split Hint Spec.
+
+The SQL input source has a fixed input format for reading events.
+Don't specify `inputFormat` when using this input source.
+
+Refer to the [recommended practices](#recommended-practices) before using this input source.
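For orientation, the `database` and `sqls` properties described in this section combine into an `inputSource` object along the lines of the following sketch. The host, schema, credentials, and query are placeholders, not values from this document:

```json
{
  "type": "sql",
  "database": {
    "type": "mysql",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://db.example.com:3306/mydb",
      "user": "admin",
      "password": "secret"
    }
  },
  "sqls": [
    "SELECT timestamp, col1, col2 FROM events WHERE timestamp BETWEEN '2013-01-01 00:00:00' AND '2013-01-01 11:59:59'"
  ]
}
```

With multiple entries in `sqls`, each worker task of `index_parallel` reads from one of the listed queries.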
|Property|Description|Required|
|--------|-----------|---------|
|type|Set the value to `sql`.|Yes|
-|database|Specifies the database connection details. The database type corresponds to the extension that supplies the `connectorConfig` support. The specified extension must be loaded into Druid:<br/><br/>- [`mysql-metadata-storage`](../development/extensions-core/mysql.md) for `mysql`<br/>- [`postgresql-metadata-storage`](../development/extensions-core/postgresql.md) for `postgresql`<br/><br/>You can selectively allow JDBC properties in `connectURI`. See [JDBC connections security config](../configuration/index.md#jdbc-connections-to-external-databases) for more details.|Yes|
+|database|Specifies the database connection details. The database type corresponds to the extension that supplies the `connectorConfig` support.<br/><br/>You can selectively allow JDBC properties in `connectURI`. See [JDBC connections security config](../configuration/index.md#jdbc-connections-to-external-databases) for more details.|Yes|
|foldCase|Toggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.|No|
|sqls|List of SQL queries where each SQL query would retrieve the data to be indexed.|Yes|

@@ -887,7 +897,7 @@ Compared to the other native batch input sources, SQL input source behaves diffe

The Combining input source lets you read data from multiple input sources. It identifies the splits from delegate input sources and uses a worker task to process each split.
-Use the Combining input source only if all the delegates are splittable and can be used by the [Parallel task](./native-batch.md).
+Use the Combining input source only if all the delegates are splittable and can be used by the [parallel task](./native-batch.md).

Similar to other input sources, the Combining input source supports a single `inputFormat`.
Delegate input sources that require an `inputFormat` must have the same format for input data.

@@ -932,9 +942,7 @@ The following is an example of a Combining input source spec:

## Iceberg input source

:::info
-
To use the Iceberg input source, load the extension [`druid-iceberg-extensions`](../development/extensions-contrib/iceberg.md).
-
:::

You use the Iceberg input source to read data stored in the Iceberg table format. For a given table, the input source scans up to the latest Iceberg snapshot from the configured Hive catalog. Druid ingests the underlying live data files using the existing input source formats.

@@ -1139,9 +1147,7 @@ This input source provides the following filters: `and`, `equals`, `interval`, a

## Delta Lake input source

:::info
-
To use the Delta Lake input source, load the extension [`druid-deltalake-extensions`](../development/extensions-contrib/delta-lake.md).
- ::: You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans From a3a3eac9bf38390bcf9b6a1cd46791f8ce913749 Mon Sep 17 00:00:00 2001 From: Victoria Lim Date: Mon, 9 Sep 2024 19:10:55 -0700 Subject: [PATCH 2/8] update mysql intro --- docs/development/extensions-core/mysql.md | 34 +++++++++++++++-------- 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/docs/development/extensions-core/mysql.md b/docs/development/extensions-core/mysql.md index b7f296781027..84e7f55e362d 100644 --- a/docs/development/extensions-core/mysql.md +++ b/docs/development/extensions-core/mysql.md @@ -25,24 +25,34 @@ title: "MySQL Metadata Store" To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `mysql-metadata-storage` in the extensions load list. -:::info - The MySQL extension requires the MySQL Connector/J library or MariaDB Connector/J library, neither of which are included in the Druid distribution. - Refer to the following section for instructions on how to install this library. -::: +The MySQL extension lets you use MySQL as a metadata store or ingest from a MySQL database. -## Installing the MySQL connector library +The extension requires a connector library that's not included with Druid. +See the [Prerequisites](#prerequisites) for installation instructions. -This extension can use Oracle's MySQL JDBC driver which is not included in the Druid distribution. You must -install it separately. There are a few ways to obtain this library: +## Prerequisites -- It can be downloaded from the MySQL site at: https://dev.mysql.com/downloads/connector/j/ -- It can be fetched from Maven Central at: https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.2.0/mysql-connector-j-8.2.0.jar -- It may be available through your package manager, e.g. 
as `libmysql-java` on APT for a Debian-based OS +To use the MySQL extension, you need to install one of the following libraries: +* [MySQL Connector/J](#install-the-mysql-connector-library) +* [MariaDB Connector/J](#install-the-mariadb-connector-library) -This fetches the MySQL connector JAR file with a name like `mysql-connector-j-8.2.0.jar`. +### Install the MySQL connector library + +The MySQL extension uses Oracle's MySQL JDBC driver. +The current version of Druid uses version 8.2.0. +Other versions may not work with this extension. + +You can download the library from various sources: + +- [Maven Central (direct download)](https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.2.0/mysql-connector-j-8.2.0.jar) +- [MySQL website](https://dev.mysql.com/downloads/connector/j/) + Visit the archives page to download older product versions. +- Your package manager. For example, `libmysql-java` on APT for a Debian-based OS. + +The download includes the MySQL connector JAR file with a name like `mysql-connector-j-8.2.0.jar`. Copy or symlink this file inside the folder `lib` under the distribution root directory. -## Alternative: Installing the MariaDB connector library +### Install the MariaDB connector library This extension also supports using the MariaDB connector jar, though it is also not included in the Druid distribution, so you must install it separately. 
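The connector install steps above — download the JAR, then copy or symlink it into `lib` under the distribution root — can be sketched as a shell session. Here `DRUID_HOME` and the JAR are local stand-ins created for illustration, not a real distribution or driver:

```shell
# Placeholder distribution root; in practice this is where you unpacked Druid.
DRUID_HOME="$(mktemp -d)"
mkdir -p "${DRUID_HOME}/lib"

# Stand-in for the connector JAR you downloaded from one of the sources above.
JAR="mysql-connector-j-8.2.0.jar"
touch "${JAR}"

# Symlink (or `cp`) the driver into lib/ so Druid can load it at startup.
ln -sf "$(pwd)/${JAR}" "${DRUID_HOME}/lib/${JAR}"

ls "${DRUID_HOME}/lib"
# prints: mysql-connector-j-8.2.0.jar
```

After restarting Druid, the driver on the `lib` classpath is available to the `mysql-metadata-storage` extension.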
From 7aeb6d623b71ba3eeebf5e7668f3e7095bbc246e Mon Sep 17 00:00:00 2001 From: Victoria Lim Date: Tue, 10 Sep 2024 16:17:18 -0700 Subject: [PATCH 3/8] explain foldCase, update headings and intros --- .../extensions-core/druid-lookups.md | 2 +- docs/development/extensions-core/mysql.md | 50 ++++++++++++------- .../development/extensions-core/postgresql.md | 13 +++-- docs/ingestion/input-sources.md | 2 +- docs/querying/lookups-cached-global.md | 2 +- 5 files changed, 42 insertions(+), 27 deletions(-) diff --git a/docs/development/extensions-core/druid-lookups.md b/docs/development/extensions-core/druid-lookups.md index d6219b8c7428..aabb65e7b377 100644 --- a/docs/development/extensions-core/druid-lookups.md +++ b/docs/development/extensions-core/druid-lookups.md @@ -33,7 +33,7 @@ To use this Apache Druid extension, [include](../../configuration/extensions.md# :::info If using JDBC, you will need to add your database's client JAR files to the extension's directory. For Postgres, the connector JAR is already included. - See the MySQL extension documentation for instructions to obtain [MySQL](./mysql.md#installing-the-mysql-connector-library) or [MariaDB](./mysql.md#alternative-installing-the-mariadb-connector-library) connector libraries. + See the MySQL extension documentation for instructions to obtain [MySQL](./mysql.md#install-mysql-connectorj) or [MariaDB](./mysql.md#install-mariadb-connectorj) connector libraries. Copy or symlink the downloaded file to `extensions/druid-lookups-cached-single` under the distribution root directory. ::: diff --git a/docs/development/extensions-core/mysql.md b/docs/development/extensions-core/mysql.md index 84e7f55e362d..597cf380bb5e 100644 --- a/docs/development/extensions-core/mysql.md +++ b/docs/development/extensions-core/mysql.md @@ -1,6 +1,6 @@ --- id: mysql -title: "MySQL Metadata Store" +title: "MySQL metadata store" ---
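Tying the patches together, a hedged sketch of the `common.runtime.properties` entries for using MySQL as the metadata store might look as follows. The host, database name, and credentials are placeholders; the property names follow Druid's configuration reference:

```properties
# Load the extension (the connector JAR must already be in lib/).
druid.extensions.loadList=["mysql-metadata-storage"]

# Metadata store connection; host, database, and credentials are placeholders.
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```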