From a2904a1de98ef8bce939e7e8954af498f76b3db4 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Mon, 5 Aug 2024 14:12:27 -0700
Subject: [PATCH 1/7] docs: update query from deepstorage segment requirement

---
 docs/configuration/index.md                   | 4 +++-
 docs/querying/query-from-deep-storage.md      | 4 +++-
 docs/tutorials/tutorial-query-deep-storage.md | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 3b3c2711d3b4..c1abb125ab27 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -595,7 +595,9 @@ need arises.
 |`druid.centralizedDatasourceSchema.enabled`|Boolean flag for enabling datasource schema building in the Coordinator, this should be specified in the common runtime properties.|false|No.|
 |`druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled`| This config should be set when CentralizedDatasourceSchema feature is enabled. This should be specified in the MiddleManager runtime properties.|false|No.|
 
-For, stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
+If you enable this feature, you can query datasources that are only stored in cold storage and are not cached on a Historical.
+
+For, stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
 
 ### Ingestion security configuration
 
diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index 1ce74818655d..d57768821664 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -66,7 +66,9 @@ You can also confirm this through the Druid console. On the **Segments** page, s
 Keep the following in mind when working with load rules to control what exists only in deep storage:
 
-- At least one of the segments in a datasource must be loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment. One way to verify that a datasource has at least one segment on a Historical process is if it's visible in the Druid console.
+- Your datasource must meet one of the following conditions:
+  - At least one of the segments in a datasource is loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment. One way to verify that a datasource has at least one segment on a Historical process is if it's visible in the Druid console.
+  - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema)
 - The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
 
 ## Run a query from deep storage
 
diff --git a/docs/tutorials/tutorial-query-deep-storage.md b/docs/tutorials/tutorial-query-deep-storage.md
index dfb4de22eb01..63611de692c9 100644
--- a/docs/tutorials/tutorial-query-deep-storage.md
+++ b/docs/tutorials/tutorial-query-deep-storage.md
@@ -25,7 +25,7 @@ sidebar_label: "Query from deep storage"
 
 Query from deep storage allows you to query segments that are stored only in deep storage, which provides lower costs than if you were to load everything onto Historical processes. The tradeoff is that queries from deep storage may take longer to complete.
 
-This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical processes, and querying data from deep storage.
+This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical processes, and querying data from deep storage. If you have [centralized datasource schema enabled](../configuration/index.md#centralized-datasource-schema), you can query datasources that are only in deep storage and don't need to make sure at least one segment is available on a Historical.
 
 To run the queries in this tutorial, replace `ROUTER:PORT` with the location of the Router process and its port number. For example, use `localhost:8888` for the quickstart deployment.

From 708a9208504df53fc37f1b943f0b6f84f74888b2 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Mon, 5 Aug 2024 14:15:20 -0700
Subject: [PATCH 2/7] typo

---
 docs/querying/query-from-deep-storage.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index d57768821664..08445775fe5c 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -66,9 +66,9 @@ You can also confirm this through the Druid console. On the **Segments** page, s
 Keep the following in mind when working with load rules to control what exists only in deep storage:
 
-- Your datasource must meet one of the following conditions:
-  - At least one of the segments in a datasource is loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment. One way to verify that a datasource has at least one segment on a Historical process is if it's visible in the Druid console.
-  - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema)
+- To be queryable, your datasource must meet one of the following conditions:
+  - At least one of the segments in the datasource is loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment. One way to verify that a datasource has at least one segment on a Historical process is if it's visible in the Druid console.
+  - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
 - The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
 
 ## Run a query from deep storage

From 780a115c74216b062fb06102c3eeac383199ddca Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Tue, 6 Aug 2024 11:13:26 -0700
Subject: [PATCH 3/7] Apply suggestions from code review

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
---
 docs/configuration/index.md                   | 2 +-
 docs/querying/query-from-deep-storage.md      | 2 +-
 docs/tutorials/tutorial-query-deep-storage.md | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index c1abb125ab27..c2c56e56cfad 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -597,7 +597,7 @@ need arises.
 
 If you enable this feature, you can query datasources that are only stored in cold storage and are not cached on a Historical.
 
-For, stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
+For stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
 
 ### Ingestion security configuration
 
diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index 08445775fe5c..7f89ed714fb2 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -67,7 +67,7 @@ You can also confirm this through the Druid console. On the **Segments** page, s
 Keep the following in mind when working with load rules to control what exists only in deep storage:
 
 - To be queryable, your datasource must meet one of the following conditions:
-  - At least one of the segments in the datasource is loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment. One way to verify that a datasource has at least one segment on a Historical process is if it's visible in the Druid console.
+  - At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
   - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
 - The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
 
diff --git a/docs/tutorials/tutorial-query-deep-storage.md b/docs/tutorials/tutorial-query-deep-storage.md
index 63611de692c9..36c93f9ae76e 100644
--- a/docs/tutorials/tutorial-query-deep-storage.md
+++ b/docs/tutorials/tutorial-query-deep-storage.md
@@ -25,7 +25,7 @@ sidebar_label: "Query from deep storage"
 
 Query from deep storage allows you to query segments that are stored only in deep storage, which provides lower costs than if you were to load everything onto Historical processes. The tradeoff is that queries from deep storage may take longer to complete.
 
-This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical processes, and querying data from deep storage. If you have [centralized datasource schema enabled](../configuration/index.md#centralized-datasource-schema), you can query datasources that are only in deep storage and don't need to make sure at least one segment is available on a Historical.
+This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical services, and querying data from deep storage. If you have [centralized datasource schema enabled](../configuration/index.md#centralized-datasource-schema), you can query datasources that are only in deep storage and don't need to make sure at least one segment is available on a Historical.
 
 To run the queries in this tutorial, replace `ROUTER:PORT` with the location of the Router process and its port number. For example, use `localhost:8888` for the quickstart deployment.
From 3c1b2d04bbbe9dfb6c04eaf96743f6a30477c8a8 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Fri, 9 Aug 2024 12:38:58 -0700
Subject: [PATCH 4/7] update reqs for centralized datasource schema

---
 docs/configuration/index.md              |  2 +-
 docs/querying/query-from-deep-storage.md | 20 +++++++++++---------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index c2c56e56cfad..19d589e1e131 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -595,7 +595,7 @@ need arises.
 |`druid.centralizedDatasourceSchema.enabled`|Boolean flag for enabling datasource schema building in the Coordinator, this should be specified in the common runtime properties.|false|No.|
 |`druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled`| This config should be set when CentralizedDatasourceSchema feature is enabled. This should be specified in the MiddleManager runtime properties.|false|No.|
 
-If you enable this feature, you can query datasources that are only stored in cold storage and are not cached on a Historical.
+If you enable this feature, you can query datasources that are only stored in cold storage and are not loaded on a Historical. For more information, see [Query from deep storage](../querying/query-from-deep-storage.md).
 
 For stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
 
diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index 7f89ed714fb2..1b12eae7ea25 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -28,13 +28,20 @@ Druid can query segments that are only stored in deep storage. Running a query f
 Query from deep storage requires the Multi-stage query (MSQ) task engine. Load the extension for it if you don't already have it enabled before you begin. See [enable MSQ](../multi-stage-query/index.md#load-the-extension) for more information.
 
+To be queryable, your datasource must meet one of the following conditions:
+
+- At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
+- You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
+
+If you use centralized data source schemas, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the cold segments onto a Historical so that the schema can be backfilled in the metadata database by changing your load rules. You can load some or all of the segments that are only in deep storage. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
+
 ## Keep segments in deep storage only
 
-Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes.
+Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes. If you use centralized data source schemas, a datasource can be kept only in deep storage but remain queryable.
 
-To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to manage the which segments are only in deep storage and which get loaded onto Historical processes.
+To manage the which segments are kept only in deep storage and which get loaded onto Historical processes., configure [load rules](../operations/rule-configuration.md#load-rules)
 
-The easiest way to do this is to explicitly configure the segments that don't get loaded onto Historical processes. Set `tieredReplicants` to an empty array and `useDefaultTierForNull` to `false`. For example, if you configure the following rule for a datasource:
+The easiest way to keep segments only in deep storage is to explicitly configure the segments that don't get loaded onto Historical processes. Set `tieredReplicants` to an empty array and `useDefaultTierForNull` to `false`. For example, if you configure the following rule for a datasource:
 
 ```json
 [
@@ -64,12 +71,7 @@ Segments with a `replication_factor` of `0` are not assigned to any Historical t
 You can also confirm this through the Druid console. On the **Segments** page, see the **Replication factor** column.
 
-Keep the following in mind when working with load rules to control what exists only in deep storage:
-
-- To be queryable, your datasource must meet one of the following conditions:
-  - At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
-  - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
-- The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
+Note that the actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
 
 ## Run a query from deep storage

From 99bcf4f5e8aecc15c133a8ee2c20c11ee8cee457 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Mon, 12 Aug 2024 11:09:47 -0700
Subject: [PATCH 5/7] load all or some segments

---
 docs/querying/query-from-deep-storage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index 1b12eae7ea25..7ee805337366 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -33,7 +33,7 @@ To be queryable, your datasource must meet one of the following conditions:
 - At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
 - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
 
-If you use centralized data source schemas, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the cold segments onto a Historical so that the schema can be backfilled in the metadata database by changing your load rules. You can load some or all of the segments that are only in deep storage. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
+If you use centralized data source schemas, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the cold segments onto a Historical so that the schema can be backfilled in the metadata database. You can load some or all of the segments that are only in deep storage. Note that if you choose to not load all the segments, any dimensions that appear only in the segments you didn't load won't be queryable. That is, only the dimensions that are in the metadata database are queryable. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
 
 ## Keep segments in deep storage only

From 1bc92b42083525f3c5028e9e81bf72d73ae341f8 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Mon, 12 Aug 2024 11:15:29 -0700
Subject: [PATCH 6/7] load all or some segments

---
 docs/querying/query-from-deep-storage.md | 2 +-
 website/.spelling                        | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index 7ee805337366..d297062bf5c7 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -33,7 +33,7 @@ To be queryable, your datasource must meet one of the following conditions:
 - At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
 - You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
 
-If you use centralized data source schemas, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the cold segments onto a Historical so that the schema can be backfilled in the metadata database. You can load some or all of the segments that are only in deep storage. Note that if you choose to not load all the segments, any dimensions that appear only in the segments you didn't load won't be queryable. That is, only the dimensions that are in the metadata database are queryable. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
+If you use centralized data source schemas, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the cold segments onto a Historical so that the schema can be backfilled in the metadata database. You can load some or all of the segments that are only in deep storage. If you don't load all the segments, any dimensions that are only in the segments you didn't load will not be in the queryable datasource schema and won't be queryable from deep storage. That is, only the dimensions that are in the metadata database and the schema are queryable. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
 
 ## Keep segments in deep storage only
 
diff --git a/website/.spelling b/website/.spelling
index 1ac39471571d..b8467feb4f7e 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -276,6 +276,7 @@ averager
 averagers
 backend
 backfills
+backfilled
 backpressure
 base64
 big-endian

From e2f0c313a5f24d27918f18a445112bfc222b56c1 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Wed, 21 Aug 2024 12:51:42 -0700
Subject: [PATCH 7/7] Apply suggestions from code review

Co-authored-by: Rishabh Singh <6513075+findingrish@users.noreply.github.com>
---
 docs/configuration/index.md                   | 2 +-
 docs/querying/query-from-deep-storage.md      | 8 ++++----
 docs/tutorials/tutorial-query-deep-storage.md | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 19d589e1e131..704568c1cce0 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -595,7 +595,7 @@ need arises.
 |`druid.centralizedDatasourceSchema.enabled`|Boolean flag for enabling datasource schema building in the Coordinator, this should be specified in the common runtime properties.|false|No.|
 |`druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled`| This config should be set when CentralizedDatasourceSchema feature is enabled. This should be specified in the MiddleManager runtime properties.|false|No.|
 
-If you enable this feature, you can query datasources that are only stored in cold storage and are not loaded on a Historical. For more information, see [Query from deep storage](../querying/query-from-deep-storage.md).
+If you enable this feature, you can query datasources that are only stored in deep storage and are not loaded on a Historical. For more information, see [Query from deep storage](../querying/query-from-deep-storage.md).
 
 For stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
 
diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index d297062bf5c7..e90f2fb4b024 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -31,15 +31,15 @@ Query from deep storage requires the Multi-stage query (MSQ) task engine. Load t
 To be queryable, your datasource must meet one of the following conditions:
 
 - At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
-- You have the centralized data source schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
+- You have the centralized datasource schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
 
-If you use centralized data source schemas, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the cold segments onto a Historical so that the schema can be backfilled in the metadata database. You can load some or all of the segments that are only in deep storage. If you don't load all the segments, any dimensions that are only in the segments you didn't load will not be in the queryable datasource schema and won't be queryable from deep storage. That is, only the dimensions that are in the metadata database and the schema are queryable. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
+If you use centralized datasource schema, there's an additional step for any datasource created prior to enabling it to make the datasource queryable from deep storage. You need to load the segments from deep storage onto a Historical so that the schema can be backfilled in the metadata database. You can load some or all of the segments that are only in deep storage. If you don't load all the segments, any dimensions that are only in the segments you didn't load will not be in the queryable datasource schema and won't be queryable from deep storage. That is, only the dimensions that are present in the segment schema in the metadata database are queryable. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
 
 ## Keep segments in deep storage only
 
-Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes. If you use centralized data source schemas, a datasource can be kept only in deep storage but remain queryable.
+Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes. If you use centralized datasource schema, a datasource can be kept only in deep storage but remain queryable.
 
-To manage the which segments are kept only in deep storage and which get loaded onto Historical processes., configure [load rules](../operations/rule-configuration.md#load-rules)
+To manage which segments are kept only in deep storage and which get loaded onto Historical processes, configure [load rules](../operations/rule-configuration.md#load-rules).
 
 The easiest way to keep segments only in deep storage is to explicitly configure the segments that don't get loaded onto Historical processes. Set `tieredReplicants` to an empty array and `useDefaultTierForNull` to `false`. For example, if you configure the following rule for a datasource:
 
 ```json
 [
diff --git a/docs/tutorials/tutorial-query-deep-storage.md b/docs/tutorials/tutorial-query-deep-storage.md
index 36c93f9ae76e..1bd2b96501f7 100644
--- a/docs/tutorials/tutorial-query-deep-storage.md
+++ b/docs/tutorials/tutorial-query-deep-storage.md
@@ -25,7 +25,7 @@ sidebar_label: "Query from deep storage"
 
 Query from deep storage allows you to query segments that are stored only in deep storage, which provides lower costs than if you were to load everything onto Historical processes. The tradeoff is that queries from deep storage may take longer to complete.
 
-This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical services, and querying data from deep storage. If you have [centralized datasource schema enabled](../configuration/index.md#centralized-datasource-schema), you can query datasources that are only in deep storage and don't need to make sure at least one segment is available on a Historical.
+This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical services, and querying data from deep storage. If you have [centralized datasource schema enabled](../configuration/index.md#centralized-datasource-schema), you can query datasources that are only in deep storage without having any segment available on a Historical.
 
 To run the queries in this tutorial, replace `ROUTER:PORT` with the location of the Router process and its port number. For example, use `localhost:8888` for the quickstart deployment.