From 453f49b3e0edc227dcd749a7c1e485d79989004e Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Wed, 20 Feb 2019 18:04:42 -0800 Subject: [PATCH 1/3] Improve doc for auto compaction --- docs/content/configuration/index.md | 6 ++- docs/content/design/coordinator.md | 2 +- docs/content/operations/api-reference.md | 3 +- .../operations/segment-optimization.md | 54 ++++++++++++++++--- 4 files changed, 55 insertions(+), 10 deletions(-) diff --git a/docs/content/configuration/index.md b/docs/content/configuration/index.md index 22dd1aea59cd..ccbf60bc4178 100644 --- a/docs/content/configuration/index.md +++ b/docs/content/configuration/index.md @@ -835,8 +835,10 @@ These configuration options control the behavior of the Lookup dynamic configura ##### Compaction Dynamic Configuration -Compaction configurations can also be set or updated dynamically without restarting Coordinators. For segment compaction, -please see [Compacting Segments](../design/coordinator.html#compacting-segments). +Compaction configurations can also be set or updated dynamically using +[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators. + +For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.html). A description of the compaction config is: diff --git a/docs/content/design/coordinator.md b/docs/content/design/coordinator.md index a0fdb6b79133..49c56383cb87 100644 --- a/docs/content/design/coordinator.md +++ b/docs/content/design/coordinator.md @@ -66,7 +66,7 @@ To ensure an even distribution of segments across Historical nodes in the cluste ### Compacting Segments Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small -segments which may degrade the query performance as well as increasing the disk space usage. +segments which may degrade the query performance as well as increasing the disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details. The Coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy). Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments. diff --git a/docs/content/operations/api-reference.md b/docs/content/operations/api-reference.md index 407b47355dff..0e946374d940 100644 --- a/docs/content/operations/api-reference.md +++ b/docs/content/operations/api-reference.md @@ -341,7 +341,8 @@ will be set for them. * `/druid/coordinator/v1/config/compaction` -Creates or updates the compaction config for a dataSource. See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details. +Creates or updates the compaction config for a dataSource. +See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details. ##### DELETE diff --git a/docs/content/operations/segment-optimization.md b/docs/content/operations/segment-optimization.md index 2fea3ff3d9cd..188fdea3fda7 100644 --- a/docs/content/operations/segment-optimization.md +++ b/docs/content/operations/segment-optimization.md @@ -1,6 +1,6 @@ --- layout: doc_page -title: "Segment size optimization" +title: "Segment Size Optimization" --- -# Segment size optimization +# Segment Size Optimization In Druid, it's important to optimize the segment size because @@ -32,15 +32,57 @@ In Druid, it's important to optimize the segment size because which hold the input segments of the query. Each node has a processing threads pool and use one thread per segment to process it. If the segment size is too large, data might not be well distributed over the whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small, - each processing thread processes too small data. This might reduce the processing speed of other queries as well as - the input query itself because the processing threads are shared for executing all queries. + each processing thread might process too small data. This can reduce the overall processing speed because + parallel processing involves some overhead like thread scheduling. It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy -especially for the streaming ingestion because the amount of data ingested might vary over time. In this case, -you can roughly set the segment size at ingestion time and optimize it later. You have two options: +especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case, +you can create segments with a sub-optimzed size first and optimize them later. + +There might be several ways to check if the compaction is necessary. One way +is using the [System Schema](../querying/sql.html#system-schema). The +system schema provides several tables about the current system status including the `segments` table. +By running the below query, you can get the average number of rows and average size for published segments. + +```sql +SELECT + "start", + "end", + version, + COUNT(*) AS num_segments, + AVG("num_rows") AS avg_num_rows, + SUM("num_rows") AS total_num_rows, + AVG("size") AS avg_size, + SUM("size") AS total_size +FROM + sys.segments A +WHERE + datasource = 'your_dataSource' AND + is_published = 1 +GROUP BY 1, 2, 3 +ORDER BY 1, 2, 3 DESC; +``` + +Please note that the query result might include overshadowed segments. +In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`). + +The recomended number of rows per segment and segment size are 5 million rows and 300 ~ 700MB, respectively. +If you see way different numbers, you may need to compact your segments. + +
+The above values are generally good for most use cases, but the optimal +values can vary based on your workload. For example, if most of your queries +are heavy and take a long time to process each row, you may want to make +segments smaller so that the query processing can be more parallelized. +
+ +You have two options for segment compaction: - Turning on the [automatic compaction of Coordinators](../design/coordinator.html#compacting-segments). The Coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments. + To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration. + See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration) + and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for details. - Running periodic Hadoop batch ingestion jobs and using a `dataSource` inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found under ['Updating Existing Data'](../ingestion/update-existing-data.html). From 543135c8029e6c6ed46f470ebf7b2b523e1d773b Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Fri, 22 Feb 2019 15:43:44 -0800 Subject: [PATCH 2/3] fix doc --- docs/content/design/coordinator.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/design/coordinator.md b/docs/content/design/coordinator.md index 49c56383cb87..5dd2501ddadd 100644 --- a/docs/content/design/coordinator.md +++ b/docs/content/design/coordinator.md @@ -66,7 +66,7 @@ To ensure an even distribution of segments across Historical nodes in the cluste ### Compacting Segments Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small -segments which may degrade the query performance as well as increasing the disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details. +segments which may degrade query performance as well as increase disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details. The Coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy). Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments. From 23f01d306b8df4d4c0fbe8c9dcc8ce4af1685dba Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Thu, 28 Feb 2019 13:57:23 -0800 Subject: [PATCH 3/3] address comments --- .../operations/segment-optimization.md | 44 ++++++++++++------- 1 file changed, 28 insertions(+), 16 deletions(-) diff --git a/docs/content/operations/segment-optimization.md b/docs/content/operations/segment-optimization.md index 188fdea3fda7..a1f006b7c931 100644 --- a/docs/content/operations/segment-optimization.md +++ b/docs/content/operations/segment-optimization.md @@ -28,17 +28,39 @@ In Druid, it's important to optimize the segment size because 1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode, increasing the segment size might introduce further aggregation which reduces the dataSource size. - 2. When a query is submitted, that query is distributed to all Historicals and realtimes + 2. When a query is submitted, that query is distributed to all Historicals and realtime tasks which hold the input segments of the query. Each node has a processing threads pool and use one thread per segment to - process it. If the segment size is too large, data might not be well distributed over the - whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small, - each processing thread might process too small data. This can reduce the overall processing speed because - parallel processing involves some overhead like thread scheduling. + process it. If segment sizes are too large, data might not be well distributed between data + servers, decreasing the degree of parallelism possible during query processing. + At the other extreme where segment sizes are too small, the scheduling + overhead of processing a larger number of segments per query can reduce + performance, as the threads that process each segment compete for the fixed + slots of the processing pool. It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case, you can create segments with a sub-optimzed size first and optimize them later. +You may need to consider the followings to optimize your segments. + + - Number of rows per segment: it's generally recommended for each segment to have around 5 million rows. + This setting is usually _more_ important than the below "segment byte size". + This is because Druid uses a single thread to process each segment, + and thus this setting can directly control how many rows each thread processes, + which in turn means how well the query execution is parallelized. + - Segment byte size: it's recommended to set 300 ~ 700MB. If this value + doesn't match with the "number of rows per segment", please consider optimizing + number of rows per segment rather than this value. + +
+The above recommendation works in general, but the optimal setting can +vary based on your workload. For example, if most of your queries +are heavy and take a long time to process each row, you may want to make +segments smaller so that the query processing can be more parallelized. +If you still see some performance issue after optimizing segment size, +you may need to find the optimal settings for your workload. +
+ There might be several ways to check if the compaction is necessary. One way is using the [System Schema](../querying/sql.html#system-schema). The system schema provides several tables about the current system status including the `segments` table. @@ -66,17 +88,7 @@ ORDER BY 1, 2, 3 DESC; Please note that the query result might include overshadowed segments. In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`). -The recomended number of rows per segment and segment size are 5 million rows and 300 ~ 700MB, respectively. -If you see way different numbers, you may need to compact your segments. - -
-The above values are generally good for most use cases, but the optimal -values can vary based on your workload. For example, if most of your queries -are heavy and take a long time to process each row, you may want to make -segments smaller so that the query processing can be more parallelized. -
- -You have two options for segment compaction: +Once you find your segments need compaction, you can consider the below two options: - Turning on the [automatic compaction of Coordinators](../design/coordinator.html#compacting-segments). The Coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments.