diff --git a/docs/content/configuration/index.md b/docs/content/configuration/index.md index a08c50300ce1..e8d3bfe7ece4 100644 --- a/docs/content/configuration/index.md +++ b/docs/content/configuration/index.md @@ -835,8 +835,10 @@ These configuration options control the behavior of the Lookup dynamic configura ##### Compaction Dynamic Configuration -Compaction configurations can also be set or updated dynamically without restarting Coordinators. For segment compaction, -please see [Compacting Segments](../design/coordinator.html#compacting-segments). +Compaction configurations can also be set or updated dynamically using +[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators. + +For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.html). A description of the compaction config is: diff --git a/docs/content/design/coordinator.md b/docs/content/design/coordinator.md index efb9e9c1af07..9571f3a697bd 100644 --- a/docs/content/design/coordinator.md +++ b/docs/content/design/coordinator.md @@ -66,7 +66,7 @@ To ensure an even distribution of segments across Historical processes in the cl ### Compacting Segments Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small -segments which may degrade the query performance as well as increasing the disk space usage. +segments which may degrade query performance as well as increase disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details. The Coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy). Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments. diff --git a/docs/content/operations/api-reference.md b/docs/content/operations/api-reference.md index 2417ead3510e..5ad0e6c00285 100644 --- a/docs/content/operations/api-reference.md +++ b/docs/content/operations/api-reference.md @@ -341,7 +341,8 @@ will be set for them. * `/druid/coordinator/v1/config/compaction` -Creates or updates the compaction config for a dataSource. See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details. +Creates or updates the compaction config for a dataSource. +See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details. ##### DELETE diff --git a/docs/content/operations/segment-optimization.md b/docs/content/operations/segment-optimization.md index d13e4ba63f14..179b418b934e 100644 --- a/docs/content/operations/segment-optimization.md +++ b/docs/content/operations/segment-optimization.md @@ -1,6 +1,6 @@ --- layout: doc_page -title: "Segment size optimization" +title: "Segment Size Optimization" --- -# Segment size optimization +# Segment Size Optimization In Druid, it's important to optimize the segment size because 1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode, increasing the segment size might introduce further aggregation which reduces the dataSource size. - 2. When a query is submitted, that query is distributed to all Historicals and realtimes - which hold the input segments of the query. Each process has a processing threads pool and use one thread per segment to - process it. If the segment size is too large, data might not be well distributed over the - whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small, - each processing thread processes too small data. This might reduce the processing speed of other queries as well as - the input query itself because the processing threads are shared for executing all queries. + 2. When a query is submitted, that query is distributed to all Historicals and realtime tasks + which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool + to process a single segment. If segment sizes are too large, data might not be well distributed between data + servers, decreasing the degree of parallelism possible during query processing. + At the other extreme where segment sizes are too small, the scheduling + overhead of processing a larger number of segments per query can reduce + performance, as the threads that process each segment compete for the fixed + slots of the processing pool. It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy -especially for the streaming ingestion because the amount of data ingested might vary over time. In this case, -you can roughly set the segment size at ingestion time and optimize it later. You have two options: +especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case, +you can create segments with a sub-optimzed size first and optimize them later. + +You may need to consider the followings to optimize your segments. + + - Number of rows per segment: it's generally recommended for each segment to have around 5 million rows. + This setting is usually _more_ important than the below "segment byte size". + This is because Druid uses a single thread to process each segment, + and thus this setting can directly control how many rows each thread processes, + which in turn means how well the query execution is parallelized. + - Segment byte size: it's recommended to set 300 ~ 700MB. If this value + doesn't match with the "number of rows per segment", please consider optimizing + number of rows per segment rather than this value. + +
+The above recommendation works in general, but the optimal setting can +vary based on your workload. For example, if most of your queries +are heavy and take a long time to process each row, you may want to make +segments smaller so that the query processing can be more parallelized. +If you still see some performance issue after optimizing segment size, +you may need to find the optimal settings for your workload. +
+ +There might be several ways to check if the compaction is necessary. One way +is using the [System Schema](../querying/sql.html#system-schema). The +system schema provides several tables about the current system status including the `segments` table. +By running the below query, you can get the average number of rows and average size for published segments. + +```sql +SELECT + "start", + "end", + version, + COUNT(*) AS num_segments, + AVG("num_rows") AS avg_num_rows, + SUM("num_rows") AS total_num_rows, + AVG("size") AS avg_size, + SUM("size") AS total_size +FROM + sys.segments A +WHERE + datasource = 'your_dataSource' AND + is_published = 1 +GROUP BY 1, 2, 3 +ORDER BY 1, 2, 3 DESC; +``` + +Please note that the query result might include overshadowed segments. +In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`). + +Once you find your segments need compaction, you can consider the below two options: - Turning on the [automatic compaction of Coordinators](../design/coordinator.html#compacting-segments). The Coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments. + To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration. + See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration) + and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for details. - Running periodic Hadoop batch ingestion jobs and using a `dataSource` inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found under ['Updating Existing Data'](../ingestion/update-existing-data.html).