diff --git a/docs/content/configuration/index.md b/docs/content/configuration/index.md
index 2486ff442cfd..858c2a2dc24e 100644
--- a/docs/content/configuration/index.md
+++ b/docs/content/configuration/index.md
@@ -832,8 +832,10 @@ These configuration options control the behavior of the Lookup dynamic configura
 
 ##### Compaction Dynamic Configuration
 
-Compaction configurations can also be set or updated dynamically without restarting Coordinators. For segment compaction,
-please see [Compacting Segments](../design/coordinator.html#compacting-segments).
+Compaction configurations can also be set or updated dynamically using the
+[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators.
+
+For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.html).
 
 A description of the compaction config is:
 
diff --git a/docs/content/design/coordinator.md b/docs/content/design/coordinator.md
index efb9e9c1af07..9571f3a697bd 100644
--- a/docs/content/design/coordinator.md
+++ b/docs/content/design/coordinator.md
@@ -66,7 +66,7 @@ To ensure an even distribution of segments across Historical processes in the cl
 ### Compacting Segments
 
 Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small
-segments which may degrade the query performance as well as increasing the disk space usage.
+segments which may degrade query performance as well as increase disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details.
 
 The Coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy).
 Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments.
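As a hedged illustration of the dynamic config this hunk points readers to: the config is set per dataSource through the Coordinator. The sketch below is invented for this note — the exact field set is defined by the compaction config table in `configuration/index.md`, and the values here (dataSource name, sizes, priority, offset) are made up, not defaults:

```json
{
  "dataSource": "wikipedia",
  "targetCompactionSizeBytes": 419430400,
  "inputSegmentSizeBytes": 419430400,
  "skipOffsetFromLatest": "P1D",
  "taskPriority": 25
}
```

A payload in this shape would be submitted to the Coordinator endpoint covered in the `api-reference.md` hunk below; consult the linked configuration table for the authoritative field names and defaults.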
diff --git a/docs/content/operations/api-reference.md b/docs/content/operations/api-reference.md
index 2417ead3510e..5ad0e6c00285 100644
--- a/docs/content/operations/api-reference.md
+++ b/docs/content/operations/api-reference.md
@@ -341,7 +341,8 @@ will be set for them.
 
 * `/druid/coordinator/v1/config/compaction`
 
-Creates or updates the compaction config for a dataSource. See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details.
+Creates or updates the compaction config for a dataSource.
+See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details.
 
 ##### DELETE
 
diff --git a/docs/content/operations/segment-optimization.md b/docs/content/operations/segment-optimization.md
index d13e4ba63f14..179b418b934e 100644
--- a/docs/content/operations/segment-optimization.md
+++ b/docs/content/operations/segment-optimization.md
@@ -1,6 +1,6 @@
 ---
 layout: doc_page
-title: "Segment size optimization"
+title: "Segment Size Optimization"
 ---
 
-# Segment size optimization
+# Segment Size Optimization
@@ -8,13 +8,27 @@
 In Druid, it's important to optimize the segment size because
 1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode,
 increasing the segment size might introduce further aggregation which reduces the dataSource size.
- 2. When a query is submitted, that query is distributed to all Historicals and realtimes
- which hold the input segments of the query. Each process has a processing threads pool and use one thread per segment to
- process it. If the segment size is too large, data might not be well distributed over the
- whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small,
- each processing thread processes too small data. This might reduce the processing speed of other queries as well as
- the input query itself because the processing threads are shared for executing all queries.
+ 2. When a query is submitted, that query is distributed to all Historicals and realtime tasks
+ which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
+ to process a single segment. If segment sizes are too large, data might not be well distributed between data
+ servers, decreasing the degree of parallelism possible during query processing.
+ At the other extreme where segment sizes are too small, the scheduling
+ overhead of processing a larger number of segments per query can reduce
+ performance, as the threads that process each segment compete for the fixed
+ slots of the processing pool.
 
 It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy
-especially for the streaming ingestion because the amount of data ingested might vary over time. In this case,
-you can roughly set the segment size at ingestion time and optimize it later. You have two options:
+especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
+you can create segments with a sub-optimized size first and optimize them later.
+
+You may need to consider the following to optimize your segments.
+
+ - Number of rows per segment: it's generally recommended for each segment to have around 5 million rows.
+   This setting is usually _more_ important than the "segment byte size" below.
+   This is because Druid uses a single thread to process each segment,
+   and thus this setting can directly control how many rows each thread processes,
+   which in turn determines how well query execution is parallelized.
+ - Segment byte size: it's recommended to be around 300 ~ 700MB. If this value
+   doesn't match with the "number of rows per segment", please consider optimizing the
+   number of rows per segment rather than this value.
+