```diff
 Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small
-segments which may degrade the query performance as well as increasing the disk space usage.
+segments which may degrade the query performance as well as increasing the disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details.
```
"segments which may degrade the query performance as well as increasing the disk space usage" -> "segments which may degrade query performance as well as increase disk space usage"
```diff
 Please note that the query result might include overshadowed segments.
 In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`).
```
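The filtering step the diff describes, keeping only the max-version row per interval, can be sketched in Python. This is an illustrative sketch, not Druid's actual metadata schema: the `(start, end, version)` tuples and the `latest_per_interval` helper are hypothetical names standing in for whatever the segment query returns.

```python
# Hypothetical segment metadata rows: (start, end, version).
# Field names are illustrative, not Druid's exact schema.
segments = [
    ("2018-01-01", "2018-01-02", "v1"),
    ("2018-01-01", "2018-01-02", "v2"),  # overshadows v1 for the same interval
    ("2018-01-02", "2018-01-03", "v1"),
]

def latest_per_interval(rows):
    """Keep only rows whose version is the max for their (start, end) interval,
    i.e. drop overshadowed segments."""
    best = {}
    for start, end, version in rows:
        key = (start, end)
        if key not in best or version > best[key]:
            best[key] = version
    return [(s, e, v) for (s, e), v in best.items()]

print(latest_per_interval(segments))
```

Grouping by the `(start, end)` pair and taking the max version mirrors what the doc suggests doing over the query result.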
```diff
 The recommended number of rows per segment and segment size are 5 million rows and 300 ~ 700MB, respectively.
```
I think this part about recommended sizing should probably go before the example of how to find out whether or not you need to use compaction, maybe before or after the part about the impacts of sizing too big or too small?
It also might be worth suggesting that the row count is maybe a more important number for performance than raw segment size, so extreme cases of very few or very many columns might call for different segment sizes?
Sounds good. Moved this part and emphasized the importance of # of rows.
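A rough way to turn the quoted recommendation (~5 million rows, 300 ~ 700MB per segment) into a check is sketched below. Everything here is hypothetical: the `needs_compaction` helper, its input tuples, and the "far below target" thresholds are illustrative choices, not a Druid API or an official heuristic.

```python
# Recommended targets quoted above: ~5 million rows, 300-700 MB per segment.
TARGET_ROWS = 5_000_000
TARGET_BYTES_LOW = 300 * 1024 * 1024

def needs_compaction(segment_stats, row_target=TARGET_ROWS, byte_floor=TARGET_BYTES_LOW):
    """segment_stats: list of (num_rows, size_bytes) per segment.
    Flags the datasource when the average segment is far below target.
    Row count is checked first since, per the review discussion, it matters
    more for performance than raw segment size."""
    if not segment_stats:
        return False
    avg_rows = sum(r for r, _ in segment_stats) / len(segment_stats)
    avg_bytes = sum(b for _, b in segment_stats) / len(segment_stats)
    # "Far below" is an arbitrary factor of 10 here, purely for illustration.
    return avg_rows < row_target / 10 or avg_bytes < byte_floor / 10

# Many tiny segments -> worth compacting.
tiny = [(50_000, 4 * 1024 * 1024)] * 100
# Well-sized segments -> leave alone.
good = [(5_000_000, 500 * 1024 * 1024)] * 3
```

In practice you would feed this from the segment metadata the doc shows how to query, rather than hard-coded tuples.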
```diff
-each processing thread processes too small data. This might reduce the processing speed of other queries as well as
-the input query itself because the processing threads are shared for executing all queries.
+each processing thread might process too small data. This can reduce the overall processing speed because
+parallel processing involves some overhead like thread scheduling.
```
Just a suggestion for this section, feel free to change or not:
> If segment sizes are too large, data might not be well distributed between data servers, decreasing the degree of parallelism possible during query processing. At the other extreme where segment sizes are too small, the scheduling overhead of processing a larger number of segments per query can reduce performance, as the threads that process each segment compete for the fixed slots of the processing pool.
Looks good to me. Thanks!
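The scheduling-overhead tradeoff in the suggested wording can be made concrete with some back-of-the-envelope arithmetic. The numbers below (1 TB of data, 64 processing threads, 50 MB vs 500 MB segments) are hypothetical, chosen only to show how segment size drives the number of scheduling rounds per query.

```python
import math

def scan_rounds(num_segments, pool_slots):
    """Each segment occupies one processing-pool slot while it is scanned;
    a query over more segments than slots runs in multiple rounds, and each
    extra round adds scheduling overhead. Conversely, fewer segments than
    slots leaves threads idle, reducing parallelism."""
    return math.ceil(num_segments / pool_slots)

# The same ~1 TB of data, packed into segments of two different sizes,
# scanned on a cluster with 64 processing threads total (all illustrative):
slots = 64
for seg_mb in (50, 500):
    segments = 1_000_000 // seg_mb  # ~1 TB expressed in MB
    print(f"{seg_mb} MB segments -> {segments} segments, "
          f"{scan_rounds(segments, slots)} rounds")
```

With 50 MB segments the query cycles through an order of magnitude more scheduling rounds than with 500 MB segments, which is the overhead the suggested paragraph describes.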
* Improve doc for auto compaction
* fix doc
* address comments
I cross-linked the pages for compaction configuration, the Coordinator API, and the description of segment optimization. I also added a section on how to check whether compaction is needed.