Multiphase segment merge for IndexMergerV9 #10689
| // always merge at least two segments regardless of column limit | ||
| if (indexes.size() <= 2) { | ||
| if (getIndexColumnCount(indexes) > maxColumnsToMerge) { | ||
| log.warn("index pair has more columns than maxColumnsToMerge [%d].", maxColumnsToMerge); |
Should this be a warn, since we expect to always merge at least two segments regardless of the column limit? The warning may be misleading, as there is nothing to fix or change.
| log.debug("base outDir: " + outDir); | ||
|
|
||
| try { | ||
| while (true) { |
Is it useful to log the iteration number of this loop?
like how many times have we done a pass so far?
|
|
||
| try { | ||
| while (true) { | ||
| for (List<IndexableAdapter> phase : currentPhases) { |
Is it useful to log the size of currentPhases? It might help to see the progress, as the number should decrease after each pass.
| for (IndexableAdapter index : indexes) { | ||
| int indexColumnCount = getIndexColumnCount(index); | ||
| if (indexColumnCount > maxColumnsToMerge) { | ||
| log.warn("index has more columns [%d] than maxColumnsToMerge [%d]!", indexColumnCount, maxColumnsToMerge); |
Should this be a warn, since this can happen and is an expected / OK thing? The warning may be misleading, as there is nothing to fix or change.
|
Also, an integration test might be useful and easy to add, i.e. an IT that sets the |
|
@jihoonson Do you think this improvement should be called out in the release notes? |
|
@suneet-s yes, I think it's worth mentioning. Thanks for catching it. |
* Multiphase merge for IndexMergerV9
* JSON fix
* Cleanup temp files
* Docs
* Address logging and add IT
* Fix spelling and test unloader datasource name
|
@jon-wei Any reason this was not added to kafka/kinesis? |
This PR introduces a new tuning config parameter, maxColumnsToMerge. It limits how many segments the IndexMerger merges at the same time, based on the number of columns across those segments, in order to limit memory usage during the merge. When the column limit is exceeded across a set of segments, the IndexMerger splits the segments into smaller phases and merges those phases in a tree: the output of each phase becomes an input to the next pass, until a single merged segment remains.
A minimum of 2 segments will be merged at once, regardless of the limit. If there is only 1 segment being merged, the limit does not apply. (A warning is logged in these cases, but merging is allowed to proceed).
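The phase-splitting behavior described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class MultiphaseMergeSketch and the methods splitIntoPhases, mergePhase, and multiphaseMerge are hypothetical names, a segment is modeled as a plain set of column names, and the column count of a phase is assumed to be the sum of its segments' column counts. The per-pass log line mirrors the logging suggested in the review comments.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical, not Druid APIs.
public class MultiphaseMergeSketch
{
  // Groups segments into phases. A phase is closed only once it already holds
  // at least two segments and adding the next one would exceed the limit,
  // so two segments are always merged together regardless of the limit.
  static List<List<Set<String>>> splitIntoPhases(List<Set<String>> segments, int maxColumnsToMerge)
  {
    List<List<Set<String>>> phases = new ArrayList<>();
    List<Set<String>> current = new ArrayList<>();
    int currentColumns = 0; // assumption: sum of per-segment column counts
    for (Set<String> segment : segments) {
      if (current.size() >= 2 && currentColumns + segment.size() > maxColumnsToMerge) {
        phases.add(current);
        current = new ArrayList<>();
        currentColumns = 0;
      }
      current.add(segment);
      currentColumns += segment.size();
    }
    if (!current.isEmpty()) {
      phases.add(current);
    }
    return phases;
  }

  // Stand-in for a real merge: the merged segment holds the union of columns.
  static Set<String> mergePhase(List<Set<String>> phase)
  {
    Set<String> merged = new LinkedHashSet<>();
    for (Set<String> segment : phase) {
      merged.addAll(segment);
    }
    return merged;
  }

  // Merges phases in a tree: each pass turns every phase into one segment,
  // and the outputs feed the next pass until one segment remains.
  static Set<String> multiphaseMerge(List<Set<String>> segments, int maxColumnsToMerge)
  {
    if (segments.isEmpty()) {
      throw new IllegalArgumentException("no segments to merge");
    }
    List<Set<String>> current = segments;
    int pass = 0;
    while (current.size() > 1) {
      List<List<Set<String>>> phases = splitIntoPhases(current, maxColumnsToMerge);
      pass++;
      System.out.printf("merge pass %d: %d phase(s)%n", pass, phases.size());
      List<Set<String>> next = new ArrayList<>();
      for (List<Set<String>> phase : phases) {
        next.add(mergePhase(phase));
      }
      current = next;
    }
    return current.get(0);
  }
}
```

In this sketch, four segments of three columns each with a limit of 6 would be merged as two phases of two segments in the first pass, followed by a final pass over the two intermediate results. Because every phase except possibly the last contains at least two segments, the number of segments shrinks on each pass and the loop terminates.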
Currently, only the native batch and parallel ingest task tuning configs have this new parameter. This PR does not yet add support for it to Kafka/Kinesis ingestion.
This PR has: