Skip to content

Multiphase segment merge for IndexMergerV9#10689

Merged
jon-wei merged 6 commits intoapache:masterfrom
jon-wei:maxcol
Jan 6, 2021
Merged

Multiphase segment merge for IndexMergerV9#10689
jon-wei merged 6 commits intoapache:masterfrom
jon-wei:maxcol

Conversation

@jon-wei
Copy link
Copy Markdown
Contributor

@jon-wei jon-wei commented Dec 17, 2020

This PR introduces a new tuning config parameter, maxColumnsToMerge.

This functions as a limit on how many segments can be merged at the same time by the IndexMerger, to limit memory usage during the merge. When the column limit is exceeded across a set of segments, the IndexMerger will break the segments to be merged into smaller phases, and merge the smaller phases in a tree.

A minimum of 2 segments will be merged at once, regardless of the limit. If there is only 1 segment being merged, the limit does not apply. (A warning is logged in these cases, but merging is allowed to proceed).

Currently only the native batch and parallel ingest task tuning config have this new parameter added, this PR does not add support for it to Kafka/Kinesis ingestion yet.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

// always merge at least two segments regardless of column limit
if (indexes.size() <= 2) {
if (getIndexColumnCount(indexes) > maxColumnsToMerge) {
log.warn("index pair has more columns than maxColumnsToMerge [%d].", maxColumnsToMerge);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warn since we expected to always merge at least two segments regardless of column limit? The warning may be misleading as there is nothing to fix / change

log.debug("base outDir: " + outDir);

try {
while (true) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to log the iteration number of this loop?
like how many times have we done a pass so far?


try {
while (true) {
for (List<IndexableAdapter> phase : currentPhases) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it useful to log the size of currentPhases? It might help to see the progress as the number should decrease after each pass

for (IndexableAdapter index : indexes) {
int indexColumnCount = getIndexColumnCount(index);
if (indexColumnCount > maxColumnsToMerge) {
log.warn("index has more columns [%d] than maxColumnsToMerge [%d]!", indexColumnCount, maxColumnsToMerge);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warn since this can happen and is a expected / ok thing? The warning may be misleading as there is nothing to fix / change

@maytasm
Copy link
Copy Markdown
Contributor

maytasm commented Dec 21, 2020

Also integration test might be useful and easy to add. i.e. a IT that sets the maxColumnsToMerge to some > 0 number and make sure that the ingestion succeeded and run a test query to make sure data is correct. This can builds on existing IT to reuse the raw data and query

@jon-wei jon-wei merged commit 68bb038 into apache:master Jan 6, 2021
gianm added a commit to gianm/druid that referenced this pull request Jan 8, 2021
gianm added a commit that referenced this pull request Jan 8, 2021
@suneet-s
Copy link
Copy Markdown
Contributor

@jihoonson Do you think this Improvement should be called out in the release notes

@jihoonson jihoonson added this to the 0.21.0 milestone Jan 15, 2021
@jihoonson
Copy link
Copy Markdown
Contributor

@suneet-s yes, I think it's worth mentioning. Thanks for catching it.

JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* Multiphase merge for IndexMergerV9

* JSON fix

* Cleanup temp files

* Docs

* Address logging and add IT

* Fix spelling and test unloader datasource name
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
@maytasm
Copy link
Copy Markdown
Contributor

maytasm commented Nov 1, 2023

@jon-wei Any reason this was not added to kafka/kinesis?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants