release mmap immediately after merge indexes #6699
kaijianding wants to merge 3 commits into apache:master from
Conversation
```java
  return indexIO.loadIndex(input);
}
catch (IOException e) {
  Throwables.propagate(e);
```
Throwables.propagate() must not be used without throw before it. See https://github.com/google/guava/wiki/Why-we-deprecated-Throwables.propagate#propagate-is-magic. This method must not be used in new code. (Note: there are already 39 occurrences of this bug in Druid, #6701)
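The fix the linked wiki page recommends is to rethrow explicitly rather than call Throwables.propagate() bare. A minimal self-contained sketch of the pattern; loadIndex here is a hypothetical stand-in, not Druid's real IndexIO.loadIndex():

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class LoadIndexExample {
  // Hypothetical stand-in for indexIO.loadIndex(input); not the real Druid API.
  static String loadIndex(String input) throws IOException {
    if (input == null) {
      throw new IOException("no input");
    }
    return "index:" + input;
  }

  // The diff calls Throwables.propagate(e) without `throw`, which does not
  // rethrow anything from the caller's point of view. The recommended
  // replacement is to wrap and rethrow the checked exception explicitly.
  static String load(String input) {
    try {
      return loadIndex(input);
    }
    catch (IOException e) {
      // Instead of Throwables.propagate(e):
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(load("segment-dir"));
  }
}
```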
clintropolis left a comment
Hi @kaijianding, could you please explain a bit more clearly how this helps to resolve the issue you've encountered? It isn't obvious to me from the description or added code how this change would help; all that is clear is that it puts a (probably insignificant) bit more pressure onto the heap during merge time by having a 2nd set of QueryableIndex objects. Thinking about what is going on here, the only thing I can come up with is that this patch would ensure that any 64k decompression buffers allocated to read the segment columns during a merge would be surrendered back to the compression pool after the merge is complete, if they aren't already (I'm not certain of this detail, investigating). When your process got killed by the OOM killer, did you have a chance to look at the size of the direct buffer objects in a dump, as well as the composition in general?
Overall though, I don't think the amount of direct memory usage required would change significantly (unless the buffers from the compression pool are in fact held until handoff), since the mechanics of how the merge is done are not really modified. I also don't believe the mmap footprint would be impacted in any way, afaik, because the OS would share the same physical memory locations for mapped pages that are in use by both query and merge, and query would still need those segments until handoff is completed to be able to keep serving the data. This part shouldn't be playing a role directly in the OOM killer anyway, though thrashing on disk, if free space grows too small from ballooning heap or direct memory usage, could be slowing things down overall, causing further pressure on heap or direct memory as things pile up.
Thanks!
@clintropolis here is how things happen.
Direct memory is not a problem here; direct memory is released after each merge success/fail.
The added code in this PR closes the QueryableIndex after each merge success/fail to explicitly release mmap and other side-effect objects. These QueryableIndexes are different objects from the ones used by queries, so it's safe.
```java
  @Nullable SegmentWriteOutMediumFactory segmentWriteOutMediumFactory
) throws IOException;

File mergeSegmentFiles(
```
Please add javadoc with the rationale for this method, and the difference from the most closely related one, mergeQueryableIndex().
```java
  return mergeQueryableIndex(indexes, rollup, metricAggs, outDir, indexSpec, new BaseProgressIndicator(), segmentWriteOutMediumFactory);
}
finally {
  for (QueryableIndex index : indexes) {
```
Use Closer to close many objects.
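For reference, the pattern Guava's Closer enables looks roughly like the sketch below. MiniCloser is a simplified stand-in for com.google.common.io.Closer so the example is self-contained, and FakeIndex stands in for a real mmapped QueryableIndex; neither is a real Druid or Guava class:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CloserExample {
  // Simplified stand-in for Guava's com.google.common.io.Closer: resources are
  // registered as they are created and all closed together, keeping the first
  // failure and suppressing the rest.
  static final class MiniCloser implements Closeable {
    private final Deque<Closeable> stack = new ArrayDeque<>();

    <C extends Closeable> C register(C closeable) {
      stack.push(closeable);
      return closeable;
    }

    @Override
    public void close() throws IOException {
      IOException first = null;
      while (!stack.isEmpty()) {
        try {
          stack.pop().close();
        }
        catch (IOException e) {
          if (first == null) { first = e; } else { first.addSuppressed(e); }
        }
      }
      if (first != null) { throw first; }
    }
  }

  // Stand-in for a mmapped QueryableIndex.
  static final class FakeIndex implements Closeable {
    boolean closed;
    @Override public void close() { closed = true; }
  }

  // Register every index with the closer, then close them all in one finally
  // block instead of writing a manual for-loop over the list.
  static List<FakeIndex> loadMergeAndRelease() throws IOException {
    MiniCloser closer = new MiniCloser();
    List<FakeIndex> indexes = new ArrayList<>();
    try {
      for (int i = 0; i < 3; i++) {
        indexes.add(closer.register(new FakeIndex()));
      }
      // ... merge would happen here ...
    }
    finally {
      closer.close();
    }
    return indexes;
  }
}
```

The real Closer additionally offers rethrow() for propagating exceptions from the try block; the sketch only shows the register/close lifecycle the review comment refers to.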
```java
mergedFile = indexMerger.mergeQueryableIndex(
    indexes,

File[] hydrantDirs = persistDir.listFiles(
```
listFiles() is an unsafe API, it may return null. Use Files.list() instead. Due to this, it's also more convenient to change the new IndexMerger's new method parameter type to List<Path>.
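A sketch of the suggested replacement, assuming the goal is to collect the hydrant directories while skipping the "merged" dir seen in the diff. Files.list() throws an IOException instead of returning null, so the failure case can't be silently ignored; hydrantDirs is a hypothetical helper name:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListDirExample {
  // File.listFiles() returns null on I/O error or when the path is not a
  // directory; Files.list() throws a checked IOException instead, forcing the
  // caller to handle the failure. The stream must be closed, hence try-with-resources.
  static List<Path> hydrantDirs(Path persistDir) throws IOException {
    try (Stream<Path> entries = Files.list(persistDir)) {
      return entries
          .filter(Files::isDirectory)
          .filter(p -> !p.getFileName().toString().equals("merged"))
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("persist");
    Files.createDirectory(tmp.resolve("0"));
    Files.createDirectory(tmp.resolve("merged"));
    System.out.println(hydrantDirs(tmp)); // only the "0" hydrant dir remains
  }
}
```

Returning List<Path> here also matches the review's suggestion to change the new IndexMerger method's parameter type to List<Path>.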
```java
@Override
public boolean accept(File dir, String fileName)
{
  // To avoid reading and listing of "merged" dir
```
The comment is unclear, please elaborate
"hydrantFiles" maybe? or those files are actually directories? Then IndexMerger.mergeSegmentFiles() should be called mergeSegmentDirs().
```java
mergedFile = indexMerger.mergeQueryableIndex(
    indexes,

File[] hydrantDirs = persistDir.listFiles(
    new FilenameFilter()
```
Consider extracting the repeated fragment as a method
```java
}

mergedFile = indexMerger.mergeQueryableIndex(
    indexes,
```
@kaijianding I agree with @clintropolis, maybe this PR is incomplete and you planned to free those indexes? In its current form the PR doesn't free any extra resources that were (probably unnecessarily) held before by the worker process.
The QueryableIndexes are always new objects loaded from file, and closed when IndexMerger.mergeSegmentFiles() is done. This behavior frees the mmap usage that grows during the merge process.
Currently we use indexMerger.mergeQueryableIndex(indexes) in RealtimePlumber; the indexes are not closed to explicitly release the increased mmap because they are still used by queries. So the mmap usage always increases until abandonSegment() is called, which only happens when handoff succeeds.
If handoff is slow or the coordinator is not working properly, the mmap usage will keep increasing, which is a problem.
This PR separates the QueryableIndexes used by queries from the ones used by the merge process, so we can close the QueryableIndexes used by the merge to release mmap, and leave the QueryableIndexes used by queries untouched.
This PR is verified in my production environment; it indeed controls the mmap usage.
Again, it is beyond me how just creating (and later closing) some new objects, without changing how any other objects are created, could improve anything, unless some code around here is lazy, and that doesn't seem to be the case.
If this is about avoiding refreshing some memory-mapped files in memory (although I don't see the mechanism for how it helps either), at the very least the surrounding try {} block should be refactored, because currently it couples the creation of the indexes list and mergeSegmentFiles() for no apparent reason.
Will address these comments.
The mechanism is like this:
- mmap usage will increase after indexes are merged. mmap can only be released when close() is called on the indexes.
- currently, close() on the indexes is not called until abandonSegment(), which can be delayed for a very long time if handoff is slow. Before handoff succeeds, the mmap will keep increasing.
- to avoid this increase, I create new QueryableIndex objects and close them after each merge success/fail, so the mmap added during the merge process is released.
- I think this works because even if a file is mmapped multiple times, the mmap usage is accounted separately for each mmap action, and can be un-mmapped separately by each un-mmap action. In this case, close() on a QueryableIndex is the un-mmap action.
I hope this explains the mechanism clearly. Though this PR is very simple, it indeed helps with the mmap usage.
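The load-then-close pattern described above can be sketched as follows. QueryableIndex, loadIndex(), and the openMappings counter are simplified stand-ins for Druid's real classes and for the OS's per-mapping accounting, not the actual implementation:

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeAndReleaseExample {
  // Hypothetical stand-in for Druid's QueryableIndex; the real class mmaps
  // segment files and close() unmaps them.
  interface QueryableIndex extends Closeable {}

  static final class MmapIndex implements QueryableIndex {
    static int openMappings = 0;            // crude stand-in for mmap footprint
    MmapIndex() { openMappings++; }         // "mmap" on load
    @Override public void close() { openMappings--; }  // "munmap" on close
  }

  static QueryableIndex loadIndex(File dir) {
    return new MmapIndex();                 // pretend to mmap the segment dir
  }

  // The pattern the PR describes: load a *fresh* set of indexes just for the
  // merge (the query path keeps its own copies), and close them in a finally
  // block so the mmap added by the merge is released on success or failure.
  static void mergeSegmentDirs(List<File> hydrantDirs) throws IOException {
    List<QueryableIndex> indexes = new ArrayList<>();
    try {
      for (File dir : hydrantDirs) {
        indexes.add(loadIndex(dir));
      }
      // ... merge indexes into one segment file here ...
    }
    finally {
      for (QueryableIndex index : indexes) {
        index.close();  // each mapping is released independently of the query path's mappings
      }
    }
  }

  public static void main(String[] args) throws IOException {
    mergeSegmentDirs(Arrays.asList(new File("h0"), new File("h1")));
    System.out.println("open mappings after merge: " + MmapIndex.openMappings);
  }
}
```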
There is an alternative solution to recycle mmap: load the mergedFile as a QueryableIndex, swap all the hydrants' small QueryableIndexes in the Sink for this single big QueryableIndex, then close all the hydrants' small QueryableIndexes and delete all the hydrant segments to release mmap (the mmap usage has already increased by the time the merge process is done).
We are on the same page now; your understanding is totally correct.
In the worst case, where the complete set of data is queried (all rows/all columns), the footprint is exactly the same as the merge required in the first place.
But usually (as I noticed in my production environment), very little mmap is added by queries; many columns are ingested but not queried at all. That part of mmap is not a problem. And if we do the final swap after the merge, that part of mmap will decrease, because the new merged segment has no mmap footprint yet. In most cases users tend to query the latest data, so when handoff is slow, it's better to do the final swap to reduce that part of mmap for the earlier hydrant segments.
Back to this PR: it releases mmap after the merge is done, so the mmap usage is under control and won't grow to a huge number when handoff is slow, which would otherwise let the process be killed by YARN.
Cool, glad we could sort out what is going on, and apologies it took so long for me to understand what the point was 👍
Since this doesn't really seem to particularly have any negative consequences, and since it is doing something useful in some cases at least, I'll have another look at this PR.
Also in jdk 10+, I believe it will be possible to open files with O_DIRECT, which is probably what we really want here if we want merge to be done out of the query path and not have a significant impact on page cache usage, though it would potentially lower merge performance. I think it would be worth putting a note and adding this link so our future selves remember to consider if we ever make it out of java 8.
@clintropolis it's possible to use O_DIRECT already: obtain a raw pointer from mmap() and wrap it as https://github.com/DataSketches/memory.
So my understanding is that this is useful in the case when, on a realtime process, a very small subset of the columns/data in the intermediate persisted segments is being read by queries.
Some commentary in the code along those lines would be great.
Overall, this patch looks useful to me.
"Swap after merge" is further useful to reduce the inode count, but the swap would be tricky to implement as there might be queries underway on the indexes just merged.
I think O_DIRECT is a whole other beast and would require significant performance regression testing in this case.
Hmm, this got marked stale, but I also think it would be ok to merge if @leventov's comments were addressed and some additional javadocs explaining what was going on and linking to this discussion were added. Do you have any interest in finishing this @kaijianding?
Apologies that it got stalled in review for so long.
I'm untagging milestone since this issue is not necessarily a release blocker. Feel free to let me know if you think this should be.

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

This pull request/issue is no longer marked as stale.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
When the coordinator is down or loading segments slowly, or a realtime node populates segments too fast (like pulling data from a long time ago), there will be many merged files, and many hydrant segments will be loaded as mmap during the merge process. There will also be many intermediate objects, like DirectByteBufferR, while these hydrant segments' QueryableIndexes can't be unloaded immediately after the merge process because they may be used by a query.
As a result, both the mmap and heap size grow to a large number, and the process is finally killed by the container (YARN in my case) or by OOM (this indeed happened in my production environment).
A better approach is to separate the query and merge processes: always load indexes from the hydrant files, and close the loaded indexes after each merge success/fail to release mmap and intermediate objects.