KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause#2352
kawamuray wants to merge 1 commit into apache:trunk
apurvam left a comment
First of all, thanks for the amazing bug report. I learnt a lot reading through it. The hypothesis, and the experiment validating it are very convincing.
Trying this patch on your production servers further shows that you have a solution.
Regarding the lack of portability: it would be good to know which platforms do not have support for sun.misc.Cleaner. Perhaps somebody else reading this would know a way to find this out.
Additionally, accessing the mmapped file after it has been unmapped and nullified will now result in a NullPointerException, which would not have occurred previously.
For the latter, in addition to your audit, I think passing unit, integration, and system tests would validate that there is no additional risk. The unit and integration tests will run with this PR. I kicked off a system test against your branch:
http://jenkins.confluent.io/job/system-test-kafka-branch-builder/666/
If those results come back positive, I think this patch is safe to merge, especially in the absence of a more direct way to unmap files in Java.
|
I think the high-level per-segment locking should be sufficient from preventing the same object called its |
|
@apurvam, to answer your question, this class may not be accessible in Java 9: |
|
@kawamuray, to make it work with Java 9, something like the following would be needed: |
|
After thinking about it some more, we should stick with the simple solution in this PR for 0.10.2.0. I added a comment about the more complex solution required for Java 9 to https://issues.apache.org/jira/browse/KAFKA-4501. |
|
@apurvam Thanks for helping to run the system tests :) @guozhangwang I agree that segment-level locking would be enough. However, since the consequences of accessing an unmapped object are severe, I tried to be defensive here. It is still not sufficient to prevent all unexpected access to an unmapped mmap object, but at least it guarantees that unmapping is atomic and cuts off further access through the mmap field after it has been invalidated. @ijuma I also came across that Lucene issue while I was looking for a replacement for APIs that are likely to be deprecated. |
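The defensive pattern described in this comment can be sketched roughly as follows. This is an illustrative Java sketch, not the code in the patch (which is Scala): the class `GuardedIndex`, its methods, and the `release` helper are hypothetical names. It assumes the goal is to unmap and null the field under one lock, so that any later access fails with a NullPointerException instead of touching unmapped memory. The `release` helper is best effort and assumes Java 9+ exposes `sun.misc.Unsafe.invokeCleaner`; on other runtimes it silently leaves the mapping to the GC.

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class for illustration; not the OffsetIndex from the patch.
final class GuardedIndex implements AutoCloseable {
    private final ReentrantLock lock = new ReentrantLock();
    private MappedByteBuffer mmap; // guarded by lock; null once unmapped

    GuardedIndex(Path file, int size) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            mmap = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }

    void writeInt(int position, int value) {
        lock.lock();
        try {
            mmap.putInt(position, value); // NullPointerException after close()
        } finally {
            lock.unlock();
        }
    }

    int readInt(int position) {
        lock.lock();
        try {
            return mmap.getInt(position); // NullPointerException after close()
        } finally {
            lock.unlock();
        }
    }

    @Override
    public void close() {
        lock.lock();
        try {
            MappedByteBuffer toRelease = mmap;
            mmap = null; // cut off further access before releasing the mapping
            if (toRelease != null) {
                release(toRelease);
            }
        } finally {
            lock.unlock();
        }
    }

    // Best-effort eager unmap (assumption: Java 9+ with sun.misc.Unsafe available).
    private static void release(MappedByteBuffer buffer) {
        try {
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            Field field = unsafeClass.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            Method invokeCleaner = unsafeClass.getMethod("invokeCleaner", ByteBuffer.class);
            invokeCleaner.invoke(field.get(null), buffer);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // No supported mechanism found: the mapping is released when the GC collects it.
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("guarded-index", ".index");
        GuardedIndex index = new GuardedIndex(file, 64);
        index.writeInt(0, 7);
        System.out.println("value read before close: " + index.readInt(0));
        index.close();
        Files.deleteIfExists(file);
    }
}
```

As the comment notes, this does not protect a caller that obtained its own reference to the buffer before `close()`; it only guards access that goes through the field.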
|
@kawamuray unfortunately the detailed logs for the previous run were not uploaded because the upload job failed. I kicked off another system test: http://jenkins.confluent.io/job/system-test-kafka-branch-builder/674/ Once it is done, you can check the logs yourself at http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/ There will be a link with your github username and your branch name that takes you to the test results for that run. From there, you can click through to the detailed logs of the tests which failed. |
|
Thanks @apurvam. |
apurvam left a comment
The system test passed. This looks good to me. Thanks for the patch!
|
@kawamuray : Thanks for the thorough investigation and the fix. LGTM |
…ause
Issue: https://issues.apache.org/jira/browse/KAFKA-4614
Fixes the problem of broker threads suffering from long GC pauses.
When the GC thread collects mmap objects that were created for index files, it unmaps the memory mapping, so the kernel goes on to physically delete the file. This work may transparently read the file's metadata from the physical disk if it is not available in the cache.
This seems to happen typically when using G1GC, due to its strategy of leaving garbage uncollected for a long time if other objects in the same region are still alive.
See the link for the details.
Author: Yuto Kawamura <kawamuray.dadada@gmail.com>
Reviewers: Apurva Mehta <apurva.1618@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2352 from kawamuray/KAFKA-4614-force-munmap-for-index
(cherry picked from commit 5fc530b)
Signed-off-by: Jun Rao <junrao@gmail.com>
Issue: https://issues.apache.org/jira/browse/KAFKA-4614
Fixes the problem of broker threads suffering from long GC pauses.
When the GC thread collects mmap objects that were created for index files, it unmaps the memory mapping, so the kernel goes on to physically delete the file. This work may transparently read the file's metadata from the physical disk if it is not available in the cache.
This seems to happen typically when using G1GC, due to its strategy of leaving garbage uncollected for a long time if other objects in the same region are still alive.
See the link for the details.
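To illustrate the idea of unmapping eagerly instead of leaving it to the GC, here is a minimal Java sketch. It is not the code in this PR (which is Scala and uses `sun.misc.Cleaner`): the class `ForcedUnmap` and the method `tryUnmap` are hypothetical names. The sketch assumes `sun.misc.Unsafe.invokeCleaner` is reachable by reflection on Java 9+ and that `DirectByteBuffer.cleaner()` is reachable on Java 8. Both are unsupported internals and may be removed in future JDKs.

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only.
final class ForcedUnmap {

    // Tries to release the mapping immediately. Returns false if no known
    // mechanism worked, in which case the mapping is left for the GC.
    // The buffer must never be accessed again after a successful call.
    static boolean tryUnmap(MappedByteBuffer buffer) {
        try {
            // Assumed Java 9+ path: sun.misc.Unsafe.invokeCleaner(ByteBuffer).
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            Field field = unsafeClass.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            Method invokeCleaner = unsafeClass.getMethod("invokeCleaner", ByteBuffer.class);
            invokeCleaner.invoke(field.get(null), buffer);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Not available on this runtime; try the Java 8 path below.
        }
        try {
            // Assumed Java 8 path: the direct buffer's cleaner().clean().
            Method cleanerMethod = buffer.getClass().getMethod("cleaner");
            cleanerMethod.setAccessible(true);
            Object cleaner = cleanerMethod.invoke(buffer);
            Method clean = cleaner.getClass().getMethod("clean");
            clean.setAccessible(true);
            clean.invoke(cleaner);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("offset-index", ".index");
        MappedByteBuffer mmap;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            mmap = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
        }
        mmap.putInt(0, 42);
        // Unmap on the calling thread before deleting the file, so the GC
        // thread is not the one that triggers the final unmapping later.
        boolean unmapped = tryUnmap(mmap);
        mmap = null; // drop the reference so it cannot be used by accident
        Files.deleteIfExists(file);
        System.out.println("eagerly unmapped: " + unmapped);
    }
}
```

The caller nulls its reference right after unmapping because touching a forcibly unmapped buffer can crash the JVM, which is the risk discussed in the review above.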