Skip to content

KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause#2352

Closed
kawamuray wants to merge 1 commit intoapache:trunkfrom
kawamuray:KAFKA-4614-force-munmap-for-index
Closed

KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause#2352
kawamuray wants to merge 1 commit intoapache:trunkfrom
kawamuray:KAFKA-4614-force-munmap-for-index

Conversation

@kawamuray
Copy link
Copy Markdown
Contributor

Issue: https://issues.apache.org/jira/browse/KAFKA-4614

Fixes the problem that the broker threads suffered by long GC pause.
When GC thread collects mmap objects which were created for index files, it unmaps memory mapping so kernel turns to delete a file physically. This work may transparently read file's metadata from physical disk if it's not available on cache.
This seems to happen typically when we're using G1GC, due to it's strategy to left a garbage for a long time if other objects in the same region are still alive.
See the link for the details.

@asfbot
Copy link
Copy Markdown

asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/789/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/787/
Test PASSed (JDK 7 and Scala 2.10).

@kawamuray kawamuray force-pushed the KAFKA-4614-force-munmap-for-index branch from 8c3a4c9 to 8c9aba2 Compare January 12, 2017 14:04
@asfbot
Copy link
Copy Markdown

asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/792/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/790/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot
Copy link
Copy Markdown

asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/790/
Test FAILed (JDK 8 and Scala 2.12).

Copy link
Copy Markdown
Contributor

@apurvam apurvam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks for the amazing bug report. I learnt a lot reading through it. The hypothesis, and the experiment validating it are very convincing.

Trying this patch on your production servers further shows that you have a solution.

Regarding the lack of portability: it would be good to know which platforms do not have support for sun.misc.Cleaner. Perhaps somebody else reading this would know a way to find this out.

Additionally, the risk of accessing the mmmapped file after it has been unmapped and nullified will now result in a NullPointerException which would not have occurred previously.

For the latter, in addition to your audit, I think passing unit, integration, and system tests would validate that there is no additional risk. The unit and integration tests will run with this PR. I kicked of a system test against your branch:

http://jenkins.confluent.io/job/system-test-kafka-branch-builder/666/

If those results come back positive, I think this patch is safe to merge, especially in the absence of a more direct way to unmap files in Java.

@guozhangwang
Copy link
Copy Markdown
Contributor

I think the high-level per-segment locking should be sufficient from preventing the same object called its delete() function while getting concurrent access, but it worth double check.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jan 14, 2017

@apurvam, to answer your question, this class may not be accessible in Java 9:

http://openjdk.java.net/jeps/260

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jan 16, 2017

@kawamuray, to make it work with Java 9, something like the following would be needed:

apache/lucene-solr@7e03427

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jan 16, 2017

After thinking about it some more, we should stick with the simple solution in this PR for 0.10.2.0. I added a comment about the more complex solution required for Java 9 to https://issues.apache.org/jira/browse/KAFKA-4501.

@kawamuray
Copy link
Copy Markdown
Contributor Author

@apurvam Thanks for helping to run system tests :)
I could see some failures but looks like they are not related to this change.
However, just to make sure whether these failure are just flaky ones or not, can you re-run a test one more time?
And/or can you provide a link to more detailed logs like what Json did to me before: #1707 (comment) , in order to get more insights through reading all logs please?

@guozhangwang I agree that segment level locking would be enough. However, since the consequence of accessing unmapped object is seriously bad, I tried to be defensive here. It is still not sufficient to prevent all unexpected access to unmapped mmap object, but least it guarantees atomicity on unmapping and cutting further access through mmap field after it has invalidated.
Lock contention should never happen if everything goes well, since it happens in inside .delete() methods which is the last operation performed at the end of index's life.

@ijuma I also caught that lucene's issue while I was looking for a replacement for likely going to be deprecated APIs.
Are we already started to adjusting some parts of Kafka code to deal with breaking changes in Java9? If not, I agree that KAFKA-4501 is a proper place to leave a note about this and deal with it once we started working on code adjustment for Java9. I can contribute follow-up patch for this, once KAFKA-4501 started.

@apurvam
Copy link
Copy Markdown
Contributor

apurvam commented Jan 17, 2017

@kawamuray unfortunately the detailed logs for the previous run were not uploaded because the upload job failed. I kicked off another system test:

http://jenkins.confluent.io/job/system-test-kafka-branch-builder/674/

Once it is done, you can check the logs yourself at http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/

There will be a link with your github username and your branch name that takes you to the test results for that run. From there, you can click through to the detailed logs of the tests which failed.

@kawamuray
Copy link
Copy Markdown
Contributor Author

Thanks @apurvam.
I see two tests of streams are failing but they are not likely related to this change, and I see these two tests are failing in different execution as well so I think we can ignore them.

Copy link
Copy Markdown
Contributor

@apurvam apurvam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The system test passed. This looks good to me. Thanks for the patch!

@junrao
Copy link
Copy Markdown
Contributor

junrao commented Jan 19, 2017

@kawamuray : Thanks for the thorough investigation and the fix. LGTM

asfgit pushed a commit that referenced this pull request Jan 19, 2017
…ause

Issue: https://issues.apache.org/jira/browse/KAFKA-4614

Fixes the problem that the broker threads suffered by long GC pause.
When GC thread collects mmap objects which were created for index files, it unmaps memory mapping so kernel turns to delete a file physically. This work may transparently read file's metadata from physical disk if it's not available on cache.
This seems to happen typically when we're using G1GC, due to it's strategy to left a garbage for a long time if other objects in the same region are still alive.
See the link for the details.

Author: Yuto Kawamura <kawamuray.dadada@gmail.com>

Reviewers: Apurva Mehta <apurva.1618@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>,

Closes #2352 from kawamuray/KAFKA-4614-force-munmap-for-index

(cherry picked from commit 5fc530b)
Signed-off-by: Jun Rao <junrao@gmail.com>
@asfgit asfgit closed this in 5fc530b Jan 19, 2017
soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
…ause

Issue: https://issues.apache.org/jira/browse/KAFKA-4614

Fixes the problem that the broker threads suffered by long GC pause.
When GC thread collects mmap objects which were created for index files, it unmaps memory mapping so kernel turns to delete a file physically. This work may transparently read file's metadata from physical disk if it's not available on cache.
This seems to happen typically when we're using G1GC, due to it's strategy to left a garbage for a long time if other objects in the same region are still alive.
See the link for the details.

Author: Yuto Kawamura <kawamuray.dadada@gmail.com>

Reviewers: Apurva Mehta <apurva.1618@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>,

Closes apache#2352 from kawamuray/KAFKA-4614-force-munmap-for-index
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants