[improve][deploy] changed the default GC options to ZGC from G1GC #15762
Some tests seem to be hitting https://bugs.openjdk.java.net/browse/JDK-8257534 with ZGC: "It may be that the initialization failed due to OOME but is being reported as a NoClassDefFoundError." (comment link) @nicoloboschi has been fixing some memory leaks in the tests. There are open PRs #15513 and #15638.
Would you please describe the OpenMessaging Benchmark setup you used. Driver files, workload, and |
I also see the above OOM in one of the attempts (example failure). G1GC appears to be more stable for running the Pulsar unit tests. Should we use G1GC for the unit tests?
Hi dave2wave, basically, I used the following for the test. I hit some trivial setup failures with the current OpenMessaging Benchmark tool, so I made some local changes to get it working. I probably need to raise a PR to update the OpenMessaging Benchmark tool.
@heesung-sn It should be fine to use ZGC for unit tests as well. A bad test has been causing those issues. It will be fixed by #15911.
We also need to review PR #15692, as ZGC will expose this JVM metrics bug.
I see the following test failures. I am trying to debug them in my local environment (heesung-sohn#3). Test failure logs:
Raised a PR to fix the test issues: #16011. With that test fix, this ZGC update passed the CI tests (run from my local repo): https://github.com/heesung-sn/pulsar/runs/6834730213
hi @heesung-sn, when add the |
As I explained in the above, this I think this SO discussion is a good reference too. We can collect performance counts by |
Master Issue: #15207
Motivation
Pulsar Server Default GC Update
As Java 17 will be officially required for pulsar-2.11+, it is worth revisiting Pulsar's default GC configuration
and considering a newer GC, ZGC or ShenandoahGC, as the new default.
ZGC:
Introductory articles on ZGC are easy to find [1][2][3]. I personally found the following passage persuasive.
“The primary goals of ZGC are low latency, scalability, and ease of use. To achieve this, ZGC allows a Java application to continue running while it performs all garbage collection operations except thread stack scanning. It scales from a few hundred MB to TB-size Java heaps, while consistently maintaining very low pause times—typically within 2 ms.
The implications of predictably low pause times could be profound for both application developers and system architects. Developers will no longer need to worry about designing elaborate ways to avoid garbage collection pauses. And system architects will not require specialized GC performance tuning expertise to achieve the dependably low pause times that are very important for so many use cases. This makes ZGC a good fit for applications that require large amounts of memory, such as with big data. However, ZGC is also a good candidate for smaller heaps that require predictable and extremely low pause times.”[3]
The fewer settings, the better
One might tune the G1GC flags further to outperform ZGC, but our goal is for the default GC to perform well enough to cover general use cases; users should rarely need to tune GC flags themselves. ZGC is promoted as requiring less tuning.
ShenandoahGC:
ShenandoahGC has a design similar to ZGC's and also targets low pause times. However, because ShenandoahGC is not officially supported by Oracle, it is unavailable in Oracle-built OpenJDK distributions [4][5]. Hence, between ShenandoahGC and ZGC, Pulsar should probably choose the more widely available option, ZGC, also with future support in mind.
Individual Pulsar users can still override this default GC, depending on their use case and OpenJDK version.
[1] https://wiki.openjdk.java.net/display/zgc/Main
[2] https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector
[3] https://blogs.oracle.com/javamagazine/post/understanding-the-jdks-new-superfast-garbage-collectors
[4] https://developers.redhat.com/blog/2019/04/19/not-all-openjdk-12-builds-include-shenandoah-heres-why
[5] https://bugs.openjdk.java.net/browse/JDK-8215030
Performance Tests:
To confirm the performance benefits, we ran the OpenMessaging Benchmark.
In this test, we skipped journaling to put more pressure on the JVM GC.
Max Throughput Test
Latency Test
Test Result Analysis
ZGC performs well
In the Max Throughput Test, ZGC performed well, keeping the lowest backlog (30k on average)
while maintaining an average throughput of 1711 MB/s.
In the Latency Test, although the latency difference is not very significant,
ZGC showed the lowest p9999 publish latency, 20.2 ms.
Modifications
Pulsar Default GC Flag Update Proposal
Before:
https://github.com/apache/pulsar/blob/master/conf/pulsar_env.sh#L48
After:
Update Details
Replace -XX:+UseG1GC with -XX:+UseZGC
Remove -XX:MaxGCPauseMillis=10
Remove -XX:+ParallelRefProcEnabled
Remove -XX:+UnlockExperimentalVMOptions
Remove -XX:+DoEscapeAnalysis
Remove -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32
Remove -XX:G1NewSizePercent=50
Remove -XX:+DisableExplicitGC
Add -XX:+PerfDisableSharedMem
Add -XX:+AlwaysPreTouch
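Taken together, the list above leaves only three flags in the default. A minimal sketch of the resulting line, assuming conf/pulsar_env.sh sets the default through a PULSAR_GC variable with the shell's ${VAR:-default} expansion (the exact line in this PR may differ); -XX:+UseZGC is the HotSpot flag that selects ZGC:

```shell
# Sketch of the proposed default GC flags (assumed variable name: PULSAR_GC).
# ${PULSAR_GC:-...} only applies the default when PULSAR_GC is unset or empty,
# so a value exported by the operator beforehand is left untouched.
PULSAR_GC=${PULSAR_GC:-"-XX:+UseZGC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch"}

# Print the effective flags so the result is easy to check.
echo "GC flags: ${PULSAR_GC}"
```

With this pattern, a user who prefers the previous collector could export their own value before starting the broker, for example `export PULSAR_GC="-XX:+UseG1GC -XX:MaxGCPauseMillis=10"`, and the default would not be applied.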
Verifying this change
This change is already covered by existing tests, such as the full CI suite.
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
Documentation
Check the box below or label this PR directly.
Need to update docs?
doc-required (Your PR needs to update docs and you will update later)
no-need-doc (This is Pulsar's internal default GC setting change, but we probably need to mention this in the release note.)
doc (Your PR contains doc changes)
doc-added (Docs have been already added)