[enhancement](memory) Jemalloc performance optimization and compatibility with MemTracker #12496
Both ClickHouse and PingCAP have switched to jemalloc.
I will look at more references later.
yiguolei
left a comment
LGTM
export LIBHDFS3_CONF="${DORIS_HOME}/conf/hdfs-site.xml"

export MALLOC_CONF="percpu_arena:percpu,background_thread:true,metadata_thp:auto,muzzy_decay_ms:30000,dirty_decay_ms:30000,oversize_threshold:0,lg_tcache_max:16"
Could we set the thread cache size to 1 MB? Doris may run around 1000 threads, so the total cache would be about 1 GB, which I think is acceptable.
Also, the decay time looks too long. Would 30 s be enough?
muzzy_decay_ms and dirty_decay_ms are already set to 30 s (30000 ms).
lg_tcache_max is not the size of the thread cache. It is the base-2 logarithm of the largest size class (memory block) that the thread cache will hold. The default is 32K, and the overall size of the thread cache is controlled by jemalloc itself.
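To make the reply above concrete, here is a minimal Python sketch that parses the `MALLOC_CONF` string from the diff and converts `lg_tcache_max` and the decay settings into human-readable units. The helper name `parse_malloc_conf` is hypothetical and exists only for this illustration; it is not part of jemalloc or Doris.

```python
# Option string copied from the diff in this PR.
MALLOC_CONF = (
    "percpu_arena:percpu,background_thread:true,metadata_thp:auto,"
    "muzzy_decay_ms:30000,dirty_decay_ms:30000,oversize_threshold:0,"
    "lg_tcache_max:16"
)


def parse_malloc_conf(conf: str) -> dict:
    """Hypothetical helper: split 'key:value,key:value' into a dict of strings."""
    options = {}
    for item in conf.split(","):
        key, _, value = item.partition(":")
        options[key.strip()] = value.strip()
    return options


options = parse_malloc_conf(MALLOC_CONF)

# lg_tcache_max is a base-2 logarithm, so 16 means size classes up to 2**16 bytes
# may be cached per thread. It is a per-object limit, not the total cache size.
tcache_max_bytes = 1 << int(options["lg_tcache_max"])

# The default mentioned in the reply (32K) corresponds to 2**15 bytes.
default_tcache_max_bytes = 1 << 15

# Decay times are given in milliseconds.
dirty_decay_s = int(options["dirty_decay_ms"]) / 1000
muzzy_decay_s = int(options["muzzy_decay_ms"]) / 1000

print(f"largest cached size class: {tcache_max_bytes // 1024} KiB")
print(f"default largest cached size class: {default_tcache_max_bytes // 1024} KiB")
print(f"dirty decay: {dirty_decay_s} s, muzzy decay: {muzzy_decay_s} s")
```

With the value in this PR, the per-thread cache may hold objects up to 64 KiB instead of the 32 KiB default. This says nothing about the total memory held by each thread cache, which jemalloc sizes on its own.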
…lity with MemTracker apache#12496
…3367 gperftools/tcmalloc [https://github.com/gperftools/gperftools] is outdated: for many years it has received only bug fixes and no new features. It is currently the default allocator in Doris.
google/tcmalloc [https://github.com/google/tcmalloc] is under active development, has many new features, and is expected to perform better than jemalloc, but it has no stable release yet. Its build dependencies are also complex and hard to integrate, it is incompatible with gperftools/tcmalloc, and reference documentation is scarce.
jemalloc [https://github.com/jemalloc/jemalloc] performs better than gperftools/tcmalloc under high concurrency and is mature and stable, so it is a good candidate to become the default memory allocator. Tested in Doris: #12496
Proposed changes
Issue Number: close #xxx
Problem summary
Switching to jemalloc significantly improves memory allocation performance under multi-threaded, high-concurrency workloads.
Test commit: 1a198b3 (Wed Aug 31). This commit does not include #12436 (Optimize tcmalloc performance), so the tcmalloc results here may be lower than with the latest code.
References:
https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
https://jemalloc.net/jemalloc.3.html
jemalloc/jemalloc#1621
https://github.com/jemalloc/jemalloc/wiki/Getting-Started
jemalloc/jemalloc#2058
jemalloc/jemalloc#2172
Performance Verification Results
1. Single MySQL client, sequential execution
The flame graph shows that when a single MySQL client submits SQL sequentially, most of the time is not spent in memory allocation, so the performance improvement is small.
2. JMeter stress test, with page cache, chunk allocator, and mem pool disabled
jemalloc performance is doubled.
2) SSB, only q1.1 - q3.4
the performance of jemalloc is improved by 25%.
3. JMeter stress test, with page cache, chunk allocator, and mem pool enabled (default conf)
the performance of jemalloc is improved by 46%.
2) SSB, only q1.1 - q3.4
the performance of jemalloc is improved by 19%.
4. JMeter stress test, with page cache disabled and chunk allocator and mem pool enabled
5. JMeter stress test, v1.1.1 vs master
sql:
v1.1.1:
master:
6. JMeter stress test, trying to replace ChunkAllocator with jemalloc in the vectorized engine (vec)
With jemalloc used for sizes < 4K and ChunkAllocator used for 4K < size < 64M, performance is still slightly higher.
TODO: more testing and tuning, with the goal of replacing ChunkAllocator.
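The size-based split described in item 6 can be sketched as a simple routing function. This is only an illustration in Python, not the Doris implementation (which is C++). The function name `choose_allocator` is hypothetical, and the handling of exactly 4K and of sizes at or above 64M is an assumption, because the text above does not specify those cases.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_LIMIT = 4 * KIB    # below this, the experiment used jemalloc
CHUNK_LIMIT = 64 * MIB   # below this (and above 4K), it used ChunkAllocator


def choose_allocator(size: int) -> str:
    """Hypothetical router mirroring the thresholds described in item 6."""
    if size < 0:
        raise ValueError("allocation size must be non-negative")
    if size < SMALL_LIMIT:
        return "jemalloc"
    if size < CHUNK_LIMIT:
        # Assumption: a request of exactly 4K is treated like the larger range.
        return "ChunkAllocator"
    # Assumption: the source does not say what happens at 64M and above.
    return "unspecified"


for request in (512, 8 * KIB, 1 * MIB, 128 * MIB):
    print(request, "->", choose_allocator(request))
```

Small requests go to jemalloc, where its size classes and per-thread caches apply, while mid-sized requests continue to use ChunkAllocator until further testing shows whether it can be replaced entirely.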
Performance Verification Reproduce
1. Single MySQL client, sequential execution
be.conf add
2. JMeter stress test, with page cache, chunk allocator, and mem pool disabled
set global parallel_fragment_exec_instance_num=1;
be.conf add
jmeter conf
3. JMeter stress test, with page cache, chunk allocator, and mem pool enabled
set global parallel_fragment_exec_instance_num=1;
be.conf add
4. JMeter stress test, with page cache disabled and chunk allocator and mem pool enabled
set global parallel_fragment_exec_instance_num=1;
be.conf add
5. JMeter stress test, v1.1.1 vs master
set global parallel_fragment_exec_instance_num=1;
be.conf add
Performance Verification Data
tcmalloc_vs_jemalloc.zip
Checklist(Required)