@xinyiZzz xinyiZzz commented Sep 8, 2022

Proposed changes

Issue Number: close #xxx

Problem summary

  1. Previously jemalloc was compiled together with arrow; it is now compiled separately.
  2. Modify the default parameters of jemalloc to achieve better performance and lower memory usage.
    This significantly improves memory performance under multi-threading and high concurrency.
  3. Make the mem tracker compatible with jemalloc, so that the query mem tracker value is consistent with the value under tcmalloc.

Test commit: Wed Aug 31 1a198b3.
It does not include #12436 (Optimize tcmalloc performance), so the tcmalloc test results may be lower than with the latest code.

refer to:
https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
https://jemalloc.net/jemalloc.3.html
jemalloc/jemalloc#1621
https://github.com/jemalloc/jemalloc/wiki/Getting-Started
jemalloc/jemalloc#2058
jemalloc/jemalloc#2172

Performance Verification Results

1、single mysql client sequential execution

Each time is the sum, over all queries, of the per-query average across multiple executions.

  1. Clickbench:

| | sum of time (s) | peak mem (M) |
|---|---|---|
| tcmalloc | 210 | 11927 |
| jemalloc default conf | 203.17 | 20777 |
| jemalloc optimize conf | 194.45 | 13594 |

  2. SSB:

| | sum of time (s) | peak mem (M) |
|---|---|---|
| tcmalloc | 15.845 | 832 |
| jemalloc default conf | 15.052 | 2920 |
| jemalloc optimize conf | 13.828 | 2091 |

The flame graph shows that when SQL is submitted from a single mysql client, most of the time is not spent on memory allocation, so the performance improvement is small.

2、jmeter stress test, disable page cache and chunk allocator and mem pool

The results are taken from the second stress test run of each SQL. jemalloc has a cold start: it caches more aggressively during the first run and only starts to get faster from the second run.

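  1. Clickbench, only q13 + q14

| | sum of time (s) |
|---|---|
| tcmalloc | 82611 |
| jemalloc default conf | 72688 |
| jemalloc optimize conf | 42506 |

With the optimized conf, the performance of jemalloc is roughly doubled relative to tcmalloc.

  2. SSB, only q1.1 - q3.4

| | sum of time (s) |
|---|---|
| tcmalloc | 76760 |
| jemalloc default conf | 73326 |
| jemalloc optimize conf | 57297 |

With the optimized conf, the performance of jemalloc is improved by 25% relative to tcmalloc.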

3、jmeter stress test, enable page cache and chunk allocator and mem pool (default conf)

  1. Clickbench, only q13 + q14

| | sum of time (s) |
|---|---|
| tcmalloc | 73616 |
| jemalloc default conf | 62220 |
| jemalloc optimize conf | 39565 |

With the optimized conf, the performance of jemalloc is improved by 46% relative to tcmalloc.

  2. SSB, only q1.1 - q3.4

| | sum of time (s) |
|---|---|
| tcmalloc | 53709 |
| jemalloc default conf | 47790 |
| jemalloc optimize conf | 43297 |

With the optimized conf, the performance of jemalloc is improved by 19% relative to tcmalloc.
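
The percentages quoted for stress tests 2 and 3 can be re-derived from the tables. The sketch below assumes "improved by N%" means the reduction in summed time of the optimized jemalloc conf relative to tcmalloc; that reading is inferred from the numbers and is not stated explicitly in the PR. The function names are illustrative only.

```python
def time_reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` lowers total time relative to `baseline`."""
    return (baseline - candidate) / baseline * 100.0


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster `candidate` is than `baseline` (ratio of total times)."""
    return baseline / candidate


# (tcmalloc, jemalloc optimize conf) pairs copied from the tables above.
RESULTS = {
    "test 2, Clickbench q13+q14": (82611, 42506),
    "test 2, SSB q1.1-q3.4": (76760, 57297),
    "test 3, Clickbench q13+q14": (73616, 39565),
    "test 3, SSB q1.1-q3.4": (53709, 43297),
}

for name, (tc, je) in RESULTS.items():
    print(f"{name}: {time_reduction_pct(tc, je):.1f}% less time, "
          f"{speedup(tc, je):.2f}x speedup")
```

Under this reading, "doubled" for the first Clickbench case corresponds to the speedup ratio, while the 25%, 46% and 19% figures correspond to the time reduction.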

4、jmeter stress test, disable page cache, enable chunk allocator and mem pool

  1. Clickbench, only q13 + q14

| | sum of time (s) |
|---|---|
| tcmalloc | 74949 |
| jemalloc default conf | 61335 |
| jemalloc optimize conf | TODO |

5、jmeter stress test, v1.1.1 vs master

sql:

    with e as (
        select b.Title, b.measure1, b.measure2
        from (
            select a.Title,
                   sum(case when a.PageCharset = 'windows-1251;charset' then 1 else 0 end) as measure1,
                   sum(case when a.RefererHash in ('-296158784638538920', '-6389909303817027441') then cast(a.JavaEnable AS Double) else 0 end) as measure2
            from hits a
            where CounterID = 62
              AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31'
              AND DontCountHits = 0 AND IsRefresh = 0 AND Title <> ''
            group by 1
        ) b
    ),
    f as (
        select avg(e.measure2) as avg1, avg(e.measure1) as avg2,
               var_samp(e.measure2) as variance1, var_samp(e.measure1) as variance2
        from e
    ),
    g as (
        select sum((e.measure1 - f.avg2) * (e.measure2 - f.avg1)) as covariance
        from e, f
    )
    select f.avg2, f.variance1, f.variance2, g.covariance as covariance
    from f, g;
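
For readers who want to see what this query computes, here is a minimal Python sketch of the final aggregation step. It assumes the per-Title `(measure1, measure2)` pairs have already been produced by the inner query; the `summarize` helper and the sample rows are hypothetical and not part of the PR. Note that the `covariance` column is the un-normalized sum of cross deviations, not a sample covariance.

```python
from statistics import mean, variance


def summarize(rows):
    """rows: list of (measure1, measure2) pairs, one per Title group."""
    m1 = [r[0] for r in rows]
    m2 = [r[1] for r in rows]
    avg2 = mean(m1)  # the query names avg(measure1) "avg2"
    avg1 = mean(m2)  # and avg(measure2) "avg1"
    return {
        "avg2": avg2,
        "variance1": variance(m2),  # sample variance, like var_samp(measure2)
        "variance2": variance(m1),  # sample variance, like var_samp(measure1)
        # Sum of products of deviations from the means (no division by n - 1).
        "covariance": sum((a - avg2) * (b - avg1) for a, b in rows),
    }


# Made-up rows, for illustration only.
print(summarize([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]))
```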

v1.1.1:

| | jmeter thread=10 | jmeter thread=20 | jmeter thread=30 |
|---|---|---|---|
| tcmalloc | 1095 | 1818 | 2870 |
| jemalloc default conf | 581 | 1171 | 1802 |

master:

| | jmeter thread=8 |
|---|---|
| tcmalloc | 573 |
| jemalloc default conf | 502 |

  1. On v1.1.1, jemalloc improves performance by between 55% and roughly double. The more jmeter threads there are, the greater the bottleneck outside memory allocation and the smaller the performance improvement.
  2. On master, jemalloc improves performance by 20%. The gain is smaller because master already does a lot of memory reuse.

6、jmeter stress test, Try replacing ChunkAllocator with jemalloc in vec

  1. Clickbench, only q13 + q14

| | sum of time (s) |
|---|---|
| jemalloc lg_tcache_max:16 + ChunkAllocator | 39565 |
| jemalloc lg_tcache_max:26 | 40158 |

With sizes below 4K served by jemalloc and sizes between 4K and 64M served by ChunkAllocator, performance is still slightly better (39565 vs. 40158).
TODO: more testing and tuning, with the aim of replacing ChunkAllocator.

Performance Verification Reproduce

1、single mysql client sequential execution

Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`
          `disable_mem_pools=true`
          `chunk_reserved_bytes_limit=1`
  1. Clickbench
vim tools/clickbench-tools/run-clickbench-queries.sh
    pre_set "set global parallel_fragment_exec_instance_num=1;"
    pre_set "set global exec_mem_limit=20G;"
sh tools/clickbench-tools/run-clickbench-queries.sh
  2. SSB
vim tools/ssb-tools/bin/run-ssb-queries.sh
    pre_set "set global parallel_fragment_exec_instance_num=1;"
sh tools/ssb-tools/bin/run-ssb-queries.sh

2、jmeter stress test, disable page cache and chunk allocator and mem pool

set global parallel_fragment_exec_instance_num=1;
Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`
          `disable_mem_pools=true`
          `chunk_reserved_bytes_limit=1`

jmeter conf

  1. Clickbench
        <stringProp name="ThreadGroup.num_threads">30</stringProp>
        <stringProp name="ThreadGroup.ramp_time">1</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
        <stringProp name="ThreadGroup.duration">100</stringProp>
        <stringProp name="ThreadGroup.delay">0</stringProp>
  2. SSB
        <stringProp name="ThreadGroup.num_threads">10</stringProp>
        <stringProp name="ThreadGroup.ramp_time">1</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
        <stringProp name="ThreadGroup.duration">30</stringProp>
        <stringProp name="ThreadGroup.delay">0</stringProp>

3、jmeter stress test, enable page cache and chunk allocator and mem pool

set global parallel_fragment_exec_instance_num=1;
Add to be.conf:

          `enable_tcmalloc_hook=false`

4、jmeter stress test, disable page cache, enable chunk allocator and mem pool

set global parallel_fragment_exec_instance_num=1;
Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`

5、jmeter stress test, v1.1.1 vs master

set global parallel_fragment_exec_instance_num=1;
Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`
          `disable_mem_pools=true`
          `chunk_reserved_bytes_limit=1`
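
The be.conf settings for the five reproduction scenarios differ only in which caches are disabled. As a rough summary, the sketch below collects them in one place and renders the lines for a given scenario. The grouping of `disable_mem_pools` and `chunk_reserved_bytes_limit=1` as "mem pool and chunk allocator off" is inferred from the scenario titles, and `render_be_conf` is a hypothetical helper, not something shipped with Doris.

```python
# Setting groups, named after what the scenario titles say they switch off.
BASE = {"enable_tcmalloc_hook": "false"}
NO_PAGE_CACHE = {"disable_storage_page_cache": "true"}
NO_POOLS = {"disable_mem_pools": "true", "chunk_reserved_bytes_limit": "1"}

# Scenario number -> be.conf additions, as listed in the sections above.
SCENARIOS = {
    1: {**BASE, **NO_PAGE_CACHE, **NO_POOLS},  # single mysql client
    2: {**BASE, **NO_PAGE_CACHE, **NO_POOLS},  # everything disabled
    3: {**BASE},                               # everything enabled (default conf)
    4: {**BASE, **NO_PAGE_CACHE},              # only page cache disabled
    5: {**BASE, **NO_PAGE_CACHE, **NO_POOLS},  # v1.1.1 vs master
}


def render_be_conf(scenario: int) -> str:
    """Return the be.conf lines to add for the given scenario number."""
    return "\n".join(f"{key}={value}" for key, value in SCENARIOS[scenario].items())


print(render_be_conf(4))
```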

Performance Verification Data

tcmalloc_vs_jemalloc.zip

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No
    • No Need
  3. Has document been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

@xinyiZzz xinyiZzz force-pushed the 20200907_jemalloc_conf branch 2 times, most recently from 163eeb6 to 2674fc2 Compare September 14, 2022 00:48
@xinyiZzz xinyiZzz changed the title [enhancement](memory) Optimize jemalloc compilation and configuration [enhancement](memory) Jemalloc performance optimization and compatibility with MemTracker Sep 14, 2022
@xinyiZzz xinyiZzz force-pushed the 20200907_jemalloc_conf branch 2 times, most recently from d680763 to 7c69ef6 Compare September 16, 2022 05:42
@yiguolei

Both ClickHouse and PingCAP have switched to jemalloc.
ClickHouse/ClickHouse#2773
pingcap/tiflash#424

@xinyiZzz xinyiZzz force-pushed the 20200907_jemalloc_conf branch from 7c69ef6 to a18e340 Compare September 26, 2022 16:34
@xinyiZzz

> Both ClickHouse and PingCAP have switched to jemalloc. ClickHouse/ClickHouse#2773 pingcap/tiflash#424

I will look into more references later.

@xinyiZzz xinyiZzz added area/storage/in-memory Issues or PRs related to the memory storage engine dev/1.1.3-deprecated labels Sep 27, 2022
@xinyiZzz xinyiZzz force-pushed the 20200907_jemalloc_conf branch 2 times, most recently from 063acba to 2ee5a7c Compare September 27, 2022 11:44
@yiguolei yiguolei left a comment

LGTM

export LIBHDFS3_CONF="${DORIS_HOME}/conf/hdfs-site.xml"

export MALLOC_CONF="percpu_arena:percpu,background_thread:true,metadata_thp:auto,muzzy_decay_ms:30000,dirty_decay_ms:30000,oversize_threshold:0,lg_tcache_max:16"

Contributor

Could we set the thread cache size to 1MB? Doris may have 1000 threads, so the total cache would be 1GB, which I think is OK.
Also, the decay time is too long; I think 30s is OK?

Contributor Author

muzzy_decay_ms and dirty_decay_ms are already set to 30s (30000 ms).

lg_tcache_max is not the size of the thread cache. It is the base-2 logarithm of the maximum size class (memory block size) that the thread cache will cache. The default is 32K, and the total size of the thread cache is controlled by jemalloc itself.
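
To make the units concrete, the following sketch parses the `MALLOC_CONF` string from the diff above and converts `lg_tcache_max` and the decay settings into bytes and seconds. The helper names are illustrative, and the parser is a simplification that assumes plain `key:value` pairs separated by commas, as in this string.

```python
MALLOC_CONF = (
    "percpu_arena:percpu,background_thread:true,metadata_thp:auto,"
    "muzzy_decay_ms:30000,dirty_decay_ms:30000,oversize_threshold:0,"
    "lg_tcache_max:16"
)


def parse_malloc_conf(conf: str) -> dict:
    """Split a comma-separated 'key:value' option string into a dict of strings."""
    return dict(item.split(":", 1) for item in conf.split(",") if item)


def tcache_max_bytes(lg_tcache_max: int) -> int:
    """Largest size class (in bytes) cached per thread for a given lg_tcache_max."""
    return 1 << lg_tcache_max


opts = parse_malloc_conf(MALLOC_CONF)
print("max cached size class:", tcache_max_bytes(int(opts["lg_tcache_max"])) // 1024, "KiB")
print("dirty decay:", int(opts["dirty_decay_ms"]) / 1000, "s")
print("muzzy decay:", int(opts["muzzy_decay_ms"]) / 1000, "s")
```

With `lg_tcache_max:16` the largest cached size class is 64 KiB, double the 32 KiB default (`lg_tcache_max:15`). The `lg_tcache_max:26` value used in test 6 corresponds to 64 MiB, which matches the upper bound of the ChunkAllocator range mentioned there.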

@yiguolei yiguolei merged commit 16bb5cb into apache:master Sep 28, 2022
FreeOnePlus pushed a commit to FreeOnePlus/doris that referenced this pull request Oct 8, 2022
yiguolei pushed a commit that referenced this pull request Oct 14, 2022
…3367

gperftools/tcmalloc [https://github.com/gperftools/gperftools] is outdated: it has had no new features for many years, only bug fixes. It is currently the default allocator in Doris.

google/tcmalloc [https://github.com/google/tcmalloc] has been very active recently, has many new features, and is expected to perform better than jemalloc, but there is currently no stable version.
Moreover, its compilation dependencies are complex and difficult to integrate, it is incompatible with gperftools/tcmalloc, and there are few reference documents.

jemalloc [https://github.com/jemalloc/jemalloc] performs better than gperftools/tcmalloc under high concurrency and is mature and stable; it is expected to become the default memory allocator.
Tested in Doris: #12496

Labels

area/storage/in-memory Issues or PRs related to the memory storage engine area/vectorization
