[Enhancement] [Memory] [Vectorized] Stress test and optimize memory allocation #9581

xinyiZzz · 2022-05-15T23:23:45Z

Proposed changes

Issue Number: close #9540
#9580

Problem Summary:

Ran high-concurrency stress tests on SSB and a wide table, comparing performance with the vectorization engine turned on and off. With the vectorization engine on, most SSB queries are slower.
Optimized the Allocator in the vectorization engine. Most queries improve by about 10%.
Allocations between 4KB and 64MB go through ChunkAllocator, allocations smaller than 4KB go through malloc, and allocations larger than 64MB go through mmap.
Optimized ChunkAllocator: raised the limit that allows chunks to be stolen from other cores' arenas, and tuned the reserved bytes conf.

Checklist(Required)

Does it affect the original behavior: (Yes)
Has unit tests been added: (No)
Has document been added or modified: (Yes)
Does it need to update dependencies: (No)
Are there any changes that cannot be rolled back: (Yes)

Further comments

Stress testing the vectorization engine.

1. Env and Test Set

> Based on Doris V1.0

Env: 1 FE, 1 BE
Test Set: 
	SSB, 100G, lineorder about 600.03 million rows
	Wide table from an online service, 419 columns, 1710549 rows
set global parallel_fragment_exec_instance_num=10

jmeter conf:
	<stringProp name="ThreadGroup.num_threads">100</stringProp>
	<stringProp name="ThreadGroup.ramp_time">1</stringProp>
	<boolProp name="ThreadGroup.scheduler">true</boolProp>
	<stringProp name="ThreadGroup.duration">30</stringProp>
	<stringProp name="ThreadGroup.delay">0</stringProp>

actual concurrency = parallel_fragment_exec_instance_num * ThreadGroup.num_threads

2. Test

T0: Master, set global enable_vectorized_engine=false;
T1: Master, set global enable_vectorized_engine=true;
T2: Master, set global enable_vectorized_engine=true, tc_max_total_thread_cache_bytes=100G;
T3: This PR, set global enable_vectorized_engine=true, Allocator uses ChunkAllocator for 4K < size < 64M;
T4: This PR, set global enable_vectorized_engine=true, Allocator uses ChunkAllocator for 4K < size < 64M, and compiled with USE_MEM_TRACKER=0;
R1: (T0mid-T1mid)/T0mid, performance change from turning the vectorization engine on.
R2: (T1mid-T3mid)/T1mid, performance change from allocating 4K < size < 64M memory through ChunkAllocator in the vectorization engine.
R3: (T1mid-T4mid)/T1mid, same as R2, with memtracker also disabled at compile time.

Table notes: "xxx,xxx,xxx" is the AvgTime(ms) of each of 3 repeated runs. "mid" in the R formulas is the median of those three values.

query	num_threads	R1	R2	R3	T0	T1	T2	T3	T4
Q1.1	100	4.7%	9.2%	12.1%	36252,36903,36800	34297,35053,36087	35757,34483,33825	33314,31657,31838	30801,31496,30445
Q1.2	100	-3.8%	7%	6.2%	24017,24338,25478	25273,25222,25914	26647,24651,25406	23453,23498,23604	23771,23084,23704
Q1.3	100	-2.6%	9%	7.9%	24349,23780,22844	24073,24487,24401	23842,23149,24198	22614,22984,23050	22678,22466,22225
Q2.1	20	-11.2%	0.6%	19.3%	89466,21528,21889	26300,24345,24222	89662,25042,24069	24094,24542,24197	20538,19651,19627
Q2.2	20	12.7%	4.9%	0.8%	16963,21435,18154	15855,16936,15047	16006,17251,16593	15072,14407,15648	15716,16347,15588
Q2.3	20	1%	3.2%	8.9%	15183,16194,13977	15551,15033,14801	14338,14605,14531	14302,14548,15301	14318,13601,13689
Q3.1	20	5.4%	19.4%	23.6%	32021,32176,31427	31037,30283,30231	38002,30272,30016	25162,23187,24411	27673,23147,22492
Q3.2	20	-8%	15.8%	17.3%	10379,10433,9893	11837,11184,11223	11403,11481,9788	9296,9452,9455	9576,9172,9283
Q3.3	20	-5%	10.6%	12.8%	8559,8472,8639	8713,9390,8992	8367,8153,8133	7998,8618,8040	7952,7476,7845
Q4.1	20	-35%	27.9%	31.8%	32249,29965,29136	47405,40357,40443	41912,36981,37571	31230,27683,29166	31848,27585,27435
Q4.2	20	-73.5%	16.2%	15.2%	19979,18798,17169	34066,32614,30849	34560,34460,35194	27117,29645,27337	27651,29205,27149
Q4.3	20	-46.2%	-2.8%	0.5%	19357,20762,19992	28256,29862,29230	27647,29216,29091	30523,30067,29260	29092,26644,30017
Wide table (419 columns)	100	100%	17.6%	17.9%	no work	4211,4546,4710	4089,4479,4551	3679,3745,3816	3664,3732,3829
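
As a cross-check on how the R columns are derived, here is a minimal sketch that takes the median of the three AvgTime values for two configurations and computes the relative change. The function names are hypothetical and are not part of the PR; only the formula (baseline_mid - candidate_mid) / baseline_mid comes from the definitions of R1 to R3 above.

```cpp
#include <algorithm>
#include <array>

// Median of the three AvgTime(ms) values reported for one configuration.
// Hypothetical helper, not part of the PR.
inline double median_of_three(std::array<double, 3> runs) {
    std::sort(runs.begin(), runs.end());
    return runs[1];
}

// Relative improvement of `candidate` over `baseline`, in percent.
// A positive result means the candidate configuration is faster.
inline double improvement_percent(const std::array<double, 3>& baseline,
                                  const std::array<double, 3>& candidate) {
    const double b = median_of_three(baseline);
    const double c = median_of_three(candidate);
    return (b - c) / b * 100.0;
}
```

Feeding in the Q1.1 row (T0 = 36252,36903,36800 and T1 = 34297,35053,36087) gives roughly 4.7, which matches the R1 column for that query.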

3. Detailed description

T2: In theory, when the tcmalloc thread cache has enough capacity, the spin lock in the central free list is largely avoided. In practice, the spin lock cost is still high in high-concurrency queries. I will test this in more detail later.
T3: Because the tcmalloc thread cache cannot avoid the spin lock, introducing ChunkAllocator is equivalent to adding a layer of cache in user mode.
In allocator.h, allocations between 4KB and 64MB go through ChunkAllocator, allocations smaller than 4KB go through malloc (for example, tcmalloc), and allocations larger than 64MB go through mmap.
In the actual test, ChunkAllocator is slower than malloc for allocations smaller than 4KB and slower than mmap for allocations larger than 64MB. The 4KB threshold is an empirical value that still needs to be confirmed by more detailed testing.
T4: Disabling memtracker at compile time is an option during a POC. Memtracker records the consumption value through an atomic variable. Under high query concurrency, the atomic variable's spin lock has a high cost. I'll optimize memtracker. After that
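
The size-based routing described for T3 can be sketched as a small classifier. This is only an illustration under stated assumptions: the constant names, the enum, and the function are hypothetical, and the handling of a request of exactly 4KB is a guess (the diff in this PR shows `size >= MMAP_THRESHOLD` for the mmap path, so 64MB itself is treated as mmap here).

```cpp
#include <cstddef>

// Thresholds from the PR description; the constant names are hypothetical.
constexpr std::size_t kChunkThreshold = 4 * 1024;          // 4KB
constexpr std::size_t kMmapThreshold = 64 * 1024 * 1024;   // 64MB

// Which backend would serve a request of a given size.
enum class AllocPath { Malloc, ChunkAllocator, Mmap };

// Assumption: a request of exactly 4KB is routed to ChunkAllocator.
inline AllocPath choose_alloc_path(std::size_t size) {
    if (size >= kMmapThreshold) return AllocPath::Mmap;
    if (size >= kChunkThreshold) return AllocPath::ChunkAllocator;
    return AllocPath::Malloc;
}
```

Keeping the thresholds as named constants makes the 4KB value, which the description calls empirical, easy to retune after further testing.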

be/src/common/config.h

cambyzju · 2022-05-16T03:14:52Z

be/src/runtime/memory/chunk_allocator.cpp

    } while (!_reserved_bytes.compare_exchange_weak(old_reserved_bytes, new_reserved_bytes));

+    // Reduce set metric frequency
+    if (_reserved_bytes % 100 == 32) {


How can we be sure this is correct?

At first glance, neither (_reserved_bytes % 100 < 32) nor (_reserved_bytes % 100 > 32) will update the metric.

ChunkAllocator allocates and frees memory many times, and the size of each allocate/free is a multiple of 2, so _reserved_bytes % 100 == 32 will definitely happen, and each time it does the latest _reserved_bytes value is set.

A real-time, exact _reserved_bytes value is not required. Usually, the value of _reserved_bytes is equal to the ChunkAllocator MemTracker. The _reserved_bytes metric only matters when verifying the accuracy of MemTracker.

Therefore, setting the metric less often reduces the performance impact.
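
To make the sampling behavior in this thread concrete, here is a minimal sketch of a counter that is updated with a compare-exchange loop and publishes its value to a metric only when the value modulo 100 equals 32. The struct and member names are hypothetical stand-ins, not the ChunkAllocator code, and the sketch checks the freshly computed value rather than re-reading the atomic.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the reserved-bytes bookkeeping discussed above.
struct ReservedBytesCounter {
    std::atomic<int64_t> reserved_bytes{0};
    std::atomic<int64_t> metric_value{0};    // stand-in for the exported metric
    std::atomic<int64_t> metric_updates{0};  // how often the metric was set

    void add(int64_t delta) {
        int64_t old_value = reserved_bytes.load();
        int64_t new_value;
        do {
            new_value = old_value + delta;
            // On failure, compare_exchange_weak reloads old_value and we retry.
        } while (!reserved_bytes.compare_exchange_weak(old_value, new_value));

        // Reduce set-metric frequency: publish only on a sampled subset of values.
        if (new_value % 100 == 32) {
            metric_value.store(new_value);
            metric_updates.fetch_add(1);
        }
    }
};
```

With a fixed step of 4096 bytes starting from zero, the condition first holds at the 17th step (17 * 4096 = 69632) and then once every 25 steps, so in that case the metric is refreshed on about 4% of updates. Other size mixes will hit the condition at a different rate.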

build.sh

be/src/gutil/strings/numbers.cc

yiguolei · 2022-06-28T06:23:48Z

be/src/vec/common/allocator.h

+        void* buf;
+
+        if (size >= MMAP_THRESHOLD) {
+            if (alignment > MMAP_MIN_ALIGNMENT)


Why not call populate to pre-fault the memory and avoid too many page faults during use?

No need for zero-fill, because mmap guarantees it.
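
For reference, a minimal sketch of what pre-populating an anonymous mapping could look like, assuming "populate" refers to the Linux `MAP_POPULATE` flag. The helper name is hypothetical and this is not the PR's Allocator code. Anonymous mappings are zero-filled by the kernel either way, which is why no explicit memset is needed.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper, not the PR's code: map `size` bytes of anonymous memory.
// When `populate` is true and the platform defines MAP_POPULATE (Linux),
// pages are pre-faulted at mmap time instead of on first touch.
inline void* mmap_anonymous(std::size_t size, bool populate) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    if (populate) flags |= MAP_POPULATE;
#else
    (void)populate;  // flag not available on this platform
#endif
    void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return buf == MAP_FAILED ? nullptr : buf;
}
```

The trade-off is that pre-faulting moves the page-fault cost to allocation time and commits physical memory up front, which may or may not pay off for large buffers that are only partly used.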

be/src/runtime/memory/chunk_allocator.h

xinyiZzz · 2022-06-28T08:57:43Z

Based on the latest master (commit id: 7898c81)
The test results are the same as above.

yiguolei

lgtm

github-actions bot added the area/vectorization label May 15, 2022

xinyiZzz mentioned this pull request May 16, 2022

[Proposal] Memory performance optimization #9580

Closed

2 tasks

cambyzju reviewed May 16, 2022

be/src/common/config.h

yiguolei added this to the v1.2 milestone May 16, 2022

cambyzju reviewed May 16, 2022

xinyiZzz force-pushed the fix_tracker_lru_cache_push branch from 9b9b6bb to f97096d May 16, 2022 17:58

github-actions bot added the kind/docs Categorizes issue or PR as related to documentation. label May 16, 2022

yangzhg reviewed May 17, 2022

build.sh (outdated)

yangzhg reviewed May 17, 2022

be/src/gutil/strings/numbers.cc

xinyiZzz force-pushed the fix_tracker_lru_cache_push branch 2 times, most recently from db7b85a to 345679c June 28, 2022 03:52

yiguolei reviewed Jun 28, 2022

be/src/runtime/memory/chunk_allocator.h (outdated)

xinyiZzz added 2 commits June 28, 2022 15:53

vec stress test, Allocator introduce chunkallocator

64056a8

fix comment

c20832c

xinyiZzz force-pushed the fix_tracker_lru_cache_push branch from 345679c to c20832c June 28, 2022 08:53

yiguolei approved these changes Jun 28, 2022

yiguolei merged commit deeb302 into apache:master Jun 28, 2022
