@xinyiZzz xinyiZzz commented May 15, 2022

Proposed changes

Issue Number: close #9540
#9580

Problem Summary:

  1. Ran a high-concurrency stress test on SSB and a wide table, comparing performance with the vectorization engine turned on and off. With the vectorization engine on, most SSB queries are slower.

  2. Optimize the Allocator in the vectorization engine. Most queries improve by about 10%.
    Memory allocations between 4KB and 64MB go through ChunkAllocator, those smaller than 4KB go through malloc, and those larger than 64MB go through mmap.

  3. Optimize ChunkAllocator: raise the limit on how many chunks can be stolen from other cores' arenas, and optimize the reserved bytes conf.
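
The size-based routing described in item 2 can be sketched as follows. This is a minimal illustration, not the code in allocator.h: the enum, constant, and function names are hypothetical, and the treatment of sizes exactly equal to 4KB is an assumption (the `size >= MMAP_THRESHOLD` check quoted in the review suggests that exactly 64MB goes to mmap).

```cpp
#include <cstddef>

// Hypothetical names: the thresholds come from the PR description,
// but these identifiers are illustrative, not the ones in allocator.h.
enum class Backend { Malloc, ChunkAllocator, Mmap };

constexpr std::size_t kChunkThreshold = 4UL * 1024;         // 4KB
constexpr std::size_t kMmapThreshold = 64UL * 1024 * 1024;  // 64MB

// Pick the allocation backend for a request of `size` bytes.
// Assumption: exactly 4KB is routed to ChunkAllocator; the description
// does not say how that boundary is handled.
inline Backend choose_backend(std::size_t size) {
    if (size >= kMmapThreshold) return Backend::Mmap;
    if (size >= kChunkThreshold) return Backend::ChunkAllocator;
    return Backend::Malloc;
}
```

Under these assumptions, a 100-byte request maps to `Backend::Malloc`, a 1MB request to `Backend::ChunkAllocator`, and a 64MB request to `Backend::Mmap`.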

Checklist(Required)

  1. Does it affect the original behavior: (Yes)
  2. Has unit tests been added: (No)
  3. Has document been added or modified: (Yes)
  4. Does it need to update dependencies: (No)
  5. Are there any changes that cannot be rolled back: (Yes)

Further comments

Stress testing the vectorization engine.

1. Env and Test Set

> Based on Doris V1.0

Env: 1 FE, 1 BE
Test Set: 
	SSB, 100G, lineorder about 600.03 million rows
	Wide table from online service, 419 columns, 1710549 rows
set global parallel_fragment_exec_instance_num=10

jmeter conf:
	<stringProp name="ThreadGroup.num_threads">100</stringProp>
	<stringProp name="ThreadGroup.ramp_time">1</stringProp>
	<boolProp name="ThreadGroup.scheduler">true</boolProp>
	<stringProp name="ThreadGroup.duration">30</stringProp>
	<stringProp name="ThreadGroup.delay">0</stringProp>

actual concurrency = parallel_fragment_exec_instance_num * ThreadGroup.num_threads

2. Test

  • T0: Master, set global enable_vectorized_engine=false;
  • T1: Master, set global enable_vectorized_engine=true;
  • T2: Master, set global enable_vectorized_engine=true, tc_max_total_thread_cache_bytes=100G;
  • T3: This PR, set global enable_vectorized_engine=true; allocations with 4K < size < 64M use ChunkAllocator;
  • T4: This PR, set global enable_vectorized_engine=true; allocations with 4K < size < 64M use ChunkAllocator, and compiled with USE_MEM_TRACKER=0;
  • R1: (T0mid-T1mid)/T0mid, compares performance with the vectorization engine off and on. "mid" is the median of the three runs.
  • R2: (T1mid-T3mid)/T1mid, the performance change from allocating 4K < size < 64M memory through ChunkAllocator in the vectorization engine.
  • R3: (T1mid-T4mid)/T1mid, same as above, with memtracker disabled.

Table notes: "xxx,xxx,xxx" means the query was repeated 3 times; each value is the AvgTime (ms) of one run.
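
As a worked example of the R1, R2, and R3 ratios, the sketch below takes the median of the three recorded times and computes the relative improvement in percent. The function names are hypothetical; the median interpretation of "mid" is inferred from the reported numbers (for Q1.1 it reproduces R1 ≈ 4.7% and R2 ≈ 9.2%).

```cpp
#include <algorithm>
#include <array>

// Median of the three AvgTime (ms) values recorded for one test setting.
inline double median3(std::array<double, 3> runs) {
    std::sort(runs.begin(), runs.end());
    return runs[1];
}

// Relative improvement of `candidate` over `baseline`, in percent.
// Positive means the candidate has the lower (faster) median time.
inline double improvement_pct(const std::array<double, 3>& baseline,
                              const std::array<double, 3>& candidate) {
    const double b = median3(baseline);
    return (b - median3(candidate)) / b * 100.0;
}
```

For Q1.1, calling `improvement_pct` with the T0 runs (36252, 36903, 36800) as the baseline and the T1 runs (34297, 35053, 36087) as the candidate gives (36800 - 35053) / 36800, which is about 4.7%.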

| query | num_threads | R1 | R2 | R3 | T0 (ms) | T1 (ms) | T2 (ms) | T3 (ms) | T4 (ms) |
|---|---|---|---|---|---|---|---|---|---|
| Q1.1 | 100 | 4.7% | 9.2% | 12.1% | 36252,36903,36800 | 34297,35053,36087 | 35757,34483,33825 | 33314,31657,31838 | 30801,31496,30445 |
| Q1.2 | 100 | -3.8% | 7% | 6.2% | 24017,24338,25478 | 25273,25222,25914 | 26647,24651,25406 | 23453,23498,23604 | 23771,23084,23704 |
| Q1.3 | 100 | -2.6% | 9% | 7.9% | 24349,23780,22844 | 24073,24487,24401 | 23842,23149,24198 | 22614,22984,23050 | 22678,22466,22225 |
| Q2.1 | 20 | -11.2% | 0.6% | 19.3% | 89466,21528,21889 | 26300,24345,24222 | 89662,25042,24069 | 24094,24542,24197 | 20538,19651,19627 |
| Q2.2 | 20 | 12.7% | 4.9% | 0.8% | 16963,21435,18154 | 15855,16936,15047 | 16006,17251,16593 | 15072,14407,15648 | 15716,16347,15588 |
| Q2.3 | 20 | 1% | 3.2% | 8.9% | 15183,16194,13977 | 15551,15033,14801 | 14338,14605,14531 | 14302,14548,15301 | 14318,13601,13689 |
| Q3.1 | 20 | 5.4% | 19.4% | 23.6% | 32021,32176,31427 | 31037,30283,30231 | 38002,30272,30016 | 25162,23187,24411 | 27673,23147,22492 |
| Q3.2 | 20 | -8% | 15.8% | 17.3% | 10379,10433,9893 | 11837,11184,11223 | 11403,11481,9788 | 9296,9452,9455 | 9576,9172,9283 |
| Q3.3 | 20 | -5% | 10.6% | 12.8% | 8559,8472,8639 | 8713,9390,8992 | 8367,8153,8133 | 7998,8618,8040 | 7952,7476,7845 |
| Q4.1 | 20 | -35% | 27.9% | 31.8% | 32249,29965,29136 | 47405,40357,40443 | 41912,36981,37571 | 31230,27683,29166 | 31848,27585,27435 |
| Q4.2 | 20 | -73.5% | 16.2% | 15.2% | 19979,18798,17169 | 34066,32614,30849 | 34560,34460,35194 | 27117,29645,27337 | 27651,29205,27149 |
| Q4.3 | 20 | -46.2% | -2.8% | 0.5% | 19357,20762,19992 | 28256,29862,29230 | 27647,29216,29091 | 30523,30067,29260 | 29092,26644,30017 |
| Wide table (419 columns) | 100 | 100% | 17.6% | 17.9% | no work | 4211,4546,4710 | 4089,4479,4551 | 3679,3745,3816 | 3664,3732,3829 |

3. Detailed description

  • T2: In theory, when the tcmalloc thread cache is large enough, the spin lock in the central free list is largely avoided. In practice, the spin lock cost is still high under high-concurrency queries. I will test this in more detail later.
  • T3: Because the tcmalloc thread cache cannot avoid the spin lock, introducing ChunkAllocator is equivalent to adding a layer of cache in user mode.
    In allocator.h, memory allocations between 4KB and 64MB go through ChunkAllocator, those smaller than 4KB go through malloc (for example, tcmalloc), and those larger than 64MB go through mmap.
    In actual tests, ChunkAllocator is slower than malloc for allocations under 4KB and slower than mmap for allocations over 64MB. The 4KB threshold is an empirical value that still needs to be confirmed by more detailed testing.
  • T4: Disabling memtracker at compile time is an option during a POC. Memtracker records consumption through an atomic variable, and under high query concurrency, spinning on that atomic variable is costly. I'll optimize memtracker. After that

@yiguolei yiguolei added this to the v1.2 milestone May 16, 2022
} while (!_reserved_bytes.compare_exchange_weak(old_reserved_bytes, new_reserved_bytes));

// Reduce set metric frequency
if (_reserved_bytes % 100 == 32) {
Contributor

How do we ensure correctness?

At first glance, whenever _reserved_bytes % 100 is less than or greater than 32, the metric will not be updated.

Contributor Author

ChunkAllocator allocates and frees many times, and the memory size of each allocate/free is a multiple of 2, so _reserved_bytes % 100 == 32 will definitely occur, and each time it does, the latest _reserved_bytes value is set.

A real-time, exact _reserved_bytes value is not required. Usually, the value of _reserved_bytes equals that of the ChunkAllocator MemTracker. The _reserved_bytes metric only matters when verifying the accuracy of MemTracker.

Therefore, this reduces the number of metric sets and their performance impact.
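
The trade-off discussed in this thread can be sketched as follows. This is a simplified illustration, not the PR's code: the two global atomics and `add_reserved` are hypothetical stand-ins for ChunkAllocator's counter and its exported metric, and the check uses the value just written by the CAS loop.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for ChunkAllocator's counter and its exported metric.
std::atomic<int64_t> reserved_bytes{0};
std::atomic<int64_t> reserved_bytes_metric{0};

// Add `delta` to the counter with a CAS loop, then publish the metric only
// when the new value satisfies value % 100 == 32.
// Returns true if this call refreshed the metric.
inline bool add_reserved(int64_t delta) {
    int64_t old_value = reserved_bytes.load();
    int64_t new_value;
    do {
        // On failure, compare_exchange_weak reloads old_value, so recompute.
        new_value = old_value + delta;
    } while (!reserved_bytes.compare_exchange_weak(old_value, new_value));

    if (new_value % 100 == 32) {
        reserved_bytes_metric.store(new_value);
        return true;
    }
    return false;
}
```

Starting from zero and reserving only 4096-byte chunks, the first refresh happens on the 17th call (17 * 4096 = 69632). Between refreshes the metric lags behind the counter, which is the staleness the reply accepts.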

@xinyiZzz xinyiZzz force-pushed the fix_tracker_lru_cache_push branch from 9b9b6bb to f97096d Compare May 16, 2022 17:58
@github-actions github-actions bot added the kind/docs label May 16, 2022
@xinyiZzz xinyiZzz force-pushed the fix_tracker_lru_cache_push branch 2 times, most recently from db7b85a to 345679c Compare June 28, 2022 03:52
void* buf;

if (size >= MMAP_THRESHOLD) {
if (alignment > MMAP_MIN_ALIGNMENT)
Contributor

Why not call populate to pre-populate the memory and avoid too many page faults during use?

Contributor Author

No need for zero-fill, because mmap guarantees it.
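
For context, here is a minimal Linux-only sketch of the two properties touched on in this exchange. It assumes the reviewer's "populate" refers to something like the `MAP_POPULATE` flag, which is a guess. `map_anonymous` is a hypothetical helper, not part of the PR.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Map `size` bytes of private anonymous memory, which the kernel zero-fills.
// With `populate` set, MAP_POPULATE (Linux-specific) asks the kernel to
// pre-fault the pages at map time instead of on first touch.
// Returns nullptr on failure.
inline void* map_anonymous(std::size_t size, bool populate) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    if (populate) flags |= MAP_POPULATE;
#endif
    void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return buf == MAP_FAILED ? nullptr : buf;
}
```

Zero-filling holds with or without the flag. Pre-faulting moves the page-fault cost to allocation time, which only pays off if most of the mapped pages are actually touched.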

@xinyiZzz xinyiZzz force-pushed the fix_tracker_lru_cache_push branch from 345679c to c20832c Compare June 28, 2022 08:53
@xinyiZzz
Contributor Author

Based on the latest master (commit id: 7898c81)
The test results are the same as above.

Contributor

@yiguolei yiguolei left a comment

lgtm

@yiguolei yiguolei merged commit deeb302 into apache:master Jun 28, 2022
Labels

area/vectorization kind/docs Categorizes issue or PR as related to documentation.

Development

Successfully merging this pull request may close these issues.

[Bug] DCHECK failed caused by tls_ctx()->type() == ThreadContext::TaskType::UNKNOWN