@cambyzju cambyzju commented Jun 26, 2024

The problem:
Compacting a merge-on-write (MoW) table uses a huge amount of memory. In the following example, a base compaction peaked at 45.19 GB with input_row_num=701398908, output_row_num=175349727, filtered_row_num=526049181:

succeed to do base compaction is_vertical=1. tablet=785333.186591797.61463385baeb9d59-9167a604aac7738e, output_version=[0-42], current_max_version=42, disk=/data/cdw/doris/be/storage, segments=73, input_rowset_size=16124232924, output_rowset_size=3981194537, input_row_num=701398908, output_row_num=175349727, filtered_row_num=526049181, merged_row_num=0. elapsed time=402.428s. cumulative_compaction_policy=size_based, compact_row_per_second=1742918
MemTrackerLimiter Label=BaseCompaction:785333, Type=compaction, Limit=-1.00 B(-1 B), Used=45.19 GB(48518743241 B), Peak=45.19 GB(48519793889 B)

How to Fix:
Collecting missed_rows and location_map is very expensive, so this PR optimizes it:

  1. Only save the missed_rows set when it is really needed.
  2. Add a config, enable_missing_rows_correctness_check, to control whether missed_rows is collected.
  3. Only collect location_map when enable_rowid_conversion_correctness_check is enabled.
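
The gating described in the list above can be sketched roughly as follows. This is a toy model, not the actual Doris code: `Config`, `RowLocation`, `InputRow`, `CompactionResult`, and `compact` are hypothetical names invented for illustration, and only the two flag names come from this PR. The point is that the per-row containers stay empty unless the matching check is switched on.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two BE config flags named in this PR.
struct Config {
    bool enable_missing_rows_correctness_check = false;
    bool enable_rowid_conversion_correctness_check = false;
};

// Hypothetical row address: (segment id, row id inside that segment).
using RowLocation = std::pair<std::uint32_t, std::uint32_t>;

// Hypothetical input row: where it lives and whether it is filtered out.
struct InputRow {
    RowLocation src;
    bool filtered;
};

struct CompactionResult {
    std::uint64_t rows_written = 0;
    std::set<RowLocation> missed_rows;                // filled only if the check is on
    std::map<RowLocation, RowLocation> location_map;  // old -> new, filled only if the check is on
};

CompactionResult compact(const std::vector<InputRow>& rows, const Config& cfg) {
    CompactionResult out;
    for (const InputRow& row : rows) {
        if (row.filtered) {
            // Remember the dropped row only when the check will consume it.
            if (cfg.enable_missing_rows_correctness_check) {
                out.missed_rows.insert(row.src);
            }
            continue;
        }
        // Toy output layout: a single segment 0 with sequential row ids.
        RowLocation dst{0, static_cast<std::uint32_t>(out.rows_written)};
        ++out.rows_written;
        if (cfg.enable_rowid_conversion_correctness_check) {
            out.location_map.emplace(row.src, dst);
        }
    }
    return out;
}
```

With both flags off, `compact` does the same row-level work but allocates nothing per row in `missed_rows` or `location_map`, which is the kind of saving the list above is after.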

After fix:
With the fix, memory usage drops from 45.19 GB to 9.89 GB.
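
To put the figures quoted in this description in proportion, here is a small arithmetic sketch. The helper names are made up for illustration, and it assumes the 9.89 GB figure refers to a comparable compaction of the same data. Note also that the tracker covers the whole compaction, so the per-row number is an average over everything the job allocates, not only missed_rows and location_map.

```cpp
#include <cstdint>

// Fraction of memory saved when usage drops from `before` to `after`.
constexpr double reduction_fraction(double before, double after) {
    return (before - after) / before;
}

// Average tracked bytes per input row.
constexpr double bytes_per_row(std::uint64_t bytes, std::uint64_t rows) {
    return static_cast<double>(bytes) / static_cast<double>(rows);
}

// 45.19 GB before the fix, 9.89 GB after: roughly a 78% reduction.
constexpr double kReduction = reduction_fraction(45.19, 9.89);

// Peak of 48519793889 B over 701398908 input rows: roughly 69 bytes per row.
constexpr double kPeakBytesPerRow = bytes_per_row(48519793889ULL, 701398908ULL);
```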

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@cambyzju

run buildall

@doris-robot

TPC-H: Total hot run time: 39671 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 079613ff5f82aa78f163edec6ff1d07b1f456a12, data reload: false

------ Round 1 ----------------------------------
q1	17986	4569	4372	4372
q2	2352	183	185	183
q3	10514	1106	1060	1060
q4	10195	826	710	710
q5	7485	2696	2593	2593
q6	221	135	132	132
q7	954	595	610	595
q8	9231	2105	2083	2083
q9	8930	6493	6516	6493
q10	8902	3760	3688	3688
q11	489	231	233	231
q12	400	233	229	229
q13	17768	2994	2975	2975
q14	277	222	217	217
q15	516	483	472	472
q16	518	379	375	375
q17	969	679	674	674
q18	8111	7496	7332	7332
q19	6701	1430	1567	1430
q20	685	321	304	304
q21	4893	3185	4040	3185
q22	413	338	350	338
Total cold run time: 118510 ms
Total hot run time: 39671 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4369	4260	4229	4229
q2	371	268	269	268
q3	3014	2753	2737	2737
q4	1932	1649	1621	1621
q5	5307	5267	5300	5267
q6	223	126	125	125
q7	2169	1753	1753	1753
q8	3192	3346	3319	3319
q9	8345	8383	8405	8383
q10	3896	3687	3614	3614
q11	566	475	482	475
q12	768	610	593	593
q13	17382	3014	2967	2967
q14	287	275	265	265
q15	513	483	489	483
q16	479	414	415	414
q17	1777	1484	1447	1447
q18	7614	7549	7266	7266
q19	1710	1678	1577	1577
q20	1999	1782	1766	1766
q21	4940	4682	4606	4606
q22	612	565	582	565
Total cold run time: 71465 ms
Total hot run time: 53740 ms

@doris-robot

TPC-DS: Total hot run time: 171273 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 079613ff5f82aa78f163edec6ff1d07b1f456a12, data reload: false

query1	915	379	369	369
query2	6457	2490	2367	2367
query3	6642	206	231	206
query4	19797	17222	17526	17222
query5	4169	470	494	470
query6	245	178	177	177
query7	4595	291	285	285
query8	304	297	308	297
query9	8520	2451	2419	2419
query10	613	286	268	268
query11	10734	10200	10130	10130
query12	129	77	81	77
query13	1636	384	369	369
query14	9416	7853	7823	7823
query15	244	190	189	189
query16	7899	272	268	268
query17	1893	555	528	528
query18	2006	283	279	279
query19	198	179	164	164
query20	92	84	83	83
query21	210	124	127	124
query22	4672	4330	4195	4195
query23	33829	32938	33077	32938
query24	11956	2862	2836	2836
query25	653	357	359	357
query26	1814	146	149	146
query27	3069	319	320	319
query28	7654	2098	2086	2086
query29	1114	618	607	607
query30	288	148	149	148
query31	962	734	743	734
query32	93	50	51	50
query33	767	274	282	274
query34	1002	461	482	461
query35	723	634	619	619
query36	1110	947	936	936
query37	295	69	71	69
query38	2866	2791	2708	2708
query39	861	795	815	795
query40	284	126	125	125
query41	52	54	78	54
query42	115	95	99	95
query43	609	565	554	554
query44	1266	739	766	739
query45	196	165	164	164
query46	1070	724	712	712
query47	1847	1754	1781	1754
query48	363	291	295	291
query49	1197	409	410	409
query50	778	382	383	382
query51	6931	6754	6698	6698
query52	106	93	94	93
query53	366	293	289	289
query54	1000	456	451	451
query55	75	75	74	74
query56	284	259	269	259
query57	1147	1031	1058	1031
query58	253	236	250	236
query59	3607	3224	3636	3224
query60	320	272	273	272
query61	92	89	93	89
query62	649	455	437	437
query63	335	292	290	290
query64	9895	2202	1750	1750
query65	3185	3127	3102	3102
query66	1409	337	324	324
query67	15451	15035	14985	14985
query68	4587	542	551	542
query69	549	486	378	378
query70	1186	1076	1059	1059
query71	396	281	281	281
query72	7586	2759	2640	2640
query73	735	334	327	327
query74	5881	5512	5473	5473
query75	3428	2667	2656	2656
query76	2849	910	909	909
query77	664	309	305	305
query78	10364	9854	9754	9754
query79	2124	515	515	515
query80	1112	467	478	467
query81	594	223	218	218
query82	790	102	100	100
query83	263	165	166	165
query84	246	84	97	84
query85	1933	298	324	298
query86	493	313	314	313
query87	3229	3089	3091	3089
query88	4011	2385	2367	2367
query89	468	399	381	381
query90	1799	194	185	185
query91	129	97	99	97
query92	60	49	48	48
query93	2517	523	521	521
query94	1285	188	189	188
query95	405	314	311	311
query96	600	274	268	268
query97	3262	3079	3028	3028
query98	212	199	191	191
query99	1264	843	840	840
Total cold run time: 278345 ms
Total hot run time: 171273 ms

@doris-robot

ClickBench: Total hot run time: 30.82 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 079613ff5f82aa78f163edec6ff1d07b1f456a12, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.21	0.05	0.06
query4	1.68	0.07	0.06
query5	0.51	0.49	0.49
query6	1.12	0.73	0.73
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.54	0.48	0.48
query10	0.54	0.53	0.53
query11	0.15	0.11	0.12
query12	0.14	0.11	0.13
query13	0.59	0.58	0.57
query14	0.77	0.78	0.77
query15	0.83	0.81	0.82
query16	0.35	0.35	0.36
query17	1.04	1.05	1.02
query18	0.22	0.22	0.26
query19	1.92	1.72	1.71
query20	0.01	0.01	0.01
query21	15.45	0.68	0.67
query22	4.32	6.89	2.21
query23	18.30	1.32	1.24
query24	1.91	0.25	0.23
query25	0.16	0.09	0.09
query26	0.26	0.18	0.17
query27	0.08	0.09	0.08
query28	13.26	1.02	0.99
query29	12.57	3.37	3.32
query30	0.25	0.06	0.05
query31	2.85	0.38	0.38
query32	3.27	0.47	0.46
query33	2.91	2.90	2.93
query34	17.05	4.41	4.37
query35	4.47	4.46	4.45
query36	0.65	0.47	0.46
query37	0.18	0.16	0.16
query38	0.15	0.15	0.15
query39	0.05	0.03	0.04
query40	0.18	0.14	0.14
query41	0.09	0.04	0.05
query42	0.06	0.05	0.04
query43	0.04	0.04	0.04
Total cold run time: 109.32 s
Total hot run time: 30.82 s

// rowid conversion correctness check when compaction for mow table
DEFINE_mBool(enable_rowid_conversion_correctness_check, "false");
// missing rows correctness check when compaction for mow table
DEFINE_mBool(enable_missing_rows_correctness_check, "true");

@zhannngchen zhannngchen Jun 27, 2024

Since the missing-rows check is costly for large datasets, we can disable it by default.

@cambyzju

done

@zhannngchen zhannngchen left a comment

LGTM

@github-actions

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 27, 2024
@github-actions

PR approved by anyone and no changes requested.

@dataroaring dataroaring left a comment

LGTM

@dataroaring

run buildall

@doris-robot

TPC-H: Total hot run time: 39953 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a560153f4ccc888149dea10f01bddbc791b7c889, data reload: false

------ Round 1 ----------------------------------
q1	17604	4423	4275	4275
q2	2034	192	199	192
q3	10479	1169	1092	1092
q4	10184	767	810	767
q5	7468	2682	2617	2617
q6	212	133	132	132
q7	957	585	614	585
q8	9541	2129	2131	2129
q9	9009	6538	6503	6503
q10	8924	3738	3796	3738
q11	466	235	237	235
q12	488	232	230	230
q13	17843	2966	3010	2966
q14	261	232	229	229
q15	522	473	490	473
q16	521	383	386	383
q17	984	652	737	652
q18	8103	7563	7370	7370
q19	5735	1441	1529	1441
q20	640	341	322	322
q21	4905	3279	3986	3279
q22	394	343	346	343
Total cold run time: 117274 ms
Total hot run time: 39953 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4390	4290	4227	4227
q2	377	281	264	264
q3	3007	2798	2722	2722
q4	1879	1614	1636	1614
q5	5262	5267	5264	5264
q6	218	128	130	128
q7	2135	1739	1757	1739
q8	3181	3380	3320	3320
q9	8349	8342	8308	8308
q10	3866	3697	3670	3670
q11	581	489	509	489
q12	779	646	614	614
q13	17436	2975	3008	2975
q14	299	284	259	259
q15	529	488	491	488
q16	473	410	420	410
q17	1791	1511	1478	1478
q18	7675	7614	7483	7483
q19	1696	1461	1548	1461
q20	2025	1766	1799	1766
q21	4856	4764	4724	4724
q22	628	550	541	541
Total cold run time: 71432 ms
Total hot run time: 53944 ms

@doris-robot

TPC-DS: Total hot run time: 170136 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a560153f4ccc888149dea10f01bddbc791b7c889, data reload: false

query1	920	390	371	371
query2	6448	2451	2483	2451
query3	6647	202	213	202
query4	20653	17554	17460	17460
query5	4207	455	475	455
query6	262	169	165	165
query7	4597	301	286	286
query8	306	299	275	275
query9	8469	2415	2409	2409
query10	604	281	284	281
query11	10458	9884	10158	9884
query12	126	85	101	85
query13	1634	351	352	351
query14	10178	7619	6752	6752
query15	247	190	188	188
query16	7873	272	260	260
query17	1914	543	519	519
query18	1992	271	266	266
query19	201	146	148	146
query20	87	81	80	80
query21	214	132	121	121
query22	4321	4154	4042	4042
query23	33714	33289	33142	33142
query24	11815	2832	2854	2832
query25	651	356	357	356
query26	1738	156	149	149
query27	2960	319	313	313
query28	7623	2072	2061	2061
query29	1043	626	602	602
query30	286	154	149	149
query31	932	720	748	720
query32	93	52	54	52
query33	788	299	331	299
query34	903	459	478	459
query35	729	620	603	603
query36	1079	937	912	912
query37	162	70	67	67
query38	2884	2764	2782	2764
query39	882	787	815	787
query40	278	125	120	120
query41	58	52	52	52
query42	123	101	100	100
query43	596	541	567	541
query44	1214	720	719	719
query45	195	170	163	163
query46	1071	722	707	707
query47	1830	1777	1762	1762
query48	369	291	312	291
query49	1135	404	412	404
query50	781	386	384	384
query51	6956	6734	6680	6680
query52	104	88	93	88
query53	358	286	291	286
query54	950	467	455	455
query55	74	77	72	72
query56	300	286	269	269
query57	1127	1098	1052	1052
query58	264	247	251	247
query59	3517	3235	3086	3086
query60	308	286	268	268
query61	92	90	89	89
query62	663	462	436	436
query63	320	282	284	282
query64	9806	2260	1727	1727
query65	3158	3280	3136	3136
query66	1380	337	323	323
query67	15626	15094	15140	15094
query68	4633	535	548	535
query69	586	522	374	374
query70	1192	1066	1177	1066
query71	420	280	269	269
query72	7364	2760	2615	2615
query73	746	321	324	321
query74	5888	5544	5487	5487
query75	3542	2670	2662	2662
query76	3156	965	941	941
query77	628	297	321	297
query78	10227	9797	9730	9730
query79	2219	515	506	506
query80	1787	471	468	468
query81	572	228	220	220
query82	841	106	105	105
query83	288	170	169	169
query84	258	87	84	84
query85	1232	276	270	270
query86	440	332	335	332
query87	3275	3062	3068	3062
query88	3720	2464	2438	2438
query89	474	369	379	369
query90	1767	192	190	190
query91	128	102	97	97
query92	67	49	48	48
query93	2198	515	495	495
query94	1234	186	182	182
query95	397	307	310	307
query96	585	276	273	273
query97	3254	3052	3044	3044
query98	229	206	192	192
query99	1252	852	839	839
Total cold run time: 277892 ms
Total hot run time: 170136 ms

@doris-robot

ClickBench: Total hot run time: 30.8 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a560153f4ccc888149dea10f01bddbc791b7c889, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.04
query3	0.23	0.04	0.04
query4	1.68	0.08	0.08
query5	0.51	0.50	0.48
query6	1.13	0.72	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.05
query9	0.54	0.50	0.48
query10	0.56	0.55	0.55
query11	0.15	0.11	0.12
query12	0.14	0.12	0.12
query13	0.59	0.58	0.60
query14	0.77	0.79	0.77
query15	0.84	0.80	0.80
query16	0.37	0.37	0.36
query17	0.97	0.99	0.94
query18	0.22	0.27	0.20
query19	1.76	1.68	1.69
query20	0.01	0.01	0.02
query21	15.48	0.77	0.66
query22	4.54	6.50	2.32
query23	18.27	1.34	1.22
query24	2.16	0.22	0.23
query25	0.16	0.09	0.09
query26	0.26	0.19	0.18
query27	0.08	0.08	0.08
query28	13.21	1.03	0.99
query29	12.64	3.33	3.31
query30	0.25	0.07	0.05
query31	2.86	0.38	0.39
query32	3.27	0.48	0.47
query33	2.86	2.97	2.89
query34	17.06	4.45	4.43
query35	4.48	4.44	4.47
query36	0.65	0.45	0.47
query37	0.18	0.15	0.15
query38	0.16	0.15	0.13
query39	0.04	0.03	0.04
query40	0.18	0.14	0.14
query41	0.08	0.05	0.04
query42	0.05	0.06	0.05
query43	0.05	0.04	0.04
Total cold run time: 109.63 s
Total hot run time: 30.8 s

@xy720 xy720 left a comment

LGTM

@cambyzju

run p0

@zhannngchen

run p0

@cambyzju cambyzju merged commit 1b1029f into apache:master Jun 28, 2024
zhannngchen pushed a commit that referenced this pull request Jul 1, 2024
zhannngchen pushed a commit that referenced this pull request Jul 1, 2024
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024
zhannngchen added a commit that referenced this pull request Nov 11, 2024
#43502)

Related PR: #36865

Problem Summary:

#36865 reduced the memory cost of compactions on MoW tables.
However, when the code was merged for cloud, that optimization was not
applied to cloud compaction.
We found several cases where compaction of a MoW table consumes a lot of
memory on cloud; this PR tries to fix the issue.

Co-authored-by: Chen Zhang <zhangchen@selectdb.com>
zhannngchen added a commit to zhannngchen/incubator-doris that referenced this pull request Nov 11, 2024
zzzxl1993 pushed a commit to zzzxl1993/doris that referenced this pull request Nov 12, 2024
924060929 pushed a commit to 924060929/incubator-doris that referenced this pull request Nov 12, 2024
py023 pushed a commit to py023/doris that referenced this pull request Nov 13, 2024