Conversation

@sollhui
Contributor

@sollhui sollhui commented Jan 16, 2025

What problem does this PR solve?

Problem Summary:

image
The memory usage of the FE observer node increases over time and does not drop significantly after a full GC is triggered.

image
On a follower node, when a bulk load job is replayed, a key-value pair (jobId, job) is added to CallBackFactory to record the job, but removeCallback is never called for it, which causes a memory leak on the FE observer node.
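
For illustration only, here is a minimal Java sketch of the leak pattern described above. It is not the Doris code: `CallbackRegistrySketch`, `ReplayLeakSketch`, and `replayFinishedJobs` are hypothetical names, and the sketch assumes the registry is a map keyed by job ID and that the remedy is to remove the entry once a replayed job no longer needs it. The actual change in this PR may differ in detail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the callback factory named in the description;
// this is NOT the actual Doris class.
final class CallbackRegistrySketch {
    private final Map<Long, Object> callbacks = new ConcurrentHashMap<>();

    // Registers a job under its ID; an existing entry is kept.
    void addCallback(long jobId, Object job) {
        callbacks.putIfAbsent(jobId, job);
    }

    // Drops the entry so the job object can be garbage collected.
    void removeCallback(long jobId) {
        callbacks.remove(jobId);
    }

    int size() {
        return callbacks.size();
    }
}

final class ReplayLeakSketch {
    // Replays jobCount finished jobs and returns how many entries stay registered.
    // With removeOnFinish == false the registry grows by one entry per replayed job.
    static int replayFinishedJobs(int jobCount, boolean removeOnFinish) {
        CallbackRegistrySketch registry = new CallbackRegistrySketch();
        for (long jobId = 0; jobId < jobCount; jobId++) {
            Object job = new Object(); // placeholder for a replayed bulk load job
            registry.addCallback(jobId, job);
            if (removeOnFinish) {
                registry.removeCallback(jobId);
            }
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("entries left without removal: " + replayFinishedJobs(1000, false));
        System.out.println("entries left with removal:    " + replayFinishedJobs(1000, true));
    }
}
```

Because the map holds a strong reference to every registered job, a full GC cannot reclaim those objects until the entry is removed, which is consistent with the memory growth shown in the first screenshot.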

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jan 16, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Contributor Author

sollhui commented Jan 16, 2025

run buildall

@sollhui
Contributor Author

sollhui commented Jan 16, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32464 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit cb7122fa7a5dacfdfc901a071a75ea705ff960cf, data reload: false

------ Round 1 ----------------------------------
q1	17584	5495	5402	5402
q2	2053	301	168	168
q3	10538	1225	756	756
q4	10202	969	534	534
q5	7513	2402	2155	2155
q6	192	165	136	136
q7	923	756	606	606
q8	9248	1339	1160	1160
q9	5239	4907	4884	4884
q10	6872	2347	1900	1900
q11	470	270	246	246
q12	344	373	221	221
q13	17806	3714	3195	3195
q14	236	225	218	218
q15	531	473	463	463
q16	622	624	576	576
q17	578	866	325	325
q18	6787	6429	6589	6429
q19	3521	972	542	542
q20	311	329	196	196
q21	2918	2234	2035	2035
q22	373	350	317	317
Total cold run time: 104861 ms
Total hot run time: 32464 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5622	5513	5407	5407
q2	247	341	245	245
q3	2247	2687	2339	2339
q4	1475	1861	1393	1393
q5	4349	4765	4723	4723
q6	170	158	127	127
q7	2077	1976	1806	1806
q8	2644	2789	2737	2737
q9	7333	7273	7334	7273
q10	3014	3326	2710	2710
q11	593	522	495	495
q12	683	839	668	668
q13	3536	3907	3388	3388
q14	308	311	284	284
q15	526	477	471	471
q16	671	695	655	655
q17	1253	1730	1269	1269
q18	7811	7499	7414	7414
q19	830	1190	1052	1052
q20	2015	2040	1928	1928
q21	5782	5149	5025	5025
q22	638	597	607	597
Total cold run time: 53824 ms
Total hot run time: 52006 ms

@doris-robot

TPC-DS: Total hot run time: 195019 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit cb7122fa7a5dacfdfc901a071a75ea705ff960cf, data reload: false

query1	1350	994	946	946
query2	6292	2003	2018	2003
query3	11068	4678	4508	4508
query4	61210	31403	23219	23219
query5	5079	597	464	464
query6	417	213	192	192
query7	5491	510	298	298
query8	338	266	256	256
query9	8045	2641	2656	2641
query10	432	326	270	270
query11	15927	15474	15684	15474
query12	173	111	108	108
query13	1417	560	434	434
query14	11015	7341	6644	6644
query15	224	212	207	207
query16	7343	649	522	522
query17	1169	725	563	563
query18	1941	407	304	304
query19	189	182	180	180
query20	116	123	110	110
query21	212	126	109	109
query22	4625	4809	4510	4510
query23	34206	33136	33441	33136
query24	5645	2327	2310	2310
query25	498	466	399	399
query26	638	284	161	161
query27	1783	477	331	331
query28	4036	2494	2456	2456
query29	544	575	439	439
query30	220	192	173	173
query31	927	914	825	825
query32	73	59	58	58
query33	397	356	304	304
query34	758	852	525	525
query35	802	837	770	770
query36	1009	1026	971	971
query37	126	99	82	82
query38	4322	4439	4272	4272
query39	1514	1437	1460	1437
query40	200	116	103	103
query41	53	54	49	49
query42	120	104	101	101
query43	517	528	493	493
query44	1374	831	832	831
query45	185	181	171	171
query46	876	1056	663	663
query47	1914	1902	1855	1855
query48	415	428	331	331
query49	710	491	396	396
query50	689	683	414	414
query51	7188	7013	7012	7012
query52	103	106	93	93
query53	227	260	191	191
query54	488	514	436	436
query55	85	83	84	83
query56	271	281	285	281
query57	1230	1222	1193	1193
query58	262	241	248	241
query59	2962	2968	2841	2841
query60	273	278	260	260
query61	128	114	112	112
query62	755	723	637	637
query63	218	189	187	187
query64	1321	1068	653	653
query65	3226	3203	3156	3156
query66	693	398	299	299
query67	15958	15614	15630	15614
query68	5550	851	531	531
query69	506	303	265	265
query70	1232	1188	1158	1158
query71	404	299	299	299
query72	6088	3881	3894	3881
query73	787	768	362	362
query74	10102	9166	8944	8944
query75	3259	3148	2652	2652
query76	3746	1200	774	774
query77	479	372	300	300
query78	10082	9915	9404	9404
query79	3651	820	583	583
query80	811	544	468	468
query81	495	289	234	234
query82	589	159	129	129
query83	173	172	152	152
query84	289	100	77	77
query85	784	351	355	351
query86	378	311	306	306
query87	4520	4560	4428	4428
query88	4670	2181	2181	2181
query89	406	330	286	286
query90	1679	190	193	190
query91	131	138	116	116
query92	66	60	55	55
query93	2773	884	534	534
query94	758	394	298	298
query95	335	266	259	259
query96	508	645	285	285
query97	2867	2845	2740	2740
query98	221	199	198	198
query99	1294	1405	1272	1272
Total cold run time: 313645 ms
Total hot run time: 195019 ms

@doris-robot

ClickBench: Total hot run time: 30.74 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit cb7122fa7a5dacfdfc901a071a75ea705ff960cf, data reload: false

query1	0.04	0.05	0.03
query2	0.07	0.03	0.03
query3	0.24	0.06	0.07
query4	1.63	0.10	0.10
query5	0.43	0.43	0.41
query6	1.15	0.66	0.66
query7	0.03	0.01	0.02
query8	0.04	0.04	0.03
query9	0.59	0.50	0.51
query10	0.58	0.56	0.55
query11	0.14	0.10	0.11
query12	0.14	0.11	0.11
query13	0.61	0.60	0.61
query14	2.84	2.83	2.74
query15	0.90	0.85	0.83
query16	0.39	0.38	0.39
query17	1.06	1.03	1.01
query18	0.23	0.22	0.21
query19	1.98	1.84	1.96
query20	0.02	0.01	0.02
query21	15.35	0.90	0.59
query22	0.75	0.93	0.94
query23	14.87	1.41	0.63
query24	3.10	0.47	0.89
query25	0.19	0.15	0.15
query26	0.40	0.14	0.15
query27	0.08	0.07	0.05
query28	13.40	1.13	0.45
query29	12.57	3.96	3.29
query30	0.25	0.09	0.06
query31	2.82	0.61	0.39
query32	3.23	0.55	0.47
query33	3.06	3.04	3.02
query34	16.63	5.25	4.55
query35	4.59	4.55	4.61
query36	0.65	0.48	0.48
query37	0.09	0.06	0.06
query38	0.04	0.03	0.03
query39	0.04	0.02	0.03
query40	0.18	0.13	0.13
query41	0.08	0.03	0.02
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 105.56 s
Total hot run time: 30.74 s

@dataroaring dataroaring added the usercase (Important user case type label), dev/3.0.x, and dev/2.1.x labels Jan 16, 2025
dataroaring
dataroaring previously approved these changes Jan 16, 2025
Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved (Indicates a PR has been approved by one committer.) label Jan 16, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

liaoxin01
liaoxin01 previously approved these changes Jan 16, 2025
@sollhui sollhui dismissed stale reviews from liaoxin01 and dataroaring via c6412e1 January 17, 2025 07:03
@sollhui
Contributor Author

sollhui commented Jan 17, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32500 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c6412e1c59576bf7c12d49af73bee44c8025c5e6, data reload: false

------ Round 1 ----------------------------------
q1	17571	5494	5420	5420
q2	2058	299	173	173
q3	10419	1236	747	747
q4	10236	954	535	535
q5	7925	2372	2145	2145
q6	194	165	133	133
q7	893	774	604	604
q8	9241	1339	1159	1159
q9	5225	4916	4842	4842
q10	6878	2318	1902	1902
q11	494	281	250	250
q12	345	357	227	227
q13	17790	3677	3096	3096
q14	235	245	216	216
q15	524	478	475	475
q16	638	626	592	592
q17	566	865	322	322
q18	6897	6423	6566	6423
q19	2207	942	568	568
q20	328	327	212	212
q21	3120	2299	2140	2140
q22	387	359	319	319
Total cold run time: 104171 ms
Total hot run time: 32500 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5624	5433	5436	5433
q2	245	335	241	241
q3	2335	2687	2476	2476
q4	1461	1857	1416	1416
q5	4456	4872	4855	4855
q6	176	163	130	130
q7	2135	1987	1861	1861
q8	2613	2815	2739	2739
q9	7299	7179	7281	7179
q10	3006	3295	2808	2808
q11	579	511	482	482
q12	631	781	577	577
q13	3588	3924	3368	3368
q14	277	302	299	299
q15	543	481	481	481
q16	637	688	656	656
q17	1233	1752	1286	1286
q18	7736	7532	7250	7250
q19	860	1128	1142	1128
q20	1985	2049	1887	1887
q21	5794	5085	5079	5079
q22	599	588	569	569
Total cold run time: 53812 ms
Total hot run time: 52200 ms

@doris-robot

TPC-DS: Total hot run time: 194307 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c6412e1c59576bf7c12d49af73bee44c8025c5e6, data reload: false

query1	1301	942	949	942
query2	6508	2095	2009	2009
query3	11037	4371	4538	4371
query4	60101	28438	23087	23087
query5	5706	626	463	463
query6	434	216	191	191
query7	5596	515	311	311
query8	319	240	230	230
query9	8537	2588	2573	2573
query10	485	309	259	259
query11	17477	15186	15430	15186
query12	174	116	120	116
query13	1500	552	417	417
query14	11117	7181	6774	6774
query15	232	216	192	192
query16	7395	661	508	508
query17	1207	733	601	601
query18	1908	417	320	320
query19	202	223	162	162
query20	122	125	111	111
query21	215	129	105	105
query22	4399	4862	4449	4449
query23	34037	33326	33457	33326
query24	5462	2371	2321	2321
query25	482	473	408	408
query26	655	287	157	157
query27	1803	480	347	347
query28	4140	2457	2426	2426
query29	528	596	421	421
query30	224	198	157	157
query31	944	887	819	819
query32	72	59	59	59
query33	451	372	318	318
query34	743	858	529	529
query35	810	846	768	768
query36	1033	1054	966	966
query37	153	96	80	80
query38	4291	4343	4330	4330
query39	1513	1450	1449	1449
query40	201	118	103	103
query41	52	50	50	50
query42	128	109	106	106
query43	516	541	511	511
query44	1354	812	830	812
query45	184	173	170	170
query46	901	1079	678	678
query47	1918	1919	1867	1867
query48	382	423	316	316
query49	730	494	402	402
query50	658	686	423	423
query51	7109	6920	7079	6920
query52	101	104	92	92
query53	235	266	192	192
query54	506	511	418	418
query55	85	89	80	80
query56	272	278	270	270
query57	1266	1217	1158	1158
query58	246	242	231	231
query59	3166	3342	2914	2914
query60	279	282	254	254
query61	123	116	116	116
query62	734	703	643	643
query63	228	203	193	193
query64	1271	1058	650	650
query65	3294	3161	3169	3161
query66	730	402	305	305
query67	16073	15541	15430	15430
query68	5044	845	521	521
query69	519	301	272	272
query70	1219	1159	1128	1128
query71	415	294	255	255
query72	6393	3821	3971	3821
query73	818	770	355	355
query74	10023	9049	9101	9049
query75	3207	3190	2762	2762
query76	3796	1208	794	794
query77	486	368	277	277
query78	10121	10036	9333	9333
query79	2456	861	610	610
query80	1133	545	439	439
query81	537	274	236	236
query82	356	157	134	134
query83	223	173	159	159
query84	286	89	86	86
query85	752	356	294	294
query86	427	320	304	304
query87	4499	4577	4348	4348
query88	3773	2180	2145	2145
query89	412	340	296	296
query90	1624	191	192	191
query91	131	134	107	107
query92	71	59	59	59
query93	2710	874	539	539
query94	766	411	300	300
query95	331	302	255	255
query96	476	616	284	284
query97	2832	2875	2778	2778
query98	217	213	194	194
query99	1276	1364	1259	1259
Total cold run time: 313384 ms
Total hot run time: 194307 ms

@doris-robot

ClickBench: Total hot run time: 31.35 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c6412e1c59576bf7c12d49af73bee44c8025c5e6, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.03
query3	0.24	0.07	0.07
query4	1.62	0.10	0.11
query5	0.43	0.44	0.40
query6	1.16	0.65	0.65
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.59	0.51	0.49
query10	0.54	0.56	0.55
query11	0.14	0.10	0.09
query12	0.13	0.11	0.11
query13	0.61	0.60	0.60
query14	2.73	2.74	2.72
query15	0.89	0.84	0.83
query16	0.36	0.38	0.39
query17	0.96	1.03	1.09
query18	0.23	0.21	0.20
query19	2.00	1.82	2.03
query20	0.01	0.01	0.01
query21	15.38	0.93	0.60
query22	0.75	0.81	0.71
query23	15.22	1.51	0.64
query24	2.94	1.53	2.57
query25	0.21	0.15	0.10
query26	0.19	0.14	0.14
query27	0.06	0.04	0.06
query28	14.58	0.99	0.43
query29	12.61	3.96	3.28
query30	0.25	0.09	0.06
query31	2.83	0.58	0.39
query32	3.24	0.54	0.46
query33	2.98	2.98	3.02
query34	16.52	5.19	4.55
query35	4.52	4.55	4.51
query36	0.81	0.46	0.48
query37	0.09	0.06	0.06
query38	0.04	0.03	0.03
query39	0.04	0.02	0.02
query40	0.17	0.13	0.13
query41	0.08	0.02	0.02
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.4 s
Total hot run time: 31.35 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 89158f0 into apache:master Jan 20, 2025
26 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 20, 2025
![image](https://github.com/user-attachments/assets/ad83cf90-e148-414c-9278-2cad8f2cd9ed)
The memory usage of the FE observer node increases over time and does
not drop significantly after a full GC is triggered.


![image](https://github.com/user-attachments/assets/ac607b0c-73d1-44bc-a0ef-1a0e91c910fe)
On a follower node, when a bulk load job is replayed, a key-value pair
(jobId, job) is added to `CallBackFactory` to record the job, but
`removeCallback` is never called for it, which causes a memory leak on
the FE observer node.
dataroaring pushed a commit that referenced this pull request Jan 23, 2025
#47074 (#47244)

Cherry-picked from #47074

Co-authored-by: hui lai <laihui@selectdb.com>
lzyy2024 pushed a commit to lzyy2024/doris that referenced this pull request Feb 21, 2025
…7074)

![image](https://github.com/user-attachments/assets/ad83cf90-e148-414c-9278-2cad8f2cd9ed)
The memory usage of the FE observer node increases over time and does
not drop significantly after a full GC is triggered.


![image](https://github.com/user-attachments/assets/ac607b0c-73d1-44bc-a0ef-1a0e91c910fe)
On a follower node, when a bulk load job is replayed, a key-value pair
(jobId, job) is added to `CallBackFactory` to record the job, but
`removeCallback` is never called for it, which causes a memory leak on
the FE observer node.

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.0.4-merged, reviewed, usercase (Important user case type label)

5 participants