@yujun777
Contributor

When the partition rebalancer chooses candidate tablets, it calls tabletListOfA.removeAll(tabletListOfB). With two lists, removeAll runs in O(n^2) time, since every element of the first list triggers a linear scan of the second. If each BE holds 100,000+ tablets, this is quite slow, and we found an online case where the tablet scheduler thread was kept busy by this call.

So this search needs to be improved.
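
For illustration only, since the actual diff is not shown in this conversation: a minimal Java sketch of one common way to avoid the quadratic cost, assuming tablet ids are Long values. The class and method names below are hypothetical and are not taken from the Doris code base. The idea is to copy the list being subtracted into a HashSet, so each membership check is O(1) on average instead of a linear scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names are illustrative, not from the PR.
final class TabletDiffSketch {
    // Quadratic version: for each id in a, removeAll calls b.contains(id),
    // and contains on a List is a linear scan.
    static List<Long> differenceSlow(List<Long> a, List<Long> b) {
        List<Long> result = new ArrayList<>(a);
        result.removeAll(b);
        return result;
    }

    // Roughly linear version: build a HashSet from b once, then filter a
    // with average O(1) lookups. Order of a is preserved.
    static List<Long> differenceFast(List<Long> a, List<Long> b) {
        Set<Long> exclude = new HashSet<>(b);
        List<Long> result = new ArrayList<>(a.size());
        for (Long id : a) {
            if (!exclude.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> a = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        List<Long> b = Arrays.asList(2L, 4L);
        System.out.println(differenceSlow(a, b)); // prints [1, 3, 5]
        System.out.println(differenceFast(a, b)); // prints [1, 3, 5]
    }
}
```

Both methods return the same result; the second trades one extra HashSet of size |b| for roughly O(|a| + |b|) total work.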

Proposed changes

Issue Number: close #xxx

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@yujun777
Contributor Author

run buildall

Contributor

@deardeng deardeng left a comment

LGTM

@github-actions
Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 40085 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 444d4c370ed015bdfa8b5ea27a35f29a3113b542, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17946	4516	4668	4516
q2	2661	189	191	189
q3	12321	1151	1023	1023
q4	10524	745	805	745
q5	7826	2781	2748	2748
q6	230	138	139	138
q7	994	602	601	601
q8	9228	2091	2071	2071
q9	9005	6512	6509	6509
q10	8981	3708	3724	3708
q11	471	243	238	238
q12	431	238	225	225
q13	17766	2991	3001	2991
q14	272	226	213	213
q15	511	481	485	481
q16	508	393	377	377
q17	978	620	748	620
q18	8097	7524	7356	7356
q19	4813	1453	1468	1453
q20	670	321	332	321
q21	5153	3223	3916	3223
q22	400	339	347	339
Total cold run time: 119786 ms
Total hot run time: 40085 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4349	4208	4196	4196
q2	375	257	270	257
q3	2976	2754	2753	2753
q4	1876	1634	1622	1622
q5	5267	5255	5302	5255
q6	214	126	130	126
q7	2152	1709	1691	1691
q8	3209	3367	3340	3340
q9	8348	8375	8302	8302
q10	3862	3692	3658	3658
q11	576	481	480	480
q12	766	608	588	588
q13	16535	2967	2978	2967
q14	288	254	254	254
q15	532	475	466	466
q16	454	406	430	406
q17	1771	1491	1481	1481
q18	7620	7696	7460	7460
q19	2240	1628	1614	1614
q20	1964	1815	1740	1740
q21	4851	4630	4743	4630
q22	611	536	550	536
Total cold run time: 70836 ms
Total hot run time: 53822 ms

@doris-robot

TPC-DS: Total hot run time: 173237 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 444d4c370ed015bdfa8b5ea27a35f29a3113b542, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	926	395	386	386
query2	6465	2371	2411	2371
query3	6651	206	210	206
query4	18879	17455	17302	17302
query5	4173	488	508	488
query6	246	162	180	162
query7	4596	293	300	293
query8	337	292	292	292
query9	8396	2410	2374	2374
query10	592	295	299	295
query11	10442	10143	10136	10136
query12	135	88	88	88
query13	1628	360	355	355
query14	9686	7694	7691	7691
query15	261	197	199	197
query16	7999	266	262	262
query17	1907	534	531	531
query18	2025	277	271	271
query19	191	177	160	160
query20	94	89	81	81
query21	213	130	124	124
query22	4259	3999	4151	3999
query23	33659	33112	33099	33099
query24	11153	2821	2838	2821
query25	636	359	359	359
query26	1406	155	156	155
query27	3006	333	339	333
query28	7196	2042	2051	2042
query29	908	638	621	621
query30	288	147	154	147
query31	943	726	756	726
query32	91	58	56	56
query33	777	287	296	287
query34	940	479	489	479
query35	742	607	637	607
query36	1089	959	956	956
query37	155	74	74	74
query38	2881	2696	2741	2696
query39	858	795	803	795
query40	220	133	129	129
query41	65	56	53	53
query42	129	104	107	104
query43	594	572	553	553
query44	1192	731	741	731
query45	206	167	169	167
query46	1095	711	739	711
query47	1842	1729	1779	1729
query48	372	306	340	306
query49	1066	441	411	411
query50	770	386	392	386
query51	6778	6755	6609	6609
query52	100	95	93	93
query53	365	291	291	291
query54	860	440	439	439
query55	77	73	72	72
query56	277	255	265	255
query57	1148	1067	1048	1048
query58	262	249	253	249
query59	3524	3130	3264	3130
query60	291	287	278	278
query61	95	100	127	100
query62	642	441	434	434
query63	315	296	303	296
query64	8930	2249	1714	1714
query65	3306	3152	3120	3120
query66	1391	335	347	335
query67	15268	14994	14897	14897
query68	4614	549	542	542
query69	466	308	329	308
query70	1132	1155	1091	1091
query71	398	291	292	291
query72	7267	5576	5120	5120
query73	764	326	325	325
query74	5874	5508	5403	5403
query75	3384	2661	2669	2661
query76	2421	992	909	909
query77	452	315	301	301
query78	10299	9737	9635	9635
query79	2616	524	520	520
query80	1702	470	474	470
query81	573	222	227	222
query82	833	113	113	113
query83	348	169	169	169
query84	274	88	90	88
query85	1832	284	265	265
query86	489	303	335	303
query87	3294	3116	3138	3116
query88	3957	2347	2342	2342
query89	477	397	400	397
query90	1801	194	196	194
query91	133	101	98	98
query92	65	51	54	51
query93	2515	524	509	509
query94	1240	190	189	189
query95	414	314	328	314
query96	608	266	265	265
query97	3254	3073	3027	3027
query98	230	194	198	194
query99	1298	855	848	848
Total cold run time: 273033 ms
Total hot run time: 173237 ms

@doris-robot

ClickBench: Total hot run time: 30.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 444d4c370ed015bdfa8b5ea27a35f29a3113b542, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.22	0.05	0.06
query4	1.68	0.07	0.06
query5	0.48	0.47	0.49
query6	1.12	0.73	0.74
query7	0.02	0.02	0.01
query8	0.05	0.04	0.04
query9	0.54	0.48	0.49
query10	0.55	0.58	0.54
query11	0.16	0.11	0.12
query12	0.15	0.12	0.12
query13	0.58	0.59	0.60
query14	0.76	0.77	0.78
query15	0.85	0.81	0.82
query16	0.35	0.35	0.37
query17	1.03	0.97	0.97
query18	0.23	0.23	0.27
query19	1.80	1.74	1.68
query20	0.01	0.01	0.00
query21	15.41	0.68	0.66
query22	4.03	7.85	1.84
query23	18.28	1.39	1.25
query24	2.16	0.23	0.21
query25	0.15	0.08	0.07
query26	0.25	0.18	0.17
query27	0.09	0.08	0.09
query28	13.22	1.01	1.02
query29	12.61	3.26	3.27
query30	0.25	0.06	0.06
query31	2.88	0.40	0.39
query32	3.25	0.47	0.47
query33	2.90	2.88	2.85
query34	17.28	4.34	4.43
query35	4.42	4.45	4.54
query36	0.64	0.47	0.47
query37	0.18	0.15	0.15
query38	0.16	0.15	0.14
query39	0.04	0.03	0.04
query40	0.18	0.14	0.14
query41	0.09	0.05	0.04
query42	0.06	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 109.27 s
Total hot run time: 30.25 s

dataroaring previously approved these changes Jun 19, 2024
Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Jun 19, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@yujun777
Contributor Author

run buildall

@yujun777 yujun777 marked this pull request as draft June 19, 2024 06:55
@github-actions github-actions bot removed the "approved" label (Indicates a PR has been approved by one committer.) Jun 19, 2024
@yujun777 yujun777 force-pushed the improve-partition-rebalance branch from 58591e6 to 3cd9670 Compare June 27, 2024 03:41
@yujun777
Contributor Author

run buildall

@yujun777 yujun777 marked this pull request as ready for review June 27, 2024 03:47
@yujun777
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 39837 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 70ef4c0ef5f8c39d95ef242c382c1bfd22506b90, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17614	4342	4244	4244
q2	2014	187	190	187
q3	10480	1221	1166	1166
q4	10184	808	969	808
q5	7493	2647	2724	2647
q6	218	138	138	138
q7	954	609	611	609
q8	9223	2062	2043	2043
q9	8818	6444	6437	6437
q10	9013	3684	3726	3684
q11	463	234	235	234
q12	447	245	230	230
q13	17765	2983	3017	2983
q14	263	213	225	213
q15	521	485	484	484
q16	512	378	373	373
q17	950	703	745	703
q18	8005	7460	7367	7367
q19	7140	1427	1482	1427
q20	639	306	326	306
q21	4875	3213	3951	3213
q22	391	341	362	341
Total cold run time: 117982 ms
Total hot run time: 39837 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4401	4257	4214	4214
q2	368	261	275	261
q3	2967	2973	3007	2973
q4	2012	1690	1744	1690
q5	5496	5515	5501	5501
q6	220	130	129	129
q7	2268	1935	1858	1858
q8	3331	3421	3400	3400
q9	8750	8785	8736	8736
q10	4223	3706	3768	3706
q11	588	507	504	504
q12	797	664	650	650
q13	16073	3237	3186	3186
q14	311	269	274	269
q15	526	503	478	478
q16	485	439	461	439
q17	1840	1512	1488	1488
q18	8176	8036	7659	7659
q19	1781	1587	1626	1587
q20	2380	1906	1851	1851
q21	5030	4943	4850	4850
q22	631	575	575	575
Total cold run time: 72654 ms
Total hot run time: 56004 ms

@doris-robot

TPC-DS: Total hot run time: 171620 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 70ef4c0ef5f8c39d95ef242c382c1bfd22506b90, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	922	395	376	376
query2	6447	2634	2285	2285
query3	6636	210	214	210
query4	19983	17574	17165	17165
query5	3582	471	492	471
query6	263	184	169	169
query7	4582	289	293	289
query8	318	301	297	297
query9	8457	2339	2321	2321
query10	557	293	274	274
query11	10578	10110	10065	10065
query12	116	83	83	83
query13	1630	363	363	363
query14	10207	7567	7572	7567
query15	254	185	186	185
query16	8008	263	264	263
query17	1900	542	511	511
query18	2095	274	268	268
query19	194	147	152	147
query20	87	86	81	81
query21	213	131	126	126
query22	4375	4043	3949	3949
query23	33848	33620	33546	33546
query24	10930	2800	2876	2800
query25	588	382	373	373
query26	1025	159	157	157
query27	2309	329	322	322
query28	6568	2081	2067	2067
query29	887	627	648	627
query30	256	155	158	155
query31	971	759	730	730
query32	96	54	55	54
query33	760	292	298	292
query34	985	479	494	479
query35	780	629	611	611
query36	1127	977	972	972
query37	153	74	79	74
query38	2946	2816	2842	2816
query39	864	840	842	840
query40	211	128	130	128
query41	59	56	55	55
query42	119	102	105	102
query43	620	549	565	549
query44	1248	733	744	733
query45	203	173	167	167
query46	1055	722	700	700
query47	1867	1763	1754	1754
query48	371	304	296	296
query49	852	433	432	432
query50	757	409	398	398
query51	6874	6726	6764	6726
query52	103	95	95	95
query53	368	295	293	293
query54	895	472	458	458
query55	75	78	77	77
query56	300	289	290	289
query57	1155	1056	1093	1056
query58	257	259	265	259
query59	3245	3327	3184	3184
query60	323	298	289	289
query61	112	112	113	112
query62	615	439	443	439
query63	326	291	291	291
query64	8665	2338	1821	1821
query65	3273	3116	3120	3116
query66	784	331	347	331
query67	15616	14865	14941	14865
query68	8234	541	563	541
query69	705	490	374	374
query70	1193	1144	1157	1144
query71	543	293	293	293
query72	8503	5612	2747	2747
query73	869	325	325	325
query74	5910	5547	5501	5501
query75	4533	2678	2656	2656
query76	4316	918	956	918
query77	754	309	306	306
query78	10721	9755	9709	9709
query79	9366	574	520	520
query80	981	480	490	480
query81	552	222	220	220
query82	225	99	102	99
query83	319	169	167	167
query84	277	83	86	83
query85	937	279	263	263
query86	356	309	300	300
query87	3333	3074	3071	3071
query88	4311	2478	2458	2458
query89	508	395	401	395
query90	2107	187	189	187
query91	131	99	104	99
query92	67	48	50	48
query93	6029	503	494	494
query94	1422	189	188	188
query95	400	314	313	313
query96	616	268	274	268
query97	3272	3042	3048	3042
query98	216	199	202	199
query99	1151	862	817	817
Total cold run time: 289702 ms
Total hot run time: 171620 ms

@doris-robot

ClickBench: Total hot run time: 31.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 70ef4c0ef5f8c39d95ef242c382c1bfd22506b90, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.03	0.04
query2	0.08	0.03	0.04
query3	0.22	0.05	0.05
query4	1.71	0.07	0.07
query5	0.51	0.49	0.48
query6	1.15	0.72	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.04
query9	0.54	0.49	0.50
query10	0.54	0.54	0.54
query11	0.15	0.11	0.11
query12	0.15	0.12	0.13
query13	0.60	0.59	0.60
query14	0.78	0.79	0.77
query15	0.84	0.81	0.81
query16	0.35	0.37	0.37
query17	1.03	1.02	1.03
query18	0.22	0.25	0.27
query19	1.85	1.83	1.74
query20	0.01	0.00	0.00
query21	15.43	0.76	0.67
query22	3.77	7.03	2.51
query23	18.25	1.38	1.34
query24	2.20	0.22	0.22
query25	0.17	0.09	0.08
query26	0.26	0.18	0.17
query27	0.08	0.09	0.08
query28	13.17	1.03	1.00
query29	12.64	3.25	3.26
query30	0.25	0.06	0.06
query31	2.86	0.38	0.40
query32	3.29	0.47	0.49
query33	2.89	2.99	2.90
query34	17.11	4.44	4.40
query35	4.54	4.53	4.48
query36	0.66	0.46	0.46
query37	0.19	0.15	0.15
query38	0.15	0.15	0.14
query39	0.04	0.03	0.03
query40	0.18	0.14	0.14
query41	0.09	0.04	0.04
query42	0.05	0.04	0.05
query43	0.04	0.04	0.05
Total cold run time: 109.15 s
Total hot run time: 31.29 s

@yujun777
Contributor Author

run feut

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Jun 27, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@dataroaring dataroaring merged commit c92e090 into apache:master Jun 27, 2024
yujun777 added a commit to yujun777/doris that referenced this pull request Jun 28, 2024
… candidate speed (apache#36509)

dataroaring pushed a commit that referenced this pull request Jun 28, 2024
dataroaring pushed a commit that referenced this pull request Jun 28, 2024
dataroaring pushed a commit that referenced this pull request Jun 28, 2024
… candidate speed (#36509)

@xiaokang xiaokang mentioned this pull request Jul 14, 2024
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024