@yujun777
Contributor

@yujun777 yujun777 commented Jul 30, 2024

Improve tablet repair scheduling:

  1. If a tablet has version-incomplete replicas, repair those replicas first. Only if the tablet still lacks enough replicas afterwards are new replicas added.

This also fixes a bug. Take a tablet with 3 replicas on 3 backends: if replica A's backend is dead and replica B is missing versions, the tablet's status is REPLICA_MISSING. Because there is no spare backend to host a new replica, scheduling always fails, and replica B's missing versions are never repaired. With this PR, the scheduler tries to repair replica B first and only then tries to add a new replica.

  2. When a data load fails, repair the affected tablet immediately.

  3. Raise the scheduling priority of tablets that:
    a) recently had a failed write;
    b) have version-incomplete replicas;
    c) belong to merge-on-write (MoW) tables.

  4. Fix the health status of colocate tables: if a colocate tablet's replicas are not alive, its status should be unrecoverable.
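
The repair order in the first item can be sketched as follows. This is a minimal illustration in Python, not the Doris FE code (which is Java). `Replica`, `next_repair_action`, and the returned action strings are hypothetical names introduced here, and the real scheduler tracks far more state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """Hypothetical replica record; not a Doris class."""
    name: str
    backend_alive: bool
    version_complete: bool


def next_repair_action(replicas: List[Replica], expected_replicas: int,
                       spare_backends: int) -> Optional[str]:
    """Return the next repair step, or None if nothing can be done right now."""
    alive = [r for r in replicas if r.backend_alive]

    # Step 1: repair version-incomplete replicas on live backends first.
    for r in alive:
        if not r.version_complete:
            return f"repair versions of {r.name}"

    # Step 2: only then add a replica, if the tablet is still short of replicas.
    if len(alive) < expected_replicas and spare_backends > 0:
        return "add new replica"

    # Either the tablet is healthy or no backend can host a new replica yet.
    return None


# Scenario from the description: A's backend is dead, B is missing versions,
# and there is no spare backend.
replicas = [
    Replica("A", backend_alive=False, version_complete=True),
    Replica("B", backend_alive=True, version_complete=False),
    Replica("C", backend_alive=True, version_complete=True),
]
print(next_repair_action(replicas, expected_replicas=3, spare_backends=0))
# -> repair versions of B
```

In this sketch replica B is repaired even though no new replica can be placed, which is the behavior the description says was previously blocked.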

Adjusting priority alone is still not enough, because the scheduler's pending queue is limited to 2000 tasks. TabletChecker can fill the pending queue with 2000 scheduling tasks; once the queue is full, even the highest-priority task cannot be enqueued and has to wait until the queue has room again.
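
The interaction between priority and the queue limit can be illustrated with a small bounded priority queue. This is a sketch under assumptions, not the Doris implementation: `sched_priority`, `PendingQueue`, `offer`, and `poll` are hypothetical names, and the 0/1 priority values are invented for the example. Only the limit of 2000 comes from the description.

```python
import heapq
from typing import List, Tuple

PENDING_LIMIT = 2000  # queue size limit quoted in the description


def sched_priority(write_failed_recently: bool,
                   has_version_incomplete_replica: bool,
                   is_mow: bool) -> int:
    """Hypothetical scoring: any of the three conditions raises the priority."""
    boosted = write_failed_recently or has_version_incomplete_replica or is_mow
    return 1 if boosted else 0


class PendingQueue:
    """Bounded priority queue: a full queue rejects tasks whatever their priority."""

    def __init__(self, limit: int = PENDING_LIMIT) -> None:
        self.limit = limit
        self._heap: List[Tuple[int, int]] = []  # (-priority, tablet_id)

    def offer(self, tablet_id: int, priority: int) -> bool:
        """Try to enqueue a task; return False if the queue is full."""
        if len(self._heap) >= self.limit:
            return False  # no room, even for the highest priority
        heapq.heappush(self._heap, (-priority, tablet_id))
        return True

    def poll(self) -> int:
        """Remove and return the tablet id with the highest priority."""
        return heapq.heappop(self._heap)[1]


queue = PendingQueue(limit=3)
for tablet_id in (1, 2, 3):
    queue.offer(tablet_id, sched_priority(False, False, False))

urgent = sched_priority(True, False, False)
print(queue.offer(99, urgent))  # -> False: the queue is full
queue.poll()                    # one task leaves the queue
print(queue.offer(99, urgent))  # -> True: there is room again
print(queue.poll())             # -> 99: the boosted task is scheduled next
```

The sketch shows why a higher priority helps only once a task is in the queue: admission depends on free capacity, ordering depends on priority.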

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@yujun777
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 41523 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1af1e0f469f452cb89160ff2a496b9824f43ca3a, data reload: false

------ Round 1 ----------------------------------
q1	18089	4257	4168	4168
q2	2425	214	225	214
q3	12232	1320	1392	1320
q4	10337	800	932	800
q5	8111	3036	3002	3002
q6	226	141	139	139
q7	1033	625	637	625
q8	9471	1688	1927	1688
q9	8411	6604	6585	6585
q10	8802	3823	3829	3823
q11	426	243	249	243
q12	406	227	226	226
q13	17760	2930	2912	2912
q14	275	244	247	244
q15	521	476	492	476
q16	486	390	393	390
q17	971	921	877	877
q18	7992	7279	7295	7279
q19	1383	1209	1202	1202
q20	563	322	341	322
q21	5232	4705	4723	4705
q22	357	283	283	283
Total cold run time: 115509 ms
Total hot run time: 41523 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4098	4001	4041	4001
q2	326	235	225	225
q3	2989	2969	2977	2969
q4	1907	1881	1868	1868
q5	5235	5209	5193	5193
q6	219	130	133	130
q7	2030	1680	1697	1680
q8	3203	3243	3272	3243
q9	8290	8246	8287	8246
q10	3738	3807	3813	3807
q11	554	457	437	437
q12	723	544	541	541
q13	14719	2937	2956	2937
q14	291	253	256	253
q15	516	476	477	476
q16	437	398	388	388
q17	1720	1691	1693	1691
q18	7743	7326	7220	7220
q19	1645	1652	1662	1652
q20	1958	1739	1757	1739
q21	5377	5095	5108	5095
q22	524	458	464	458
Total cold run time: 68242 ms
Total hot run time: 54249 ms
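
The bot does not label the table columns. Judging from the totals, they appear to be: query, cold run, two hot runs, and the faster of the two hot runs, all in milliseconds. That reading is an inference, not something the bot states. The short check below recomputes the Round 1 totals from the first three numeric columns under that assumption; `ROUND1` and `totals` are names introduced here.

```python
# (query, cold, hot1, hot2) in ms, copied from the Round 1 table above.
ROUND1 = [
    ("q1", 18089, 4257, 4168), ("q2", 2425, 214, 225),
    ("q3", 12232, 1320, 1392), ("q4", 10337, 800, 932),
    ("q5", 8111, 3036, 3002), ("q6", 226, 141, 139),
    ("q7", 1033, 625, 637), ("q8", 9471, 1688, 1927),
    ("q9", 8411, 6604, 6585), ("q10", 8802, 3823, 3829),
    ("q11", 426, 243, 249), ("q12", 406, 227, 226),
    ("q13", 17760, 2930, 2912), ("q14", 275, 244, 247),
    ("q15", 521, 476, 492), ("q16", 486, 390, 393),
    ("q17", 971, 921, 877), ("q18", 7992, 7279, 7295),
    ("q19", 1383, 1209, 1202), ("q20", 563, 322, 341),
    ("q21", 5232, 4705, 4723), ("q22", 357, 283, 283),
]


def totals(rows):
    """Sum the cold runs and the faster of the two hot runs per query."""
    total_cold = sum(cold for _, cold, _, _ in rows)
    total_hot = sum(min(hot1, hot2) for _, _, hot1, hot2 in rows)
    return total_cold, total_hot


print(totals(ROUND1))  # -> (115509, 41523)
```

Both values match the "Total cold run time" and "Total hot run time" lines reported for Round 1, which supports the column reading above.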

@doris-robot

TPC-DS: Total hot run time: 169211 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1af1e0f469f452cb89160ff2a496b9824f43ca3a, data reload: false

query1	916	376	371	371
query2	6475	1726	1761	1726
query3	6655	212	226	212
query4	20088	17399	17395	17395
query5	4297	530	513	513
query6	282	195	172	172
query7	4606	310	303	303
query8	251	215	197	197
query9	8538	2378	2369	2369
query10	453	282	262	262
query11	10393	9833	10012	9833
query12	143	96	87	87
query13	1616	365	366	365
query14	9692	7141	8497	7141
query15	207	160	167	160
query16	6987	456	430	430
query17	933	554	546	546
query18	1711	287	289	287
query19	193	146	142	142
query20	92	87	86	86
query21	210	103	103	103
query22	4149	4020	3907	3907
query23	33740	33088	32975	32975
query24	10266	3074	3060	3060
query25	674	401	411	401
query26	1726	153	181	153
query27	2833	285	271	271
query28	6737	1960	1944	1944
query29	1194	423	410	410
query30	288	149	152	149
query31	932	769	745	745
query32	105	58	56	56
query33	702	319	342	319
query34	900	475	487	475
query35	871	737	724	724
query36	1009	865	906	865
query37	237	79	78	78
query38	2862	2801	2777	2777
query39	877	798	815	798
query40	281	113	119	113
query41	49	45	46	45
query42	123	100	99	99
query43	479	449	430	430
query44	1209	730	752	730
query45	204	177	174	174
query46	1083	795	789	789
query47	1768	1690	1689	1689
query48	381	291	287	287
query49	1176	441	441	441
query50	913	448	453	448
query51	6761	6772	6639	6639
query52	107	91	93	91
query53	263	189	182	182
query54	638	465	458	458
query55	79	77	79	77
query56	277	259	261	259
query57	1130	1056	1038	1038
query58	275	269	265	265
query59	2751	2587	2419	2419
query60	303	292	289	289
query61	102	114	145	114
query62	924	654	658	654
query63	218	193	193	193
query64	5930	1960	1910	1910
query65	3142	3101	3119	3101
query66	1404	336	341	336
query67	15202	14753	14957	14753
query68	4475	545	557	545
query69	452	311	349	311
query70	1091	1092	1053	1053
query71	361	281	276	276
query72	7508	2687	2565	2565
query73	777	336	324	324
query74	6040	5574	5670	5574
query75	3364	2745	2746	2745
query76	2333	1344	1414	1344
query77	612	319	310	310
query78	9349	8922	8924	8922
query79	1396	545	559	545
query80	947	538	517	517
query81	526	228	226	226
query82	1188	135	130	130
query83	294	170	170	170
query84	262	81	81	81
query85	1277	324	322	322
query86	404	295	295	295
query87	3260	3156	3126	3126
query88	2950	2505	2406	2406
query89	377	303	292	292
query90	1760	206	200	200
query91	130	108	102	102
query92	61	52	55	52
query93	1394	609	617	609
query94	873	284	302	284
query95	388	271	271	271
query96	603	279	278	278
query97	3185	3026	3027	3026
query98	220	199	192	192
query99	1676	1288	1314	1288
Total cold run time: 261545 ms
Total hot run time: 169211 ms

@yujun777
Contributor Author

run buildall

@doris-robot

ClickBench: Total hot run time: 30.07 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 1af1e0f469f452cb89160ff2a496b9824f43ca3a, data reload: false

query1	0.04	0.04	0.04
query2	0.07	0.04	0.04
query3	0.22	0.05	0.05
query4	1.68	0.07	0.07
query5	0.49	0.49	0.48
query6	1.13	0.71	0.72
query7	0.02	0.01	0.02
query8	0.05	0.05	0.04
query9	0.57	0.52	0.52
query10	0.57	0.57	0.57
query11	0.16	0.11	0.12
query12	0.15	0.12	0.13
query13	0.61	0.60	0.59
query14	0.78	0.80	0.80
query15	0.90	0.87	0.85
query16	0.35	0.35	0.35
query17	1.01	0.97	1.01
query18	0.22	0.21	0.21
query19	1.84	1.72	1.76
query20	0.01	0.03	0.00
query21	15.42	0.77	0.66
query22	3.86	8.39	1.42
query23	17.94	1.30	1.32
query24	2.27	0.22	0.22
query25	0.19	0.09	0.08
query26	0.32	0.21	0.22
query27	0.46	0.23	0.23
query28	13.16	1.01	0.97
query29	12.56	3.31	3.30
query30	0.25	0.06	0.06
query31	2.87	0.41	0.40
query32	3.26	0.49	0.48
query33	2.97	2.94	2.93
query34	15.43	4.25	4.23
query35	4.30	4.28	4.29
query36	0.68	0.47	0.49
query37	0.19	0.16	0.16
query38	0.17	0.15	0.15
query39	0.04	0.03	0.04
query40	0.15	0.12	0.12
query41	0.10	0.04	0.05
query42	0.06	0.05	0.04
query43	0.05	0.04	0.04
Total cold run time: 107.57 s
Total hot run time: 30.07 s

@doris-robot

TPC-H: Total hot run time: 41505 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3502bf4c24674f5ba48f9e9ac638978592b517e2, data reload: false

------ Round 1 ----------------------------------
q1	17631	4042	4025	4025
q2	2024	202	207	202
q3	10456	1296	1332	1296
q4	10161	846	965	846
q5	7655	2946	2964	2946
q6	221	140	139	139
q7	1035	612	615	612
q8	9442	1831	1936	1831
q9	8501	6600	6602	6600
q10	8772	3840	3867	3840
q11	431	253	257	253
q12	450	230	230	230
q13	17769	2960	2945	2945
q14	276	251	249	249
q15	513	483	486	483
q16	525	402	395	395
q17	958	895	895	895
q18	8015	7242	7191	7191
q19	1462	1216	1213	1213
q20	565	333	340	333
q21	5223	4703	4800	4703
q22	352	285	278	278
Total cold run time: 112437 ms
Total hot run time: 41505 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4068	4047	4016	4016
q2	332	220	223	220
q3	2980	2983	3134	2983
q4	1964	1992	2000	1992
q5	5616	5487	5445	5445
q6	220	142	135	135
q7	2111	1787	1804	1787
q8	3339	3395	3348	3348
q9	8668	8696	8778	8696
q10	3969	4021	3979	3979
q11	545	448	455	448
q12	713	578	570	570
q13	16302	3159	3115	3115
q14	291	271	267	267
q15	538	477	505	477
q16	469	420	410	410
q17	1776	1724	1724	1724
q18	8238	7754	7834	7754
q19	1729	1698	1717	1698
q20	2026	1894	1806	1806
q21	5794	5336	5387	5336
q22	519	466	465	465
Total cold run time: 72207 ms
Total hot run time: 56671 ms

@doris-robot

TPC-DS: Total hot run time: 169586 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3502bf4c24674f5ba48f9e9ac638978592b517e2, data reload: false

query1	905	385	362	362
query2	6470	1812	1779	1779
query3	6662	212	227	212
query4	20186	17417	17237	17237
query5	3669	536	531	531
query6	276	169	164	164
query7	4594	298	291	291
query8	262	188	199	188
query9	8507	2375	2366	2366
query10	429	293	267	267
query11	10476	10071	10032	10032
query12	124	86	87	86
query13	1629	380	370	370
query14	9176	7105	8664	7105
query15	206	161	171	161
query16	6828	445	458	445
query17	959	577	568	568
query18	1931	304	296	296
query19	198	152	147	147
query20	141	83	82	82
query21	202	100	98	98
query22	4163	4018	4019	4018
query23	33787	33539	33406	33406
query24	10235	3141	3031	3031
query25	702	414	412	412
query26	1676	154	161	154
query27	2963	278	287	278
query28	7641	1973	1979	1973
query29	1275	452	424	424
query30	231	153	156	153
query31	944	802	758	758
query32	101	56	55	55
query33	666	328	335	328
query34	912	528	496	496
query35	911	815	758	758
query36	1041	928	911	911
query37	178	80	81	80
query38	2965	2870	2779	2779
query39	866	843	817	817
query40	247	114	136	114
query41	46	45	46	45
query42	123	100	106	100
query43	497	446	420	420
query44	1162	718	728	718
query45	206	176	176	176
query46	1080	826	798	798
query47	1808	1673	1677	1673
query48	369	297	296	296
query49	895	424	421	421
query50	920	441	437	437
query51	6692	6650	6617	6617
query52	97	93	90	90
query53	259	183	183	183
query54	611	455	450	450
query55	76	76	76	76
query56	289	269	264	264
query57	1151	1008	1016	1008
query58	270	266	301	266
query59	2585	2427	2448	2427
query60	292	283	277	277
query61	133	96	101	96
query62	896	662	663	662
query63	210	181	185	181
query64	5675	1918	1909	1909
query65	3169	3083	3113	3083
query66	1197	326	324	324
query67	15214	14683	14812	14683
query68	4419	553	574	553
query69	697	393	322	322
query70	1161	1074	1062	1062
query71	449	282	288	282
query72	7586	2771	2511	2511
query73	788	322	330	322
query74	6079	5588	5643	5588
query75	3375	2756	2765	2756
query76	2953	1317	1388	1317
query77	581	321	316	316
query78	9403	8851	8835	8835
query79	1877	534	531	531
query80	985	536	520	520
query81	580	225	235	225
query82	829	135	134	134
query83	263	180	172	172
query84	270	82	132	82
query85	1169	325	318	318
query86	458	287	286	286
query87	3275	3097	3083	3083
query88	2972	2395	2422	2395
query89	391	305	293	293
query90	1646	198	198	198
query91	135	103	109	103
query92	62	52	54	52
query93	1542	604	601	601
query94	827	298	294	294
query95	386	271	267	267
query96	604	281	277	277
query97	3159	3024	3033	3024
query98	224	202	193	193
query99	1651	1306	1262	1262
Total cold run time: 262052 ms
Total hot run time: 169586 ms

@doris-robot

ClickBench: Total hot run time: 29.84 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3502bf4c24674f5ba48f9e9ac638978592b517e2, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.04
query3	0.22	0.06	0.05
query4	1.68	0.08	0.08
query5	0.48	0.49	0.48
query6	1.13	0.71	0.72
query7	0.02	0.01	0.02
query8	0.05	0.05	0.04
query9	0.57	0.50	0.51
query10	0.57	0.55	0.56
query11	0.16	0.12	0.12
query12	0.15	0.12	0.12
query13	0.60	0.60	0.59
query14	0.78	0.79	0.79
query15	0.89	0.85	0.87
query16	0.34	0.35	0.34
query17	0.98	0.97	0.99
query18	0.22	0.20	0.21
query19	1.86	1.74	1.73
query20	0.01	0.01	0.00
query21	15.40	0.76	0.65
query22	4.11	8.54	1.19
query23	17.80	1.36	1.39
query24	2.29	0.22	0.22
query25	0.19	0.08	0.07
query26	0.34	0.21	0.21
query27	0.45	0.22	0.22
query28	13.18	1.00	0.97
query29	12.55	3.34	3.32
query30	0.25	0.06	0.05
query31	2.86	0.40	0.40
query32	3.26	0.49	0.48
query33	2.94	2.96	2.90
query34	15.42	4.26	4.25
query35	4.30	4.29	4.30
query36	0.68	0.48	0.50
query37	0.18	0.16	0.15
query38	0.17	0.16	0.14
query39	0.04	0.03	0.03
query40	0.16	0.13	0.13
query41	0.10	0.05	0.04
query42	0.05	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 107.58 s
Total hot run time: 29.84 s

@yujun777 yujun777 force-pushed the impr-tablet-sched-2 branch from acb18c8 to 14717f6 on July 30, 2024 10:28
@yujun777
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 41850 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 14717f655bcf1e90f32008c3a660d07213f25bbf, data reload: false

------ Round 1 ----------------------------------
q1	17610	4057	4037	4037
q2	2018	209	213	209
q3	10445	1326	1314	1314
q4	10179	789	899	789
q5	7669	2986	2955	2955
q6	222	137	139	137
q7	1027	621	610	610
q8	9637	1875	1946	1875
q9	8539	6651	6625	6625
q10	8773	3860	3851	3851
q11	437	257	260	257
q12	410	235	233	233
q13	17763	3002	2965	2965
q14	273	244	248	244
q15	533	482	496	482
q16	524	407	393	393
q17	964	932	910	910
q18	7938	7387	7321	7321
q19	2634	1216	1213	1213
q20	561	343	328	328
q21	5325	4823	4834	4823
q22	357	286	279	279
Total cold run time: 113838 ms
Total hot run time: 41850 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4121	4173	4129	4129
q2	348	244	240	240
q3	3139	3167	3308	3167
q4	2063	2044	2045	2044
q5	5748	5655	5712	5655
q6	224	158	141	141
q7	2258	1933	1872	1872
q8	3328	3399	3426	3399
q9	8765	8692	8818	8692
q10	3907	4081	3862	3862
q11	567	461	466	461
q12	750	588	581	581
q13	16405	3121	3112	3112
q14	301	265	273	265
q15	546	490	488	488
q16	459	400	415	400
q17	1755	1709	1720	1709
q18	8201	7841	7729	7729
q19	1739	1704	1714	1704
q20	2104	1830	1839	1830
q21	5739	5569	5138	5138
q22	532	491	462	462
Total cold run time: 72999 ms
Total hot run time: 57080 ms

@doris-robot

TPC-DS: Total hot run time: 169797 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 14717f655bcf1e90f32008c3a660d07213f25bbf, data reload: false

query1	907	371	372	371
query2	6449	1797	1809	1797
query3	6656	214	234	214
query4	18968	17454	17361	17361
query5	3649	508	515	508
query6	283	178	169	169
query7	4605	283	284	283
query8	266	204	195	195
query9	8492	2366	2360	2360
query10	427	282	260	260
query11	10574	10074	10153	10074
query12	131	87	86	86
query13	1632	366	372	366
query14	8794	7204	8425	7204
query15	204	161	161	161
query16	6852	463	423	423
query17	948	549	543	543
query18	1901	283	296	283
query19	188	145	137	137
query20	88	85	85	85
query21	197	102	97	97
query22	4293	4222	4052	4052
query23	33770	33630	33199	33199
query24	9217	3184	3062	3062
query25	661	385	395	385
query26	1372	150	155	150
query27	3012	279	278	278
query28	7839	2004	2005	2004
query29	1036	420	423	420
query30	231	153	153	153
query31	953	796	750	750
query32	102	58	57	57
query33	721	303	322	303
query34	908	476	501	476
query35	864	752	782	752
query36	1041	879	891	879
query37	177	79	82	79
query38	2934	2838	2901	2838
query39	910	811	813	811
query40	282	110	111	110
query41	45	42	44	42
query42	116	100	98	98
query43	480	445	439	439
query44	1162	739	724	724
query45	210	180	178	178
query46	1078	816	787	787
query47	1820	1721	1716	1716
query48	365	285	287	285
query49	901	431	413	413
query50	924	430	428	428
query51	6783	6744	6664	6664
query52	101	95	92	92
query53	252	184	182	182
query54	662	459	449	449
query55	76	76	76	76
query56	286	265	292	265
query57	1142	1044	1033	1033
query58	277	278	295	278
query59	2759	2431	2510	2431
query60	288	267	280	267
query61	96	96	92	92
query62	887	661	639	639
query63	210	182	180	180
query64	5879	1937	1881	1881
query65	3137	3063	3073	3063
query66	1229	384	329	329
query67	15300	14854	14719	14719
query68	5964	566	574	566
query69	693	402	308	308
query70	1107	1079	1042	1042
query71	429	275	271	271
query72	7570	2693	2535	2535
query73	925	319	322	319
query74	5995	5644	5592	5592
query75	3646	2748	2747	2747
query76	3730	1359	1399	1359
query77	720	309	313	309
query78	9572	9019	9007	9007
query79	1980	544	522	522
query80	812	510	511	510
query81	550	235	225	225
query82	774	130	128	128
query83	196	168	166	166
query84	259	79	77	77
query85	1314	324	307	307
query86	461	294	296	294
query87	3266	3072	3117	3072
query88	3568	2382	2394	2382
query89	393	295	290	290
query90	1760	190	190	190
query91	137	116	111	111
query92	61	53	50	50
query93	1699	597	614	597
query94	767	296	303	296
query95	381	271	274	271
query96	603	291	283	283
query97	3203	3008	3033	3008
query98	219	200	197	197
query99	1704	1281	1295	1281
Total cold run time: 263457 ms
Total hot run time: 169797 ms

@doris-robot

ClickBench: Total hot run time: 30.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 14717f655bcf1e90f32008c3a660d07213f25bbf, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.03
query3	0.22	0.05	0.05
query4	1.68	0.07	0.07
query5	0.49	0.47	0.49
query6	1.14	0.72	0.71
query7	0.02	0.01	0.02
query8	0.05	0.04	0.05
query9	0.56	0.50	0.51
query10	0.56	0.57	0.57
query11	0.16	0.12	0.12
query12	0.16	0.12	0.12
query13	0.61	0.59	0.59
query14	0.78	0.79	0.78
query15	0.88	0.86	0.86
query16	0.35	0.36	0.36
query17	1.01	1.02	1.01
query18	0.23	0.21	0.24
query19	1.82	1.76	1.72
query20	0.01	0.01	0.01
query21	15.39	0.78	0.65
query22	3.80	7.76	1.64
query23	18.05	1.29	1.25
query24	2.24	0.23	0.22
query25	0.18	0.08	0.08
query26	0.32	0.20	0.20
query27	0.46	0.23	0.22
query28	13.16	1.01	0.97
query29	12.57	3.32	3.34
query30	0.26	0.07	0.05
query31	2.86	0.40	0.40
query32	3.23	0.49	0.48
query33	2.90	2.96	2.93
query34	15.41	4.27	4.25
query35	4.30	4.28	4.28
query36	0.67	0.48	0.49
query37	0.18	0.15	0.17
query38	0.16	0.15	0.15
query39	0.04	0.04	0.04
query40	0.16	0.14	0.13
query41	0.10	0.05	0.04
query42	0.06	0.05	0.05
query43	0.04	0.05	0.04
Total cold run time: 107.39 s
Total hot run time: 30.29 s

@yujun777
Contributor Author

run buildall

@yujun777 yujun777 changed the title from "[improvement](tablet scheduler) improve tablet sched priority" to "[improvement](tablet scheduler) Adjust tablet sched priority to improve load data succ" on Jul 30, 2024
@doris-robot

TPC-H: Total hot run time: 43523 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b3021b99795b5ae2c4ef7c178ef0c458f0c6d884, data reload: false

------ Round 1 ----------------------------------
q1	18143	4383	4409	4383
q2	2585	216	199	199
q3	11716	1394	1459	1394
q4	10793	866	1023	866
q5	8026	3162	3177	3162
q6	247	140	139	139
q7	1116	635	637	635
q8	9469	1966	1981	1966
q9	9026	7186	7181	7181
q10	8763	3935	3918	3918
q11	431	259	251	251
q12	453	230	232	230
q13	17752	2935	2951	2935
q14	281	238	243	238
q15	528	490	487	487
q16	490	420	405	405
q17	999	956	960	956
q18	8035	7426	7294	7294
q19	1408	1357	1357	1357
q20	600	327	321	321
q21	5467	4933	4980	4933
q22	365	280	273	273
Total cold run time: 116693 ms
Total hot run time: 43523 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4281	4252	4396	4252
q2	343	227	225	225
q3	2993	3064	3056	3056
q4	1932	1949	1931	1931
q5	5347	5335	5438	5335
q6	238	134	133	133
q7	2127	1720	1721	1720
q8	3454	3531	3522	3522
q9	8842	8802	8798	8798
q10	3846	3929	3959	3929
q11	570	452	447	447
q12	774	552	536	536
q13	11750	2976	2996	2976
q14	299	260	253	253
q15	531	508	485	485
q16	448	412	390	390
q17	1886	1841	1850	1841
q18	7810	7417	7271	7271
q19	1838	1840	1836	1836
q20	2043	1779	1730	1730
q21	5632	5410	5335	5335
q22	526	441	466	441
Total cold run time: 67510 ms
Total hot run time: 56442 ms

@yujun777 yujun777 changed the title from "[improvement](tablet scheduler) Adjust tablet sched priority to improve load data succ" to "[improvement](tablet scheduler) Adjust tablet sched priority to help load data succ" on Jul 30, 2024
@doris-robot

TPC-DS: Total hot run time: 169172 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b3021b99795b5ae2c4ef7c178ef0c458f0c6d884, data reload: false

query1	918	385	380	380
query2	6471	1792	1777	1777
query3	6673	217	231	217
query4	20281	17358	17267	17267
query5	4292	515	524	515
query6	288	173	171	171
query7	4613	289	296	289
query8	275	203	201	201
query9	8495	2379	2357	2357
query10	477	288	269	269
query11	10537	9998	9920	9920
query12	141	92	86	86
query13	1631	380	370	370
query14	9213	7339	7839	7339
query15	201	166	167	166
query16	7100	450	437	437
query17	963	560	537	537
query18	1921	277	296	277
query19	197	145	161	145
query20	90	86	88	86
query21	211	99	100	99
query22	4236	3911	3852	3852
query23	33631	32733	32962	32733
query24	10320	3078	3082	3078
query25	697	407	400	400
query26	1818	147	166	147
query27	3025	269	268	268
query28	7025	1974	1942	1942
query29	1363	419	419	419
query30	276	155	149	149
query31	943	766	753	753
query32	98	56	54	54
query33	699	308	323	308
query34	918	484	475	475
query35	884	725	735	725
query36	988	865	853	853
query37	260	80	82	80
query38	2891	2766	2805	2766
query39	868	806	822	806
query40	277	113	115	113
query41	51	47	44	44
query42	122	132	103	103
query43	467	446	433	433
query44	1194	729	728	728
query45	219	177	177	177
query46	1076	827	772	772
query47	1806	1719	1720	1719
query48	354	299	287	287
query49	1197	431	420	420
query50	901	436	451	436
query51	6866	6770	6803	6770
query52	100	94	90	90
query53	260	178	186	178
query54	639	451	450	450
query55	78	76	77	76
query56	300	271	255	255
query57	1126	1031	1066	1031
query58	287	269	279	269
query59	2703	2446	2612	2446
query60	293	280	275	275
query61	100	98	97	97
query62	909	691	666	666
query63	205	187	192	187
query64	5872	1907	1953	1907
query65	3171	3091	3108	3091
query66	1432	330	339	330
query67	15533	14813	14844	14813
query68	4366	559	577	559
query69	450	303	298	298
query70	1115	1060	1045	1045
query71	411	280	274	274
query72	7144	2736	2511	2511
query73	775	323	325	323
query74	5964	5589	5607	5589
query75	3382	2764	2771	2764
query76	2216	1354	1397	1354
query77	442	312	306	306
query78	9383	8896	8880	8880
query79	1834	544	526	526
query80	1208	515	527	515
query81	570	228	226	226
query82	1029	131	133	131
query83	248	182	177	177
query84	280	88	146	88
query85	1344	321	312	312
query86	387	309	302	302
query87	3268	3104	3107	3104
query88	2959	2413	2408	2408
query89	394	290	299	290
query90	1807	202	194	194
query91	129	102	106	102
query92	62	52	54	52
query93	1502	613	610	610
query94	902	302	298	298
query95	388	275	270	270
query96	602	280	280	280
query97	3210	3031	3041	3031
query98	225	192	199	192
query99	1622	1303	1285	1285
Total cold run time: 263084 ms
Total hot run time: 169172 ms

@doris-robot

ClickBench: Total hot run time: 30.08 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b3021b99795b5ae2c4ef7c178ef0c458f0c6d884, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.04
query3	0.21	0.05	0.05
query4	1.68	0.06	0.07
query5	0.49	0.47	0.50
query6	1.14	0.72	0.72
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.55	0.50	0.52
query10	0.57	0.59	0.57
query11	0.16	0.12	0.12
query12	0.15	0.12	0.12
query13	0.61	0.59	0.60
query14	0.77	0.79	0.83
query15	0.91	0.85	0.86
query16	0.35	0.35	0.35
query17	0.99	0.98	0.98
query18	0.22	0.21	0.22
query19	1.79	1.74	1.72
query20	0.01	0.01	0.01
query21	15.40	0.77	0.65
query22	4.33	7.60	1.47
query23	18.07	1.39	1.28
query24	2.24	0.23	0.21
query25	0.18	0.09	0.08
query26	0.31	0.22	0.21
query27	0.46	0.23	0.23
query28	13.16	1.00	0.97
query29	12.57	3.33	3.30
query30	0.25	0.06	0.05
query31	2.86	0.42	0.41
query32	3.23	0.50	0.49
query33	2.95	2.95	2.93
query34	15.51	4.24	4.25
query35	4.30	4.29	4.26
query36	0.69	0.47	0.47
query37	0.19	0.17	0.17
query38	0.17	0.15	0.16
query39	0.04	0.04	0.04
query40	0.16	0.13	0.13
query41	0.09	0.04	0.05
query42	0.05	0.04	0.04
query43	0.04	0.04	0.04
Total cold run time: 108.03 s
Total hot run time: 30.08 s

@yujun777
Contributor Author

run buildall

@yujun777 yujun777 force-pushed the impr-tablet-sched-2 branch from 25580c9 to 499046c on July 31, 2024 02:33
@yujun777
Contributor Author

run buildall

3 similar comments
@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

run feut

@yujun777
Contributor Author

run buildall

@yujun777 yujun777 force-pushed the impr-tablet-sched-2 branch from b1cbeea to 84bf2b8 on July 31, 2024 09:03
@yujun777
Contributor Author

run buildall

@yujun777 yujun777 force-pushed the impr-tablet-sched-2 branch from 84bf2b8 to b2ce3c2 on August 1, 2024 01:47
@github-actions github-actions bot added the doing label Aug 1, 2024
@yujun777
Contributor Author

yujun777 commented Aug 1, 2024

run buildall

@yujun777
Contributor Author

yujun777 commented Aug 1, 2024

run performance

@yujun777
Contributor Author

yujun777 commented Aug 1, 2024

run buildall

@github-actions github-actions bot removed the approved label Aug 1, 2024
@yujun777
Contributor Author

yujun777 commented Aug 1, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 41731 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 43858d4dbddcb430475e497db9daea4429f81dd5, data reload: false

------ Round 1 ----------------------------------
q1	17638	4129	4131	4129
q2	2023	205	219	205
q3	10618	1295	1366	1295
q4	10334	824	891	824
q5	7581	3033	2997	2997
q6	228	139	136	136
q7	1054	601	596	596
q8	9437	1852	1973	1852
q9	8516	6661	6618	6618
q10	8769	3834	3822	3822
q11	442	245	255	245
q12	408	227	227	227
q13	17763	2952	2924	2924
q14	277	248	251	248
q15	530	489	502	489
q16	531	396	392	392
q17	978	944	913	913
q18	7979	7304	7299	7299
q19	1384	1239	1221	1221
q20	565	315	326	315
q21	5334	4704	4802	4704
q22	352	291	280	280
Total cold run time: 112741 ms
Total hot run time: 41731 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4036	4029	4011	4011
q2	338	230	223	223
q3	2981	3016	3149	3016
q4	2010	2054	1962	1962
q5	5610	5512	5683	5512
q6	224	131	130	130
q7	2154	1786	1834	1786
q8	3312	3386	3685	3386
q9	8689	8653	8774	8653
q10	3961	4060	3972	3972
q11	552	463	458	458
q12	777	590	602	590
q13	15609	3119	3138	3119
q14	303	297	271	271
q15	553	484	492	484
q16	471	422	422	422
q17	1768	1735	1720	1720
q18	8233	7674	7918	7674
q19	1783	1719	1694	1694
q20	2082	1873	1825	1825
q21	5779	5467	5312	5312
q22	512	436	466	436
Total cold run time: 71737 ms
Total hot run time: 56656 ms

@doris-robot

TPC-DS: Total hot run time: 170219 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 43858d4dbddcb430475e497db9daea4429f81dd5, data reload: false

query1	915	378	378	378
query2	6470	1726	1736	1726
query3	6659	208	226	208
query4	20224	17602	17338	17338
query5	3672	515	533	515
query6	300	171	168	168
query7	4594	321	323	321
query8	265	198	200	198
query9	8526	2423	2411	2411
query10	449	282	288	282
query11	10495	10039	10022	10022
query12	119	90	88	88
query13	1618	372	373	372
query14	9625	7566	7017	7017
query15	205	163	162	162
query16	6942	432	441	432
query17	950	564	562	562
query18	1895	307	283	283
query19	195	144	140	140
query20	91	85	85	85
query21	198	99	96	96
query22	4210	4152	3994	3994
query23	33924	33837	33401	33401
query24	10349	3153	3091	3091
query25	702	419	407	407
query26	1858	158	155	155
query27	3331	284	289	284
query28	7455	2054	2038	2038
query29	1305	449	425	425
query30	240	160	160	160
query31	965	775	805	775
query32	108	55	61	55
query33	707	315	331	315
query34	942	511	522	511
query35	896	747	755	747
query36	1037	882	855	855
query37	282	79	79	79
query38	2908	2805	2752	2752
query39	870	804	834	804
query40	258	110	111	110
query41	45	44	44	44
query42	140	101	102	101
query43	508	423	428	423
query44	1185	727	733	727
query45	220	179	181	179
query46	1095	800	764	764
query47	1795	1714	1699	1699
query48	356	294	294	294
query49	922	422	409	409
query50	907	430	437	430
query51	6832	6758	6715	6715
query52	118	92	93	92
query53	249	182	178	178
query54	626	455	462	455
query55	81	78	78	78
query56	284	270	279	270
query57	1147	1061	1024	1024
query58	266	259	309	259
query59	2611	2618	2479	2479
query60	294	285	298	285
query61	94	91	94	91
query62	892	666	657	657
query63	210	176	182	176
query64	5620	1905	1858	1858
query65	3200	3099	3089	3089
query66	1331	338	329	329
query67	15334	14874	14863	14863
query68	4314	580	587	580
query69	568	344	311	311
query70	1121	1031	1103	1031
query71	415	278	278	278
query72	7185	2706	2529	2529
query73	763	336	333	333
query74	6088	5606	5623	5606
query75	3408	2747	2752	2747
query76	2152	1370	1391	1370
query77	473	316	306	306
query78	9500	8944	8922	8922
query79	1730	533	547	533
query80	934	509	507	507
query81	556	227	228	227
query82	675	132	130	130
query83	273	171	167	167
query84	258	78	77	77
query85	1199	314	311	311
query86	478	309	317	309
query87	3331	3114	3105	3105
query88	3844	2517	2600	2517
query89	376	295	290	290
query90	1780	190	191	190
query91	123	99	101	99
query92	58	49	49	49
query93	2183	635	618	618
query94	782	300	287	287
query95	381	264	262	262
query96	604	283	287	283
query97	3242	3054	3015	3015
query98	225	207	201	201
query99	1676	1299	1307	1299
Total cold run time: 263811 ms
Total hot run time: 170219 ms

@doris-robot

ClickBench: Total hot run time: 29.65 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 43858d4dbddcb430475e497db9daea4429f81dd5, data reload: false

query1	0.05	0.03	0.03
query2	0.08	0.04	0.04
query3	0.22	0.05	0.05
query4	1.69	0.07	0.06
query5	0.49	0.47	0.48
query6	1.14	0.72	0.71
query7	0.02	0.01	0.01
query8	0.05	0.04	0.05
query9	0.57	0.50	0.52
query10	0.56	0.56	0.56
query11	0.15	0.12	0.12
query12	0.15	0.12	0.13
query13	0.61	0.60	0.60
query14	0.77	0.78	0.81
query15	0.91	0.87	0.86
query16	0.36	0.36	0.35
query17	0.99	1.00	1.01
query18	0.22	0.20	0.21
query19	1.86	1.73	1.70
query20	0.02	0.01	0.01
query21	15.39	0.78	0.65
query22	4.35	8.75	1.04
query23	17.88	1.27	1.33
query24	2.25	0.23	0.22
query25	0.18	0.08	0.09
query26	0.33	0.22	0.22
query27	0.46	0.23	0.23
query28	13.16	1.00	0.98
query29	12.52	3.29	3.28
query30	0.28	0.06	0.07
query31	3.23	0.41	0.40
query32	3.25	0.49	0.48
query33	2.92	2.98	2.92
query34	15.43	4.25	4.26
query35	4.31	4.29	4.28
query36	0.68	0.47	0.48
query37	0.19	0.17	0.17
query38	0.18	0.15	0.15
query39	0.04	0.04	0.03
query40	0.16	0.13	0.13
query41	0.09	0.05	0.05
query42	0.05	0.04	0.04
query43	0.05	0.04	0.04
Total cold run time: 108.29 s
Total hot run time: 29.65 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved label Aug 4, 2024
@github-actions
Contributor

github-actions bot commented Aug 4, 2024

PR approved by at least one committer and no changes requested.

Contributor

@deardeng deardeng left a comment

LGTM

@dataroaring dataroaring merged commit 18bbda0 into apache:master Aug 5, 2024
dataroaring pushed a commit that referenced this pull request Aug 5, 2024
dataroaring pushed a commit that referenced this pull request Aug 11, 2024
…load data succ (#38528)

Improve tablet repair scheduling:

1. If a tablet has version-incomplete replicas, repair those replicas
first. Only if the tablet still lacks enough replicas afterwards are new
replicas added.

This also fixes a bug. Take a tablet with 3 replicas on 3 backends: if
replica A's backend is dead and replica B is missing versions, the
tablet's status is REPLICA_MISSING. Because there is no spare backend to
host a new replica, scheduling always fails, and replica B's missing
versions are never repaired. With this PR, the scheduler tries to repair
replica B first and only then tries to add a new replica.

2. When a data load fails, repair the affected tablet immediately.

3. Raise the scheduling priority of tablets that:
a) recently had a failed write;
b) have version-incomplete replicas;
c) belong to merge-on-write (MoW) tables.

4. Fix the health status of colocate tables: if a colocate tablet's
replicas are not alive, its status should be unrecoverable.

Adjusting priority alone is still not enough, because the scheduler's
pending queue is limited to 2000 tasks. TabletChecker can fill the
pending queue with 2000 scheduling tasks; once the queue is full, even
the highest-priority task cannot be enqueued and has to wait until the
queue has room again.
dataroaring pushed a commit that referenced this pull request Aug 16, 2024
Labels

approved dev/2.1.6-merged dev/3.0.2-merged doing reviewed

4 participants