@kaka11chen kaka11chen commented Sep 10, 2024

Proposed changes

Issue

When a scanner scheduler is stuck executing one scan task, other scan tasks starve and get no chance to run, which affects other queries. Currently, a scan task tries to scan as much data as possible in each run to reduce the overhead of scheduling switches: it keeps reading until it has collected up to 10MB of data, the limit set by doris_scanner_row_bytes. However, if a query scans a table with many rows and the filtering rate is very high, the filter discards most of the data and the task never accumulates 10MB. It keeps reading data and evaluating filter expressions, which starves other scan tasks.

Solution

The solution is to measure elapsed time with MonotonicStopWatch and compare it against max_run_time_ms. After running for at most 1s, a scan task yields so that other tasks can execute. As a consequence, a scan task that performs time-consuming work has to do that work in slices.
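
The time-slicing idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: it uses `std::chrono::steady_clock` in place of Doris's `MonotonicStopWatch`, and `SliceResult`, `run_scan_slice`, and the `read_batch` callback are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical outcome of one scheduling slice (not a Doris type).
struct SliceResult {
    std::size_t bytes_collected = 0;  // bytes that survived filtering in this slice
    bool eos = false;                 // the data source is exhausted
    bool yielded = false;             // the time budget ran out before the byte target was met
};

// Runs one slice of a scan task.
// read_batch is a hypothetical callback: it reads and filters one batch,
// returns the number of bytes that passed the filter, and sets *eos when
// no more data is left.
SliceResult run_scan_slice(const std::function<std::size_t(bool*)>& read_batch,
                           std::size_t byte_target,
                           std::chrono::milliseconds max_run_time) {
    assert(max_run_time.count() > 0);
    const auto start = std::chrono::steady_clock::now();
    SliceResult result;
    // Without the time check, a highly selective filter would keep this loop
    // running until end of stream, because the byte target is never reached.
    while (!result.eos && result.bytes_collected < byte_target) {
        if (std::chrono::steady_clock::now() - start >= max_run_time) {
            result.yielded = true;  // give the scheduler back to other tasks
            break;
        }
        result.bytes_collected += read_batch(&result.eos);
    }
    return result;
}
```

With a 10MB byte target and a 1s budget, a task whose batches are almost entirely filtered out would return with `yielded` set after roughly one second, and the caller could then requeue it behind the other waiting scan tasks. How the real scheduler requeues a yielded task is not shown here.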

@kaka11chen
Contributor Author

run buildall

@kaka11chen kaka11chen force-pushed the opt_scanner_schedule_starvation_1 branch from 86c08f6 to ca2e7e3 on September 11, 2024 01:44
@kaka11chen
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.82% (9394/25510)
Line Coverage: 28.23% (77508/274553)
Region Coverage: 27.62% (40006/144820)
Branch Coverage: 24.25% (20350/83908)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ca2e7e30d79d976c343e3d870765294fd5f6a6f0_ca2e7e30d79d976c343e3d870765294fd5f6a6f0/report/index.html

@kaka11chen kaka11chen force-pushed the opt_scanner_schedule_starvation_1 branch from ca2e7e3 to af7b099 on September 18, 2024 02:05
@kaka11chen
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 42132 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit af7b099b0728f301be3b992c3f35c5d82c8d3c60, data reload: false

------ Round 1 ----------------------------------
q1	17912	7483	7273	7273
q2	2060	167	161	161
q3	11267	1214	1206	1206
q4	10757	741	807	741
q5	7738	3261	3153	3153
q6	236	148	148	148
q7	1031	633	623	623
q8	9733	2146	2081	2081
q9	7041	6545	6527	6527
q10	7402	2315	2244	2244
q11	450	254	257	254
q12	410	221	215	215
q13	17787	3021	2997	2997
q14	257	222	228	222
q15	576	542	529	529
q16	680	621	631	621
q17	1004	822	818	818
q18	7528	7060	6792	6792
q19	1409	1012	945	945
q20	615	276	277	276
q21	4025	3413	3289	3289
q22	1118	1042	1017	1017
Total cold run time: 111036 ms
Total hot run time: 42132 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7249	7245	7256	7245
q2	325	231	227	227
q3	2904	2798	2835	2798
q4	1927	1702	1706	1702
q5	5384	5429	5431	5429
q6	231	138	138	138
q7	2107	1743	1719	1719
q8	3182	3342	3344	3342
q9	8325	8384	8410	8384
q10	3372	3366	3376	3366
q11	583	466	487	466
q12	795	598	580	580
q13	5660	2953	2994	2953
q14	285	264	270	264
q15	560	499	502	499
q16	722	668	675	668
q17	1779	1536	1548	1536
q18	7587	7306	7394	7306
q19	1655	1453	1533	1453
q20	2003	1819	1809	1809
q21	5365	5159	5194	5159
q22	1111	1036	1006	1006
Total cold run time: 63111 ms
Total hot run time: 58049 ms

@doris-robot

TPC-DS: Total hot run time: 194081 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit af7b099b0728f301be3b992c3f35c5d82c8d3c60, data reload: false

query1	976	358	385	358
query2	6513	2071	2049	2049
query3	6705	209	223	209
query4	33668	23626	23386	23386
query5	4326	480	454	454
query6	270	175	156	156
query7	4620	296	303	296
query8	272	210	230	210
query9	9553	2680	2675	2675
query10	481	281	287	281
query11	18064	15125	15212	15125
query12	161	98	98	98
query13	1625	428	393	393
query14	10533	7632	6582	6582
query15	253	179	179	179
query16	8094	448	474	448
query17	1646	582	564	564
query18	2284	344	302	302
query19	364	143	144	143
query20	118	112	107	107
query21	214	103	105	103
query22	4641	4398	4231	4231
query23	34989	34131	34031	34031
query24	11136	2969	2835	2835
query25	631	383	381	381
query26	1198	161	159	159
query27	2871	281	279	279
query28	8065	2478	2455	2455
query29	832	402	401	401
query30	325	176	153	153
query31	990	768	768	768
query32	93	55	65	55
query33	769	301	286	286
query34	983	481	480	480
query35	889	730	728	728
query36	1095	948	926	926
query37	164	90	82	82
query38	4020	3923	3969	3923
query39	1448	1413	1423	1413
query40	205	92	93	92
query41	49	45	47	45
query42	119	94	96	94
query43	526	484	487	484
query44	1300	830	802	802
query45	192	161	167	161
query46	1139	759	774	759
query47	1884	1796	1822	1796
query48	438	365	366	365
query49	1119	400	383	383
query50	826	404	402	402
query51	7103	7025	6880	6880
query52	97	86	87	86
query53	253	180	177	177
query54	1208	463	457	457
query55	74	82	80	80
query56	279	255	257	255
query57	1229	1084	1061	1061
query58	232	224	229	224
query59	3069	2980	2823	2823
query60	292	247	260	247
query61	104	97	105	97
query62	859	669	784	669
query63	210	195	185	185
query64	5232	636	624	624
query65	3251	3169	3177	3169
query66	1436	341	315	315
query67	16097	15656	15347	15347
query68	3108	874	861	861
query69	463	355	348	348
query70	1225	1198	1139	1139
query71	338	334	335	334
query72	6083	3860	3882	3860
query73	598	578	586	578
query74	9262	9089	8962	8962
query75	3094	2893	2923	2893
query76	1965	868	880	868
query77	386	364	359	359
query78	9388	9275	9284	9275
query79	927	871	893	871
query80	615	639	570	570
query81	446	250	251	250
query82	233	229	236	229
query83	160	157	179	157
query84	234	109	106	106
query85	685	356	349	349
query86	313	300	308	300
query87	4421	4318	4293	4293
query88	4365	4065	4042	4042
query89	367	357	354	354
query90	1451	309	306	306
query91	162	165	164	164
query92	79	69	72	69
query93	918	910	889	889
query94	572	345	379	345
query95	417	410	409	409
query96	487	484	481	481
query97	3116	3182	3114	3114
query98	233	222	224	222
query99	1394	1325	1301	1301
Total cold run time: 292206 ms
Total hot run time: 194081 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.33% (9583/25672)
Line Coverage: 28.72% (79240/275883)
Region Coverage: 28.19% (41017/145520)
Branch Coverage: 24.81% (20905/84254)
Coverage Report: http://coverage.selectdb-in.cc/coverage/af7b099b0728f301be3b992c3f35c5d82c8d3c60_af7b099b0728f301be3b992c3f35c5d82c8d3c60/report/index.html

@doris-robot

ClickBench: Total hot run time: 32.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit af7b099b0728f301be3b992c3f35c5d82c8d3c60, data reload: false

query1	0.05	0.04	0.04
query2	0.06	0.02	0.03
query3	0.24	0.06	0.06
query4	1.65	0.11	0.10
query5	0.52	0.50	0.49
query6	1.16	0.73	0.73
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.56	0.50	0.50
query10	0.55	0.55	0.56
query11	0.15	0.10	0.11
query12	0.15	0.11	0.11
query13	0.60	0.59	0.58
query14	2.97	2.93	2.98
query15	0.90	0.84	0.83
query16	0.38	0.38	0.37
query17	1.03	1.01	1.06
query18	0.23	0.21	0.20
query19	1.90	1.97	1.90
query20	0.01	0.00	0.01
query21	15.36	0.60	0.58
query22	2.64	2.56	1.92
query23	17.09	1.11	0.86
query24	2.92	0.94	1.47
query25	0.24	0.13	0.04
query26	0.41	0.15	0.13
query27	0.04	0.03	0.04
query28	10.33	1.09	1.07
query29	12.58	3.20	3.15
query30	0.24	0.05	0.06
query31	2.89	0.37	0.38
query32	3.29	0.47	0.46
query33	2.98	3.02	3.01
query34	17.12	4.41	4.39
query35	4.42	4.40	4.37
query36	0.68	0.48	0.49
query37	0.08	0.06	0.05
query38	0.04	0.03	0.04
query39	0.03	0.02	0.02
query40	0.15	0.12	0.13
query41	0.07	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.02
Total cold run time: 106.84 s
Total hot run time: 32.69 s

@kaka11chen kaka11chen marked this pull request as ready for review September 18, 2024 14:42
morningman previously approved these changes Sep 20, 2024
Contributor

@morningman morningman left a comment

LGTM

@kaka11chen kaka11chen force-pushed the opt_scanner_schedule_starvation_1 branch 3 times, most recently from a1b1e78 to 2087b56 on September 23, 2024 05:30
@kaka11chen
Contributor Author

run buildall

Contributor

@zhiqiang-hhhh zhiqiang-hhhh left a comment

LGTM

@github-actions
Contributor

PR approved by anyone and no changes requested.

@kaka11chen kaka11chen force-pushed the opt_scanner_schedule_starvation_1 branch from 2087b56 to 725afd0 on September 23, 2024 06:45
@kaka11chen
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 43020 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 725afd027645b108ae0b9d6dbac3074ca69ed5af, data reload: false

------ Round 1 ----------------------------------
q1	17581	7681	7213	7213
q2	2047	1313	1427	1313
q3	11498	1175	1178	1175
q4	10375	746	696	696
q5	7823	3202	3157	3157
q6	244	154	150	150
q7	1055	617	624	617
q8	9532	2140	2069	2069
q9	7037	6520	6448	6448
q10	7013	2257	2299	2257
q11	455	250	249	249
q12	406	218	213	213
q13	17776	3012	2961	2961
q14	248	211	233	211
q15	572	524	503	503
q16	701	618	622	618
q17	979	836	799	799
q18	7373	6802	6767	6767
q19	1404	1069	1030	1030
q20	577	289	281	281
q21	4234	3295	3263	3263
q22	1125	1034	1030	1030
Total cold run time: 110055 ms
Total hot run time: 43020 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7272	7261	7291	7261
q2	349	253	248	248
q3	3124	2943	2944	2943
q4	2000	1821	1781	1781
q5	5599	5783	5597	5597
q6	229	138	138	138
q7	2162	1814	1787	1787
q8	3265	3436	3440	3436
q9	8725	8776	8755	8755
q10	3536	3509	3482	3482
q11	567	491	499	491
q12	798	626	618	618
q13	5444	3179	3285	3179
q14	311	291	276	276
q15	576	541	530	530
q16	759	666	690	666
q17	1804	1602	1589	1589
q18	8159	7740	7841	7740
q19	1720	1555	1513	1513
q20	2125	1889	1908	1889
q21	5649	5427	5340	5340
q22	1123	1042	1012	1012
Total cold run time: 65296 ms
Total hot run time: 60271 ms

@doris-robot

TPC-DS: Total hot run time: 195427 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 725afd027645b108ae0b9d6dbac3074ca69ed5af, data reload: false

query1	1200	745	745	745
query2	6252	2118	2095	2095
query3	10783	4067	3945	3945
query4	66988	28603	23465	23465
query5	4835	502	470	470
query6	414	163	163	163
query7	5544	309	292	292
query8	299	221	213	213
query9	8890	2635	2614	2614
query10	456	283	272	272
query11	17210	15172	15782	15172
query12	173	100	100	100
query13	1481	428	409	409
query14	10518	7747	7550	7550
query15	195	170	166	166
query16	7024	445	423	423
query17	1096	575	562	562
query18	1722	331	319	319
query19	192	154	148	148
query20	116	106	105	105
query21	201	106	101	101
query22	4531	4163	4398	4163
query23	34346	33930	34532	33930
query24	6042	2959	2850	2850
query25	520	408	428	408
query26	655	163	163	163
query27	1652	283	293	283
query28	3840	2461	2424	2424
query29	684	434	438	434
query30	246	160	158	158
query31	971	804	840	804
query32	75	57	57	57
query33	451	311	292	292
query34	891	508	494	494
query35	870	744	726	726
query36	1064	952	948	948
query37	149	85	85	85
query38	4025	3920	3948	3920
query39	1503	1434	1409	1409
query40	208	98	97	97
query41	50	50	50	50
query42	119	101	99	99
query43	532	497	484	484
query44	1168	819	804	804
query45	199	167	168	167
query46	1151	772	739	739
query47	1907	1821	1818	1818
query48	459	358	361	358
query49	713	417	421	417
query50	867	410	406	406
query51	7127	6801	6873	6801
query52	106	89	90	89
query53	254	177	182	177
query54	610	480	468	468
query55	79	80	78	78
query56	290	280	279	279
query57	1226	1113	1095	1095
query58	235	240	235	235
query59	3166	3074	3088	3074
query60	314	267	279	267
query61	107	108	102	102
query62	776	700	652	652
query63	214	182	185	182
query64	1332	654	608	608
query65	3225	3181	3193	3181
query66	640	323	303	303
query67	16075	15543	15564	15543
query68	3755	589	565	565
query69	537	303	311	303
query70	1167	1133	1128	1128
query71	389	268	278	268
query72	6695	4250	4009	4009
query73	759	327	332	327
query74	9753	8970	9169	8970
query75	3414	2693	2689	2689
query76	2596	905	841	841
query77	510	287	297	287
query78	9897	9245	9156	9156
query79	2160	546	551	546
query80	1003	441	459	441
query81	548	243	243	243
query82	244	144	141	141
query83	167	136	157	136
query84	290	83	82	82
query85	949	305	286	286
query86	371	312	294	294
query87	4468	4459	4338	4338
query88	3712	2334	2331	2331
query89	396	281	283	281
query90	1919	190	193	190
query91	200	146	148	146
query92	58	52	50	50
query93	2373	542	541	541
query94	763	299	296	296
query95	358	261	253	253
query96	610	287	282	282
query97	3271	3098	3112	3098
query98	218	193	200	193
query99	1557	1304	1266	1266
Total cold run time: 316360 ms
Total hot run time: 195427 ms

@doris-robot

ClickBench: Total hot run time: 33.62 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 725afd027645b108ae0b9d6dbac3074ca69ed5af, data reload: false

query1	0.05	0.05	0.04
query2	0.06	0.03	0.03
query3	0.23	0.06	0.06
query4	1.65	0.10	0.10
query5	0.52	0.52	0.53
query6	1.13	0.73	0.73
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.58	0.51	0.50
query10	0.55	0.54	0.56
query11	0.15	0.11	0.11
query12	0.14	0.11	0.11
query13	0.60	0.59	0.59
query14	2.98	2.90	2.93
query15	0.90	0.82	0.82
query16	0.38	0.39	0.38
query17	1.05	1.07	1.07
query18	0.21	0.21	0.21
query19	1.87	1.81	1.98
query20	0.01	0.01	0.01
query21	15.36	0.61	0.60
query22	2.66	2.52	2.48
query23	16.90	0.88	0.75
query24	3.58	1.15	1.88
query25	0.15	0.16	0.25
query26	0.53	0.14	0.14
query27	0.04	0.05	0.04
query28	9.47	1.10	1.06
query29	12.55	3.27	3.21
query30	0.25	0.06	0.06
query31	2.86	0.39	0.37
query32	3.30	0.46	0.46
query33	2.97	3.03	2.98
query34	16.86	4.38	4.34
query35	4.43	4.46	4.47
query36	0.66	0.50	0.48
query37	0.09	0.06	0.06
query38	0.04	0.04	0.03
query39	0.02	0.02	0.03
query40	0.15	0.12	0.13
query41	0.07	0.02	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.14 s
Total hot run time: 33.62 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.27% (9606/25775)
Line Coverage: 28.68% (79508/277186)
Region Coverage: 28.12% (41107/146187)
Branch Coverage: 24.76% (20956/84624)
Coverage Report: http://coverage.selectdb-in.cc/coverage/725afd027645b108ae0b9d6dbac3074ca69ed5af_725afd027645b108ae0b9d6dbac3074ca69ed5af/report/index.html

@github-actions github-actions bot added the approved label Sep 26, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@yiguolei yiguolei merged commit e40f7f8 into apache:master Sep 26, 2024
kaka11chen added a commit to kaka11chen/doris that referenced this pull request Sep 30, 2024
…che#40641)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Sep 30, 2024
…che#40641)

morningman pushed a commit that referenced this pull request Sep 30, 2024
morningman pushed a commit that referenced this pull request Sep 30, 2024
cjj2010 pushed a commit to cjj2010/doris that referenced this pull request Oct 12, 2024
…che#40641)

morningman pushed a commit that referenced this pull request Dec 30, 2024
…ate materialization case of parquet reader (#46121)

### What problem does this PR solve?

Related PR: #40641

Problem Summary:

[Fix](parquet-reader) Fixed excessive data scanning in the late
materialization case of the parquet reader, introduced by #40641, in
scenarios with particularly high filtering rates.
github-actions bot pushed a commit that referenced this pull request Dec 30, 2024
…ate materialization‌ case of parquet reader (#46121)

github-actions bot pushed a commit that referenced this pull request Dec 30, 2024
…ate materialization‌ case of parquet reader (#46121)

Labels

approved, dev/2.1.7-merged, dev/3.0.2-merged, reviewed
