[opt](split) get file splits in batch mode #34032

AshinGau · 2024-04-23T10:26:31Z

Proposed changes

When scanning a table with many files, it takes a long time to transfer the splits to the backends (about 20s for the 1209172 splits in the plan below).

|   0:VHIVE_SCAN_NODE(71)                                            |
|      table: level3partition                                        |
|      inputSplitNum=1209172, totalFileSize=6527616577, scanRanges=3 |
|      partition=60591/60591                                         |
|      cardinality=1, numNodes=3                                     |
|      pushdown agg=NONE                                             |
|      limit: 1                                                      |
+--------------------------------------------------------------------+

Therefore, this PR fetches the file splits in batch mode, so that BE can start scanning while the remaining splits are still being fetched.
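
To make the idea concrete, here is a minimal sketch of batch-wise split fetching. It is not the code from this PR: `BatchedSplitSource`, `nextBatch`, and `BatchedScanDemo` are hypothetical names, and the "scan" is only a counter standing in for real work. It assumes an empty batch signals that the source is exhausted.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; this is not the SplitSource class from the PR.
final class BatchedSplitSource<T> {
    private final Iterator<T> splits;

    BatchedSplitSource(Iterator<T> splits) {
        this.splits = splits;
    }

    // Returns up to maxBatchSize splits; an empty list means the source is exhausted.
    synchronized List<T> nextBatch(int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<T> batch = new ArrayList<>();
        while (batch.size() < maxBatchSize && splits.hasNext()) {
            batch.add(splits.next());
        }
        return batch;
    }
}

final class BatchedScanDemo {
    // Simulates a backend that processes each batch as soon as it arrives
    // instead of waiting for the full split list to be transferred.
    static int scanAll(BatchedSplitSource<String> source, int batchSize) {
        int scanned = 0;
        while (true) {
            List<String> batch = source.nextBatch(batchSize);
            if (batch.isEmpty()) {
                break; // no more splits
            }
            scanned += batch.size(); // stand-in for real scanning work
        }
        return scanned;
    }
}
```

With this shape, the cost of producing and transferring splits is spread across the scan instead of being paid entirely before the first row is read.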

doris-robot · 2024-04-23T10:26:36Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

github-actions · 2024-04-23T10:33:05Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2024-05-07T07:27:17Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2024-05-07T07:39:16Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2024-05-07T07:41:16Z

clang-tidy review says "All clean, LGTM! 👍"

AshinGau · 2024-05-07T07:43:04Z

run buildall

github-actions · 2024-05-07T07:49:26Z

clang-tidy review says "All clean, LGTM! 👍"

AshinGau · 2024-05-07T12:08:23Z

run buildall

github-actions · 2024-05-07T12:13:30Z

clang-tidy review says "All clean, LGTM! 👍"

AshinGau · 2024-05-08T01:08:40Z

run buildall

github-actions · 2024-05-08T01:14:15Z

clang-tidy review says "All clean, LGTM! 👍"

doris-robot · 2024-05-08T02:49:06Z

TeamCity be ut coverage result:
Function Coverage: 35.69% (8985/25178)
Line Coverage: 27.35% (74208/271362)
Region Coverage: 26.60% (38373/144259)
Branch Coverage: 23.40% (19565/83610)
Coverage Report: http://coverage.selectdb-in.cc/coverage/36185172b418fbd82887a7dfa972ff441cc57cf2_36185172b418fbd82887a7dfa972ff441cc57cf2/report/index.html

Jibing-Li

LGTM

github-actions · 2024-05-08T09:40:43Z

PR approved by at least one committer and no changes requested.

github-actions · 2024-05-08T09:40:46Z

PR approved by anyone and no changes requested.

morningman · 2024-05-08T09:49:43Z

fe/fe-core/src/main/java/org/apache/doris/datasource/FileQueryScanNode.java

+                SplitSource splitSource = new SplitSource(
+                        this::splitToScanRange, backend, locationProperties, splits, pathPartitionKeys);
+                splitSources.add(splitSource);
+                SplitSourceManager.registerSplitSource(splitSource);


Do not use a singleton.

morningman · 2024-05-08T09:57:20Z

fe/fe-core/src/main/java/org/apache/doris/datasource/SplitSourceManager.java

+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+public class SplitSourceManager {


Suggestions:

1. Do not use a singleton.
2. Extend the MasterDaemon class.
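
One possible reading of this suggestion, sketched under assumptions: the registry becomes an ordinary object owned by whoever creates it, and a periodic background task calls its cleanup method. `SplitSourceRegistry`, its TTL parameter, and `removeExpired` are hypothetical names for illustration; the sketch deliberately does not use the real MasterDaemon API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an instance-scoped registry instead of static singleton state.
final class SplitSourceRegistry<S> {
    private static final class Entry<S> {
        final S source;
        final long registeredAtMs;

        Entry(S source, long registeredAtMs) {
            this.source = source;
            this.registeredAtMs = registeredAtMs;
        }
    }

    private final AtomicLong nextId = new AtomicLong(0);
    private final Map<Long, Entry<S>> sources = new ConcurrentHashMap<>();
    private final long ttlMs;

    SplitSourceRegistry(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    // Registers a source and returns the id that a fetch request would carry.
    long register(S source, long nowMs) {
        long id = nextId.incrementAndGet();
        sources.put(id, new Entry<>(source, nowMs));
        return id;
    }

    // Returns the source for the id, or null if it is unknown or already removed.
    S get(long id) {
        Entry<S> entry = sources.get(id);
        return entry == null ? null : entry.source;
    }

    void remove(long id) {
        sources.remove(id);
    }

    // The periodic cleanup that a background daemon would run on each tick.
    int removeExpired(long nowMs) {
        int removed = 0;
        for (Map.Entry<Long, Entry<S>> e : sources.entrySet()) {
            boolean expired = nowMs - e.getValue().registeredAtMs > ttlMs;
            if (expired && sources.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return sources.size();
    }
}
```

Because the state lives in an instance, two registries never share entries, which also makes the class straightforward to unit test.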

morningman · 2024-05-08T10:00:25Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

        return QeProcessorImpl.INSTANCE.reportExecStatus(params, getClientAddr());
    }

+    public TFetchSplitBatchResult fetchSplitBatch(TFetchSplitBatchRequest request) throws TException {


Missing @Override

morningman · 2024-05-08T10:09:55Z

be/src/vec/exec/scan/split_source_connector.cpp

+            LOG(WARNING) << "Failed to get batch of split source: {}, try to reopen" << e1.what();
+            RETURN_IF_ERROR(coord.reopen());
+            try {
+                coord->fetchSplitBatch(result, request);


Maybe we should not retry on failure.
If the first call fails, the second is highly likely to fail too.
Simply fail the query to avoid an avalanche.

morningman · 2024-05-08T10:11:24Z

fe/fe-common/src/main/java/org/apache/doris/common/Config.java

+    @ConfField(mutable = true, masterOnly = false, description = {
+            "如果切片数量超过阈值，BE将通过batch方式获取scan ranges",
+            "If the number of splits exceeds the threshold, scan ranges will be got through batch mode."})
+    public static int num_splits_in_batch_mode = 10000;


Would this be better as a session variable?
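
A minimal sketch of what a session-level override could look like, assuming the session value simply takes precedence over the global default when it is set. `BatchModePolicy`, `useBatchMode`, and the `sessionThreshold` parameter are hypothetical; only the default of 10000 and the "exceeds the threshold" rule come from the diff above.

```java
// Hypothetical sketch; the class, method, and parameter names are illustrative only.
final class BatchModePolicy {
    // Mirrors the default of num_splits_in_batch_mode in the quoted diff.
    static final int DEFAULT_NUM_SPLITS_IN_BATCH_MODE = 10000;

    // A session-level threshold, when present, takes precedence over the global default.
    // Batch mode is chosen only when the split count strictly exceeds the threshold.
    static boolean useBatchMode(int numSplits, Integer sessionThreshold) {
        int threshold = sessionThreshold != null
                ? sessionThreshold
                : DEFAULT_NUM_SPLITS_IN_BATCH_MODE;
        return numSplits > threshold;
    }
}
```

Under these assumptions, a user could tune the threshold for a single query without changing the behavior of other sessions.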

github-actions · 2024-05-13T14:16:20Z

clang-tidy review says "All clean, LGTM! 👍"

AshinGau · 2024-05-13T14:50:33Z

run buildall

doris-robot · 2024-05-13T16:20:58Z

TPC-H: Total hot run time: 41933 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f9d87914102858d10220d80bd03afdae5deeb48a, data reload: false

------ Round 1 ----------------------------------
q1	17615	4481	4291	4291
q2	2027	187	190	187
q3	10475	1291	1152	1152
q4	10189	870	829	829
q5	7475	2764	2816	2764
q6	225	136	133	133
q7	1044	611	619	611
q8	9252	2214	2138	2138
q9	9403	6834	6695	6695
q10	9409	4034	3931	3931
q11	457	239	250	239
q12	440	242	234	234
q13	18146	3093	3284	3093
q14	255	208	213	208
q15	500	468	478	468
q16	471	407	399	399
q17	1001	725	684	684
q18	8385	7819	7700	7700
q19	6304	1611	1544	1544
q20	632	322	326	322
q21	5253	4036	4154	4036
q22	353	289	275	275
Total cold run time: 119311 ms
Total hot run time: 41933 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4540	4423	4401	4401
q2	384	272	263	263
q3	3140	2959	2882	2882
q4	1906	1614	1617	1614
q5	5486	5510	5505	5505
q6	218	125	125	125
q7	2370	1972	2024	1972
q8	3297	3426	3447	3426
q9	8624	8707	8683	8683
q10	3976	3841	3894	3841
q11	609	501	494	494
q12	802	644	614	614
q13	17060	3192	3268	3192
q14	299	276	288	276
q15	516	457	477	457
q16	468	414	418	414
q17	1786	1493	1474	1474
q18	7782	7567	7584	7567
q19	1679	1566	1570	1566
q20	1971	1764	1744	1744
q21	11312	4893	4740	4740
q22	568	491	479	479
Total cold run time: 78793 ms
Total hot run time: 55729 ms

doris-robot · 2024-05-13T16:32:21Z

TPC-DS: Total hot run time: 187732 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f9d87914102858d10220d80bd03afdae5deeb48a, data reload: false

query1	912	370	356	356
query2	6466	2472	2258	2258
query3	6647	217	223	217
query4	23165	21295	21288	21288
query5	4172	422	455	422
query6	258	175	173	173
query7	4582	304	292	292
query8	243	190	190	190
query9	8385	2355	2333	2333
query10	437	256	261	256
query11	14975	14184	14163	14163
query12	140	95	87	87
query13	1649	380	377	377
query14	9894	8550	7706	7706
query15	216	177	177	177
query16	7853	282	268	268
query17	1705	585	565	565
query18	1994	282	278	278
query19	205	156	155	155
query20	95	89	86	86
query21	209	132	131	131
query22	5028	4837	4860	4837
query23	34049	33619	33664	33619
query24	6663	3001	2987	2987
query25	555	421	354	354
query26	698	158	154	154
query27	1896	318	318	318
query28	3824	2025	2031	2025
query29	843	614	608	608
query30	235	156	156	156
query31	983	767	769	767
query32	95	52	54	52
query33	503	263	248	248
query34	898	474	495	474
query35	772	681	669	669
query36	1038	949	905	905
query37	103	67	69	67
query38	2898	2758	2732	2732
query39	1599	1592	1598	1592
query40	197	128	126	126
query41	42	37	39	37
query42	101	95	97	95
query43	612	550	580	550
query44	1078	733	755	733
query45	270	255	247	247
query46	1062	713	702	702
query47	1973	1866	1891	1866
query48	381	322	296	296
query49	777	400	405	400
query50	776	382	393	382
query51	6756	6605	6603	6603
query52	101	100	89	89
query53	354	281	289	281
query54	529	432	429	429
query55	75	72	74	72
query56	241	221	227	221
query57	1234	1148	1154	1148
query58	222	201	201	201
query59	3558	3338	3272	3272
query60	263	237	261	237
query61	92	88	90	88
query62	563	459	487	459
query63	308	289	284	284
query64	8377	7394	7400	7394
query65	3123	3088	3140	3088
query66	797	352	333	333
query67	15452	14992	15047	14992
query68	4725	533	531	531
query69	476	303	309	303
query70	1239	1143	1154	1143
query71	384	282	276	276
query72	7367	2577	2331	2331
query73	693	328	324	324
query74	6467	6139	6062	6062
query75	3306	2688	2640	2640
query76	2290	1029	952	952
query77	416	271	275	271
query78	10608	10333	10237	10237
query79	2578	517	514	514
query80	1099	499	444	444
query81	521	225	220	220
query82	723	93	97	93
query83	238	174	172	172
query84	248	92	85	85
query85	1269	278	268	268
query86	509	323	284	284
query87	3275	3097	3118	3097
query88	4405	2421	2431	2421
query89	485	392	402	392
query90	2036	192	195	192
query91	125	98	102	98
query92	63	52	49	49
query93	1953	510	500	500
query94	1149	184	186	184
query95	403	309	309	309
query96	604	281	268	268
query97	3227	3007	2988	2988
query98	239	222	220	220
query99	1238	897	907	897
Total cold run time: 270554 ms
Total hot run time: 187732 ms

doris-robot · 2024-05-14T02:06:05Z

TeamCity be ut coverage result:
Function Coverage: 35.66% (8990/25208)
Line Coverage: 27.32% (74295/271957)
Region Coverage: 26.55% (38402/144616)
Branch Coverage: 23.37% (19579/83794)
Coverage Report: http://coverage.selectdb-in.cc/coverage/f9d87914102858d10220d80bd03afdae5deeb48a_f9d87914102858d10220d80bd03afdae5deeb48a/report/index.html

morningman

LGTM

github-actions · 2024-05-14T04:01:50Z

PR approved by at least one committer and no changes requested.

kaka11chen

LGTM

PR #34032 introduced a new method to fetch splits batch by batch, but it removed the logic by which BE merges scan ranges to avoid scheduling too many of them. This PR mainly: 1. Adds the scan range merging logic back. 2. Changes the default file split size from 8MB to 64MB, to avoid too many small splits.
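
As a rough illustration of scan range merging, the sketch below groups many ranges into at most a fixed number of groups so that fewer units need to be scheduled. It is an assumption-laden example, not the BE logic: `ScanRangeMerger`, `merge`, and the round-robin assignment are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real merging logic lives in BE and may differ.
final class ScanRangeMerger {
    // Distributes ranges round-robin into at most maxGroups groups,
    // so the number of scheduled units is bounded regardless of the input size.
    static <T> List<List<T>> merge(List<T> ranges, int maxGroups) {
        if (maxGroups <= 0) {
            throw new IllegalArgumentException("maxGroups must be positive");
        }
        int groupCount = Math.min(maxGroups, ranges.size());
        List<List<T>> groups = new ArrayList<>(groupCount);
        for (int i = 0; i < groupCount; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < ranges.size(); i++) {
            groups.get(i % groupCount).add(ranges.get(i));
        }
        return groups;
    }
}
```

Round-robin keeps group sizes within one element of each other; a real implementation might instead balance by file size or data locality.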

github-actions bot added the meta-change label Apr 23, 2024

morningman reviewed Apr 25, 2024

be/src/common/config.cpp (resolved)

be/src/vec/exec/scan/split_source_connector.cpp (outdated, resolved)

fe/fe-common/src/main/java/org/apache/doris/common/Config.java (outdated, resolved)

AshinGau force-pushed the split_source branch from dd7e46f to 432e69d Compare May 7, 2024 07:20

AshinGau force-pushed the split_source branch 2 times, most recently from 9e436b6 to 380cea0 Compare May 7, 2024 07:34

AshinGau marked this pull request as ready for review May 7, 2024 07:34

AshinGau force-pushed the split_source branch from 380cea0 to 700a3bc Compare May 7, 2024 07:42

AshinGau force-pushed the split_source branch from 700a3bc to 49c8d72 Compare May 7, 2024 12:06

AshinGau force-pushed the split_source branch from 49c8d72 to 3618517 Compare May 8, 2024 01:08

Jibing-Li previously approved these changes May 8, 2024

github-actions bot added the approved and reviewed labels May 8, 2024

morningman reviewed May 8, 2024

AshinGau dismissed Jibing-Li’s stale review via d1ce88f May 8, 2024 14:35

AshinGau force-pushed the split_source branch from 3618517 to d1ce88f Compare May 8, 2024 14:35

morningman approved these changes May 14, 2024

github-actions bot added the approved label May 14, 2024

morningman added the dev/3.0.x label May 14, 2024

kaka11chen approved these changes May 14, 2024

AshinGau merged commit cc457a2 into apache:master May 14, 2024

gavinchou added dev/3.0.0-merged and removed dev/3.0.x labels May 16, 2024

yiguolei added the 2.1.0-conflict label May 18, 2024

AshinGau mentioned this pull request May 21, 2024

[opt](split) get file splits in batch mode #35107

Merged

morningman pushed a commit that referenced this pull request May 21, 2024

[opt](split) get file splits in batch mode (#34032) (#35107)

98f8eb5

bp #34032

morningman added dev/2.1.4-merged and removed dev/2.1.x 2.1.0-conflict labels May 21, 2024

AshinGau mentioned this pull request Jun 11, 2024

[opt](split) close the batch mode of file split in default #36108

Closed

morningman mentioned this pull request Jul 30, 2024

[opt](catalog) merge scan range to avoid too many splits #38311

Merged
