@BePPPower
Contributor

@BePPPower BePPPower commented May 20, 2024

Previously, every Block written to a Parquet file produced its own row group.
This resulted in too many row groups in a single Parquet file, and read performance on such files
is very poor.

This PR uses Arrow's `WriteRecordBatch` interface, which allows data to be cached before it is written to Parquet.
This lets us control the size of the row group cache.

Currently, the cache size is determined simply by accumulating the sizes of the blocks.
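
To illustrate the accumulation logic described above, here is a minimal, self-contained sketch. It does not use the Arrow or Doris APIs: `Block`, `RowGroupBuffer`, and the byte threshold are hypothetical names introduced only for this example, and the real writer may size its row groups differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a Doris Block: only row count and byte size matter here.
struct Block {
    int64_t rows;
    std::size_t bytes;
};

// Hypothetical model of the writer (not the Doris or Arrow API): it caches
// blocks and closes a row group once the accumulated byte size reaches a threshold.
class RowGroupBuffer {
public:
    explicit RowGroupBuffer(std::size_t max_row_group_bytes)
            : _max_bytes(max_row_group_bytes) {
        assert(max_row_group_bytes > 0);
    }

    // Cache one block; close the pending row group when the threshold is reached.
    void write(const Block& block) {
        _pending_rows += block.rows;
        _pending_bytes += block.bytes;
        if (_pending_bytes >= _max_bytes) {
            flush();
        }
    }

    // Close the pending row group, if it holds any rows. Call once more at end of file.
    void flush() {
        if (_pending_rows > 0) {
            _row_group_rows.push_back(_pending_rows);
        }
        _pending_rows = 0;
        _pending_bytes = 0;
    }

    // Row counts of the row groups closed so far.
    const std::vector<int64_t>& row_groups() const { return _row_group_rows; }

private:
    std::size_t _max_bytes;
    int64_t _pending_rows = 0;
    std::size_t _pending_bytes = 0;
    std::vector<int64_t> _row_group_rows;
};
```

With a 100-byte threshold, writing ten blocks of 10 rows and 30 bytes each and then calling `flush()` yields three row groups of 40, 40, and 20 rows, rather than one row group per block. The pending data stays in memory until the threshold is reached.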

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@BePPPower
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 35.70% (9013/25248)
Line Coverage: 27.35% (74531/272481)
Region Coverage: 26.60% (38546/144888)
Branch Coverage: 23.43% (19660/83910)
Coverage Report: http://coverage.selectdb-in.cc/coverage/05dc72253b142f26bb31e16ba2dda870dc0988cc_05dc72253b142f26bb31e16ba2dda870dc0988cc/report/index.html

@BePPPower
Contributor Author

run p0

@morningman changed the title from "[Fix](Outfile) Specify the row group size when writing data to Parquet files." to "[opt](parquet-writer) Specify the row group size when writing data to Parquet files." May 31, 2024
@BePPPower
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.28% (9231/25443)
Line Coverage: 27.63% (75693/273981)
Region Coverage: 26.85% (39197/146002)
Branch Coverage: 23.60% (19892/84290)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c01358070f2a7b1f64554534703a882c7d638434_c01358070f2a7b1f64554534703a882c7d638434/report/index.html

@doris-robot

TPC-H: Total hot run time: 41482 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c01358070f2a7b1f64554534703a882c7d638434, data reload: false

------ Round 1 ----------------------------------
q1	17740	4393	4290	4290
q2	2031	208	197	197
q3	10447	1241	1179	1179
q4	10759	857	822	822
q5	8163	2771	2687	2687
q6	232	139	143	139
q7	1011	654	638	638
q8	9291	2203	2147	2147
q9	11977	7254	6864	6864
q10	9197	3945	4008	3945
q11	457	256	249	249
q12	456	231	231	231
q13	20573	3189	3189	3189
q14	266	238	218	218
q15	534	488	480	480
q16	505	416	397	397
q17	992	733	709	709
q18	8211	7683	7746	7683
q19	3096	1521	1521	1521
q20	643	327	321	321
q21	5101	3248	4018	3248
q22	398	339	328	328
Total cold run time: 122080 ms
Total hot run time: 41482 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4588	4438	4439	4438
q2	390	277	275	275
q3	3094	2927	2876	2876
q4	1962	1675	1656	1656
q5	5349	5409	5532	5409
q6	205	124	129	124
q7	2193	1807	1780	1780
q8	3181	3371	3381	3371
q9	8547	8556	8693	8556
q10	4062	3916	3742	3742
q11	576	476	479	476
q12	775	627	633	627
q13	15903	3135	3137	3135
q14	305	300	264	264
q15	525	493	471	471
q16	477	426	426	426
q17	1799	1522	1541	1522
q18	7934	7613	7367	7367
q19	1668	1636	1579	1579
q20	1961	1781	1770	1770
q21	8119	4774	4711	4711
q22	604	516	529	516
Total cold run time: 74217 ms
Total hot run time: 55091 ms
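
The columns in the TPC-H tables above are not labeled. Judging from the numbers, each row appears to list the query, one cold run, two hot runs, and the faster of the two hot runs (all in ms); the reported totals match the sum of the cold column and the sum of the last column. A minimal sketch of that arithmetic follows, with hypothetical names (`QueryTiming`, `total_cold_ms`, `total_hot_ms`) that are not part of the benchmark scripts.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One benchmark row: a cold run followed by two hot runs, in milliseconds.
// Assumed layout, inferred from the tables; the scripts may define it differently.
struct QueryTiming {
    int64_t cold_ms;
    int64_t hot1_ms;
    int64_t hot2_ms;
};

// Sum of the cold-run column.
int64_t total_cold_ms(const std::vector<QueryTiming>& rows) {
    int64_t total = 0;
    for (const auto& row : rows) {
        total += row.cold_ms;
    }
    return total;
}

// Sum of the per-query best (minimum) hot run.
int64_t total_hot_ms(const std::vector<QueryTiming>& rows) {
    int64_t total = 0;
    for (const auto& row : rows) {
        total += std::min(row.hot1_ms, row.hot2_ms);
    }
    return total;
}
```

For the first three Round 2 rows (q1 to q3), this gives 8072 ms cold and 7589 ms hot.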

@doris-robot

TPC-DS: Total hot run time: 169074 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c01358070f2a7b1f64554534703a882c7d638434, data reload: false

query1	915	374	375	374
query2	6437	2418	2284	2284
query3	6642	201	212	201
query4	21710	17438	17201	17201
query5	4114	416	422	416
query6	247	159	148	148
query7	4589	296	286	286
query8	330	285	287	285
query9	8516	2395	2368	2368
query10	464	294	259	259
query11	10448	10099	10136	10099
query12	134	92	85	85
query13	1629	363	356	356
query14	10130	6879	7303	6879
query15	234	186	192	186
query16	7514	269	256	256
query17	1360	511	513	511
query18	1936	268	267	267
query19	202	158	155	155
query20	90	84	83	83
query21	214	129	133	129
query22	4365	4065	3932	3932
query23	33590	33121	32932	32932
query24	7125	2932	2841	2841
query25	582	358	366	358
query26	709	156	156	156
query27	2041	319	325	319
query28	3814	2047	2036	2036
query29	855	611	594	594
query30	244	154	151	151
query31	996	747	725	725
query32	86	52	56	52
query33	495	291	283	283
query34	859	489	486	486
query35	711	627	649	627
query36	1022	893	905	893
query37	102	65	65	65
query38	2908	2770	2784	2770
query39	819	791	779	779
query40	196	126	118	118
query41	51	44	50	44
query42	105	92	99	92
query43	574	547	563	547
query44	1064	722	755	722
query45	184	185	165	165
query46	1051	708	735	708
query47	1865	1761	1779	1761
query48	373	301	297	297
query49	843	380	386	380
query50	752	404	386	386
query51	6730	6696	6656	6656
query52	104	90	96	90
query53	352	298	292	292
query54	563	457	436	436
query55	82	78	74	74
query56	257	245	247	245
query57	1112	1031	1055	1031
query58	232	214	204	204
query59	3496	3162	3016	3016
query60	270	257	261	257
query61	89	89	89	89
query62	595	457	440	440
query63	319	297	294	294
query64	8421	2219	1753	1753
query65	3177	3079	3082	3079
query66	830	343	338	338
query67	15148	14769	14656	14656
query68	4575	545	535	535
query69	488	270	264	264
query70	1145	1139	1112	1112
query71	368	272	266	266
query72	7817	2706	2612	2612
query73	720	328	324	324
query74	6005	5567	5714	5567
query75	3318	2633	2608	2608
query76	2179	955	980	955
query77	402	273	280	273
query78	10142	9860	9824	9824
query79	2487	506	511	506
query80	911	473	446	446
query81	529	220	225	220
query82	799	95	95	95
query83	259	196	179	179
query84	250	92	92	92
query85	1153	329	318	318
query86	441	296	312	296
query87	3298	3123	3125	3123
query88	4050	2380	2358	2358
query89	480	387	379	379
query90	2084	192	192	192
query91	135	110	109	109
query92	70	49	52	49
query93	1781	517	497	497
query94	1116	205	199	199
query95	418	320	315	315
query96	592	265	269	265
query97	3185	3015	3063	3015
query98	250	231	214	214
query99	1180	870	872	870
Total cold run time: 260640 ms
Total hot run time: 169074 ms

@doris-robot

ClickBench: Total hot run time: 30.56 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c01358070f2a7b1f64554534703a882c7d638434, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.68	0.06	0.07
query5	0.50	0.49	0.49
query6	1.12	0.73	0.72
query7	0.02	0.01	0.01
query8	0.06	0.04	0.04
query9	0.53	0.49	0.49
query10	0.55	0.56	0.56
query11	0.15	0.12	0.11
query12	0.14	0.12	0.12
query13	0.59	0.59	0.60
query14	0.79	0.78	0.77
query15	0.82	0.81	0.82
query16	0.36	0.36	0.34
query17	0.93	0.96	0.96
query18	0.23	0.24	0.24
query19	1.76	1.66	1.65
query20	0.02	0.02	0.01
query21	15.59	0.68	0.68
query22	4.36	6.99	1.93
query23	18.33	1.26	1.23
query24	1.34	0.40	0.25
query25	0.15	0.09	0.08
query26	0.25	0.16	0.17
query27	0.08	0.09	0.08
query28	13.32	1.01	1.00
query29	13.27	3.35	3.27
query30	0.24	0.06	0.05
query31	2.87	0.38	0.38
query32	3.34	0.46	0.46
query33	2.86	2.91	2.88
query34	17.02	4.43	4.44
query35	4.46	4.54	4.65
query36	0.65	0.48	0.45
query37	0.18	0.16	0.15
query38	0.16	0.15	0.14
query39	0.05	0.04	0.04
query40	0.17	0.14	0.15
query41	0.09	0.06	0.05
query42	0.05	0.05	0.04
query43	0.04	0.03	0.04
Total cold run time: 109.47 s
Total hot run time: 30.56 s

@morningman previously approved these changes May 31, 2024
Contributor

@morningman morningman left a comment

LGTM

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) May 31, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@kaka11chen
Contributor

This solution will consume a lot of memory, so we need to discuss a better solution. @BePPPower

@kaka11chen kaka11chen self-requested a review May 31, 2024 11:08
@BePPPower
Contributor Author

run buildall

@github-actions bot removed the "approved" label (Indicates a PR has been approved by one committer.) Jun 5, 2024
@github-actions
Contributor

github-actions bot commented Jun 5, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.42% (8986/24676)
Line Coverage: 27.93% (73414/262847)
Region Coverage: 27.36% (38010/138927)
Branch Coverage: 23.95% (19289/80530)
Coverage Report: http://coverage.selectdb-in.cc/coverage/d9a2a820fe479dfc550b64203ef3fd94ce5fb4ca_d9a2a820fe479dfc550b64203ef3fd94ce5fb4ca/report/index.html

@BePPPower
Contributor Author

run performance

Contributor

@morningman morningman left a comment

LGTM

@github-actions
Contributor

github-actions bot commented Jun 6, 2024

PR approved by at least one committer and no changes requested.

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Jun 6, 2024
Contributor

@kaka11chen kaka11chen left a comment

LGTM

builder.created_by(
fmt::format("{}({})", doris::get_short_version(), parquet::DEFAULT_CREATED_BY));
_parquet_writer_properties = builder.build();
_parquet_writer_properties = builder.created_by("DorisBE")->build();
Contributor

Why is this change needed?

@doris-robot

TPC-H: Total hot run time: 41257 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d9a2a820fe479dfc550b64203ef3fd94ce5fb4ca, data reload: false

------ Round 1 ----------------------------------
q1	17619	4348	4322	4322
q2	2010	202	202	202
q3	10436	1245	1160	1160
q4	10181	857	791	791
q5	7493	2726	2744	2726
q6	219	138	142	138
q7	952	622	604	604
q8	9223	2158	2101	2101
q9	9310	6778	6801	6778
q10	9293	3944	3892	3892
q11	447	248	242	242
q12	480	236	241	236
q13	17207	3187	3225	3187
q14	273	240	230	230
q15	506	466	465	465
q16	466	403	387	387
q17	1012	666	789	666
q18	8343	7881	7821	7821
q19	3842	1460	1237	1237
q20	650	331	323	323
q21	5180	3416	3966	3416
q22	393	339	333	333
Total cold run time: 115535 ms
Total hot run time: 41257 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4618	4397	4393	4393
q2	374	271	267	267
q3	3127	2969	2936	2936
q4	1943	1623	1630	1623
q5	5439	5529	5481	5481
q6	215	126	126	126
q7	2240	1825	1830	1825
q8	3246	3406	3348	3348
q9	8529	8612	8676	8612
q10	4067	3876	3657	3657
q11	580	485	477	477
q12	793	633	685	633
q13	16087	3118	3144	3118
q14	310	277	271	271
q15	519	476	485	476
q16	480	422	431	422
q17	1808	1479	1531	1479
q18	8113	7504	7314	7314
q19	1760	1512	1600	1512
q20	3009	1804	1765	1765
q21	8242	4850	4601	4601
q22	626	528	579	528
Total cold run time: 76125 ms
Total hot run time: 54864 ms

@doris-robot

TPC-DS: Total hot run time: 172761 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d9a2a820fe479dfc550b64203ef3fd94ce5fb4ca, data reload: false

query1	914	392	368	368
query2	6483	2317	2495	2317
query3	6642	210	207	207
query4	22667	17514	17200	17200
query5	4116	440	454	440
query6	250	161	151	151
query7	4581	292	285	285
query8	309	281	272	272
query9	8510	2453	2427	2427
query10	440	290	280	280
query11	10611	9964	10049	9964
query12	131	85	85	85
query13	1634	358	354	354
query14	10120	7656	6859	6859
query15	220	186	183	183
query16	7783	270	258	258
query17	1805	516	531	516
query18	1944	276	278	276
query19	202	151	157	151
query20	96	84	82	82
query21	219	130	128	128
query22	4255	4035	4068	4035
query23	33786	33005	33040	33005
query24	11253	2924	2845	2845
query25	615	352	353	352
query26	740	153	153	153
query27	2321	322	318	318
query28	6329	2051	2046	2046
query29	859	616	604	604
query30	274	152	146	146
query31	944	764	733	733
query32	91	51	54	51
query33	759	298	273	273
query34	958	474	468	468
query35	740	645	633	633
query36	1138	923	935	923
query37	137	71	67	67
query38	2891	2775	2748	2748
query39	841	783	804	783
query40	209	120	122	120
query41	53	52	52	52
query42	122	97	97	97
query43	558	565	553	553
query44	1171	729	763	729
query45	193	165	171	165
query46	1080	735	722	722
query47	1889	1806	1828	1806
query48	367	291	297	291
query49	1092	440	414	414
query50	764	386	387	386
query51	6796	6583	6648	6583
query52	98	98	88	88
query53	354	294	284	284
query54	849	470	443	443
query55	77	74	73	73
query56	277	256	248	248
query57	1115	1071	1076	1071
query58	261	243	243	243
query59	3402	3209	3187	3187
query60	296	304	268	268
query61	90	116	86	86
query62	639	445	450	445
query63	324	288	297	288
query64	8839	2192	1750	1750
query65	3202	3095	3124	3095
query66	821	345	336	336
query67	15396	14808	14964	14808
query68	4481	541	520	520
query69	471	307	300	300
query70	1192	1117	1062	1062
query71	386	272	272	272
query72	7198	5907	5881	5881
query73	734	324	323	323
query74	5976	5496	5451	5451
query75	3370	2628	2625	2625
query76	2633	941	938	938
query77	449	306	303	303
query78	10605	9963	9753	9753
query79	2602	538	534	534
query80	2283	478	479	478
query81	548	227	224	224
query82	969	104	102	102
query83	280	172	169	169
query84	271	93	87	87
query85	1997	305	323	305
query86	496	314	322	314
query87	3312	3102	3102	3102
query88	4336	2368	2353	2353
query89	487	393	383	383
query90	1845	189	195	189
query91	139	105	109	105
query92	59	50	53	50
query93	2859	510	503	503
query94	1232	195	197	195
query95	416	332	327	327
query96	596	272	268	268
query97	3201	3002	3051	3002
query98	222	203	198	198
query99	1305	828	858	828
Total cold run time: 275892 ms
Total hot run time: 172761 ms

@doris-robot

ClickBench: Total hot run time: 30.36 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d9a2a820fe479dfc550b64203ef3fd94ce5fb4ca, data reload: false

query1	0.04	0.03	0.04
query2	0.09	0.04	0.04
query3	0.23	0.06	0.05
query4	1.70	0.07	0.08
query5	0.50	0.49	0.49
query6	1.13	0.74	0.73
query7	0.02	0.02	0.01
query8	0.04	0.04	0.05
query9	0.54	0.48	0.49
query10	0.54	0.56	0.54
query11	0.16	0.11	0.11
query12	0.14	0.12	0.12
query13	0.59	0.59	0.59
query14	0.78	0.78	0.79
query15	0.83	0.81	0.82
query16	0.36	0.36	0.35
query17	0.94	0.98	1.00
query18	0.21	0.24	0.25
query19	1.89	1.73	1.68
query20	0.02	0.01	0.01
query21	15.50	0.66	0.64
query22	3.85	8.02	1.90
query23	18.24	1.36	1.22
query24	2.15	0.22	0.20
query25	0.15	0.09	0.08
query26	0.27	0.16	0.17
query27	0.08	0.08	0.08
query28	13.18	1.02	0.98
query29	13.44	3.34	3.23
query30	0.24	0.05	0.05
query31	2.88	0.38	0.39
query32	3.27	0.47	0.47
query33	2.89	2.95	2.92
query34	16.90	4.45	4.41
query35	4.46	4.47	4.60
query36	0.70	0.47	0.47
query37	0.17	0.15	0.16
query38	0.14	0.15	0.13
query39	0.04	0.04	0.03
query40	0.17	0.13	0.15
query41	0.09	0.04	0.04
query42	0.06	0.04	0.04
query43	0.04	0.04	0.03
Total cold run time: 109.66 s
Total hot run time: 30.36 s

@morningman morningman merged commit 844adc6 into apache:master Jun 7, 2024
morningman pushed a commit to morningman/doris that referenced this pull request Jun 7, 2024
… Parquet files. (apache#35081)

Previously, every Block written to a Parquet file produced its own row
group.

Using Arrow's `WriteRecordBatch` interface allows data to be cached
before it is written to Parquet, so we can control the size of the row
group cache.

Currently, the cache size is determined simply by accumulating the
sizes of the blocks.
morningman added a commit that referenced this pull request Jun 7, 2024
… Parquet files. (#35081) (#36042)

bp #35081

Co-authored-by: Tiewei Fang <43782773+BePPPower@users.noreply.github.com>
dataroaring pushed a commit that referenced this pull request Jun 13, 2024
… Parquet files. (#35081)

Labels

approved (Indicates a PR has been approved by one committer.), dev/2.1.4-merged, dev/3.0.0-merged, reviewed

5 participants