@wsjz
Contributor

@wsjz wsjz commented Apr 15, 2024

Proposed changes

Issue Number: #31442

Hive 3 supports creating tables with column default values.
When using Hive 3, default values can be written to the table.
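
As a rough illustration of what a column default looks like in Hive 3 DDL, the sketch below renders a `CREATE TABLE` statement with `DEFAULT` clauses. The helper `render_create_table` and the table and column names are hypothetical, and this is not the code path Doris uses.

```python
from typing import Optional, Sequence, Tuple

# (column name, column type, default literal or None); all names are illustrative.
Column = Tuple[str, str, Optional[str]]


def render_create_table(table: str, columns: Sequence[Column]) -> str:
    """Render a minimal CREATE TABLE statement, adding DEFAULT where one is given."""
    parts = []
    for name, col_type, default in columns:
        clause = f"`{name}` {col_type}"
        if default is not None:
            # Hive 3 accepts a DEFAULT constraint on a column definition.
            clause += f" DEFAULT {default}"
        parts.append(clause)
    return f"CREATE TABLE `{table}` (" + ", ".join(parts) + ")"


ddl = render_create_table(
    "orders", [("id", "INT", None), ("status", "STRING", "'new'")]
)
print(ddl)
```

Columns without a default are rendered unchanged, so the same helper would also cover Hive versions that do not accept the `DEFAULT` clause.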

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@wsjz
Contributor Author

wsjz commented Apr 15, 2024

run buildall

@wsjz
Contributor Author

wsjz commented Apr 17, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38403 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 971cbc38a61ec7f0cc4fab9869cfbe4ab9943db2, data reload: false

------ Round 1 ----------------------------------
q1	17914	4446	4348	4348
q2	2581	204	194	194
q3	11063	1118	1159	1118
q4	10185	736	841	736
q5	7499	2662	2635	2635
q6	215	134	137	134
q7	1007	602	588	588
q8	9223	2077	2034	2034
q9	7726	6574	6508	6508
q10	8511	3555	3502	3502
q11	451	229	225	225
q12	454	220	210	210
q13	17771	2958	3048	2958
q14	281	220	227	220
q15	515	480	472	472
q16	523	393	382	382
q17	969	648	676	648
q18	7487	6825	6721	6721
q19	1607	1482	1543	1482
q20	659	311	296	296
q21	3584	2697	2835	2697
q22	353	295	300	295
Total cold run time: 110578 ms
Total hot run time: 38403 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4230	4159	4216	4159
q2	375	267	267	267
q3	2965	2686	2748	2686
q4	1876	1582	1600	1582
q5	5288	5281	5325	5281
q6	208	124	128	124
q7	2229	1869	1868	1868
q8	3229	3347	3329	3329
q9	8507	8506	8572	8506
q10	3883	3773	3720	3720
q11	575	493	482	482
q12	708	592	608	592
q13	16476	2930	2919	2919
q14	305	270	279	270
q15	509	479	475	475
q16	489	420	425	420
q17	1788	1497	1458	1458
q18	7417	7407	7390	7390
q19	1642	1550	1560	1550
q20	1979	1772	1726	1726
q21	4883	4669	4791	4669
q22	522	459	434	434
Total cold run time: 70083 ms
Total hot run time: 53907 ms

@doris-robot

TPC-DS: Total hot run time: 183843 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 971cbc38a61ec7f0cc4fab9869cfbe4ab9943db2, data reload: false

query1	902	372	372	372
query2	6902	2415	2489	2415
query3	6657	210	208	208
query4	22873	21265	21348	21265
query5	4132	410	413	410
query6	264	174	168	168
query7	4579	291	284	284
query8	230	174	174	174
query9	8500	2364	2355	2355
query10	580	247	263	247
query11	14767	14121	14142	14121
query12	140	88	88	88
query13	1644	372	368	368
query14	10298	7894	7764	7764
query15	261	184	195	184
query16	8196	267	259	259
query17	1922	574	552	552
query18	2106	284	273	273
query19	324	147	151	147
query20	95	83	84	83
query21	195	136	126	126
query22	5022	4857	4774	4774
query23	33956	33228	32892	32892
query24	11641	3032	2896	2896
query25	636	393	389	389
query26	1767	149	149	149
query27	3049	314	311	311
query28	7654	2030	2012	2012
query29	1021	595	598	595
query30	303	169	170	169
query31	975	717	719	717
query32	95	51	52	51
query33	739	243	254	243
query34	1112	477	481	477
query35	834	700	699	699
query36	1050	909	924	909
query37	278	70	70	70
query38	3364	3188	3197	3188
query39	1589	1515	1525	1515
query40	273	125	128	125
query41	46	43	42	42
query42	101	94	96	94
query43	599	523	541	523
query44	1173	736	717	717
query45	279	289	258	258
query46	1072	724	744	724
query47	1896	1861	1837	1837
query48	356	301	296	296
query49	1181	367	382	367
query50	747	387	394	387
query51	6724	6585	6570	6570
query52	98	90	90	90
query53	348	278	286	278
query54	308	240	235	235
query55	77	74	74	74
query56	245	220	216	216
query57	1210	1113	1112	1112
query58	219	189	194	189
query59	3462	3370	3139	3139
query60	250	235	229	229
query61	91	121	88	88
query62	638	441	445	441
query63	308	274	284	274
query64	6200	3931	3398	3398
query65	3104	3019	2994	2994
query66	1376	348	333	333
query67	15266	14812	14993	14812
query68	5198	535	548	535
query69	483	307	308	307
query70	1217	1160	1179	1160
query71	1402	1268	1265	1265
query72	6678	2589	2423	2423
query73	706	320	317	317
query74	6805	6367	6347	6347
query75	3511	2636	2618	2618
query76	3475	932	967	932
query77	398	268	268	268
query78	10822	10225	10267	10225
query79	3727	521	522	521
query80	1951	423	428	423
query81	529	240	247	240
query82	1501	96	93	93
query83	347	166	168	166
query84	268	86	82	82
query85	1615	273	327	273
query86	468	288	307	288
query87	3468	3290	3312	3290
query88	4739	2413	2411	2411
query89	491	367	378	367
query90	2031	183	178	178
query91	121	97	102	97
query92	62	46	48	46
query93	4785	515	508	508
query94	1226	182	178	178
query95	384	300	291	291
query96	599	266	267	266
query97	3139	2901	2919	2901
query98	233	226	209	209
query99	1276	865	857	857
Total cold run time: 291954 ms
Total hot run time: 183843 ms

@doris-robot

ClickBench: Total hot run time: 30.08 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 971cbc38a61ec7f0cc4fab9869cfbe4ab9943db2, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.03
query3	0.22	0.05	0.05
query4	1.68	0.07	0.08
query5	0.50	0.50	0.50
query6	1.48	0.73	0.73
query7	0.02	0.02	0.02
query8	0.05	0.04	0.04
query9	0.54	0.50	0.50
query10	0.56	0.55	0.57
query11	0.15	0.11	0.11
query12	0.14	0.12	0.12
query13	0.60	0.59	0.58
query14	0.75	0.78	0.76
query15	0.81	0.80	0.80
query16	0.36	0.35	0.37
query17	0.94	0.97	0.97
query18	0.22	0.24	0.24
query19	1.84	1.66	1.69
query20	0.01	0.01	0.01
query21	15.81	0.67	0.66
query22	5.10	7.22	1.68
query23	18.27	1.37	1.19
query24	1.42	0.31	0.30
query25	0.14	0.08	0.08
query26	0.26	0.17	0.17
query27	0.08	0.07	0.08
query28	13.26	1.00	0.97
query29	12.56	3.27	3.24
query30	0.26	0.07	0.05
query31	2.87	0.38	0.39
query32	3.29	0.47	0.47
query33	2.79	2.84	2.82
query34	17.24	4.35	4.42
query35	4.48	4.49	4.53
query36	0.66	0.45	0.46
query37	0.18	0.15	0.16
query38	0.15	0.15	0.14
query39	0.05	0.04	0.04
query40	0.18	0.13	0.13
query41	0.10	0.05	0.05
query42	0.05	0.05	0.05
query43	0.03	0.03	0.04
Total cold run time: 110.22 s
Total hot run time: 30.08 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 971cbc38a61ec7f0cc4fab9869cfbe4ab9943db2 with default session variables
Stream load json:         18 seconds loaded 2358488459 Bytes, about 124 MB/s
Stream load orc:          58 seconds loaded 1101869774 Bytes, about 18 MB/s
Stream load parquet:      32 seconds loaded 861443392 Bytes, about 25 MB/s
Insert into select:       13.4 seconds inserted 10000000 Rows, about 746K ops/s

@wsjz
Contributor Author

wsjz commented Apr 17, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38247 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4def3293f1d30157ccdc57f2b1837e1ea731ab50, data reload: false

------ Round 1 ----------------------------------
q1	17616	4250	4228	4228
q2	2006	192	198	192
q3	10446	1103	1135	1103
q4	10201	773	727	727
q5	7507	2667	2662	2662
q6	216	134	131	131
q7	1020	598	585	585
q8	9228	2068	2011	2011
q9	7208	6529	6478	6478
q10	8581	3504	3479	3479
q11	460	238	233	233
q12	522	213	211	211
q13	18875	2911	2924	2911
q14	256	219	237	219
q15	515	482	485	482
q16	518	373	369	369
q17	951	700	676	676
q18	7257	6747	6690	6690
q19	3855	1509	1467	1467
q20	651	313	295	295
q21	3459	2806	2846	2806
q22	352	292	299	292
Total cold run time: 111700 ms
Total hot run time: 38247 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4355	4200	4194	4194
q2	367	261	272	261
q3	2984	2764	2688	2688
q4	1878	1606	1574	1574
q5	5335	5324	5258	5258
q6	209	121	121	121
q7	2215	1824	1844	1824
q8	3207	3317	3319	3317
q9	8555	8566	8489	8489
q10	4093	3840	3825	3825
q11	640	508	511	508
q12	786	593	614	593
q13	17427	3378	3194	3194
q14	314	279	273	273
q15	506	482	490	482
q16	502	435	434	434
q17	1816	1520	1513	1513
q18	8113	7897	7917	7897
q19	1677	1507	1525	1507
q20	2024	1852	1817	1817
q21	5056	4964	4904	4904
q22	542	455	476	455
Total cold run time: 72601 ms
Total hot run time: 55128 ms

@doris-robot

TPC-DS: Total hot run time: 183374 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4def3293f1d30157ccdc57f2b1837e1ea731ab50, data reload: false

query1	880	366	358	358
query2	6633	2699	2325	2325
query3	6799	198	194	194
query4	23222	21271	21333	21271
query5	4121	405	418	405
query6	266	184	167	167
query7	4598	296	292	292
query8	223	186	173	173
query9	8395	2359	2339	2339
query10	432	256	243	243
query11	14799	14425	14160	14160
query12	134	93	85	85
query13	1634	367	381	367
query14	8768	7679	7212	7212
query15	248	178	176	176
query16	8135	259	263	259
query17	1935	552	547	547
query18	2099	279	274	274
query19	213	140	147	140
query20	92	84	83	83
query21	197	128	128	128
query22	5145	4928	4826	4826
query23	33650	33087	33200	33087
query24	8562	3025	3011	3011
query25	577	378	381	378
query26	680	158	148	148
query27	2347	358	372	358
query28	5986	2084	2057	2057
query29	861	631	623	623
query30	324	179	182	179
query31	993	751	774	751
query32	97	55	54	54
query33	591	245	240	240
query34	963	496	499	496
query35	862	710	709	709
query36	1047	923	908	908
query37	110	75	77	75
query38	3545	3442	3374	3374
query39	1060	1027	1022	1022
query40	166	129	136	129
query41	45	44	42	42
query42	104	95	98	95
query43	567	575	559	559
query44	1111	722	764	722
query45	292	289	260	260
query46	1101	726	730	726
query47	2030	1966	1946	1946
query48	362	303	303	303
query49	835	384	375	375
query50	778	404	385	385
query51	6800	6774	6679	6679
query52	104	90	86	86
query53	353	286	279	279
query54	271	232	239	232
query55	75	76	80	76
query56	254	237	234	234
query57	1293	1186	1216	1186
query58	219	204	230	204
query59	3810	3370	3300	3300
query60	255	227	227	227
query61	102	109	102	102
query62	571	451	438	438
query63	301	276	280	276
query64	4456	3978	3052	3052
query65	3054	3012	3025	3012
query66	745	320	322	320
query67	15458	14911	14763	14763
query68	5214	528	521	521
query69	523	295	290	290
query70	1181	1147	1138	1138
query71	1372	1259	1260	1259
query72	6534	2663	2456	2456
query73	697	316	318	316
query74	6814	6315	6426	6315
query75	3308	2596	2628	2596
query76	3269	946	940	940
query77	556	255	263	255
query78	10850	10127	10189	10127
query79	6421	505	513	505
query80	1820	420	425	420
query81	520	239	239	239
query82	1132	98	96	96
query83	334	167	166	166
query84	268	89	79	79
query85	1344	343	263	263
query86	457	304	304	304
query87	3451	3279	3263	3263
query88	5079	2392	2405	2392
query89	471	364	363	363
query90	1896	176	178	176
query91	121	94	97	94
query92	54	47	46	46
query93	5773	507	503	503
query94	1107	177	174	174
query95	383	290	295	290
query96	608	264	261	261
query97	3140	2936	2959	2936
query98	227	221	234	221
query99	1229	879	868	868
Total cold run time: 282888 ms
Total hot run time: 183374 ms

@doris-robot

ClickBench: Total hot run time: 29.96 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4def3293f1d30157ccdc57f2b1837e1ea731ab50, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.05
query3	0.23	0.06	0.06
query4	1.66	0.07	0.09
query5	0.49	0.49	0.51
query6	1.47	0.72	0.74
query7	0.02	0.01	0.01
query8	0.05	0.05	0.04
query9	0.55	0.50	0.49
query10	0.54	0.54	0.55
query11	0.15	0.11	0.11
query12	0.15	0.12	0.12
query13	0.60	0.58	0.58
query14	0.76	0.76	0.78
query15	0.82	0.80	0.80
query16	0.35	0.37	0.36
query17	1.01	1.01	1.01
query18	0.22	0.26	0.21
query19	1.87	1.80	1.74
query20	0.01	0.00	0.00
query21	15.40	0.66	0.65
query22	4.27	7.58	1.53
query23	18.29	1.37	1.24
query24	1.46	0.43	0.20
query25	0.15	0.08	0.08
query26	0.26	0.17	0.17
query27	0.08	0.07	0.08
query28	13.28	0.99	0.97
query29	12.62	3.27	3.28
query30	0.26	0.06	0.05
query31	2.84	0.40	0.38
query32	3.26	0.47	0.47
query33	2.77	2.83	2.83
query34	17.02	4.38	4.48
query35	4.50	4.42	4.48
query36	0.65	0.47	0.47
query37	0.19	0.16	0.16
query38	0.15	0.14	0.14
query39	0.04	0.04	0.03
query40	0.18	0.15	0.14
query41	0.11	0.05	0.05
query42	0.06	0.05	0.05
query43	0.04	0.04	0.03
Total cold run time: 108.95 s
Total hot run time: 29.96 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 4def3293f1d30157ccdc57f2b1837e1ea731ab50 with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          58 seconds loaded 1101869774 Bytes, about 18 MB/s
Stream load parquet:      32 seconds loaded 861443392 Bytes, about 25 MB/s
Insert into select:       14.1 seconds inserted 10000000 Rows, about 709K ops/s

Contributor

@morningman morningman left a comment

LGTM

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 18, 2024
@github-actions
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit 02fa772 into apache:master Apr 19, 2024
morningman pushed a commit that referenced this pull request Apr 19, 2024
Issue Number: #31442

Hive 3 supports creating tables with column default values.
When using Hive 3, default values can be written to the table.
morningman pushed a commit to morningman/doris that referenced this pull request Apr 30, 2024
…#33666)

Issue Number: apache#31442

Hive 3 supports creating tables with column default values.
When using Hive 3, default values can be written to the table.
dataroaring pushed a commit that referenced this pull request May 1, 2024
…4.0 (#34371)

* [feature](insert)use optional location and add hive regression test (#33153)

* [feature](iceberg)The new DDL syntax is added to create iceberg partitioned tables (#33338)

Supports `partition by`:

```
create table tb1 (c1 string, ts datetime) engine = iceberg partition by (c1, day(ts)) () properties ("a"="b")
```

* [Enhancement](hive-writer) Adjust table sink exchange rebalancer params. (#33397)

Issue Number:  #31442

Change the table sink exchange rebalancer parameters to node level and tune them to improve write performance through better balancing.

rebalancer params:
```
DEFINE_mInt64(table_sink_partition_write_min_data_processed_rebalance_threshold,
              "26214400"); // 25MB
// Minimum partition data processed to rebalance writers in exchange when partition writing
DEFINE_mInt64(table_sink_partition_write_min_partition_data_processed_rebalance_threshold,
              "15728640"); // 15MB
```
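
The byte values above can be checked against their `25MB` and `15MB` comments with a little arithmetic. The sketch below also shows one way the two thresholds might gate a rebalance; the function `should_rebalance` and its both-thresholds rule are assumptions for illustration, not the logic in the Doris backend.

```python
MIB = 1024 * 1024

# Values copied from the DEFINE_mInt64 lines above.
MIN_DATA_PROCESSED = 26214400
MIN_PARTITION_DATA_PROCESSED = 15728640


def should_rebalance(total_bytes: int, partition_bytes: int) -> bool:
    """Hypothetical gate: rebalance only once both thresholds are reached."""
    return (
        total_bytes >= MIN_DATA_PROCESSED
        and partition_bytes >= MIN_PARTITION_DATA_PROCESSED
    )


print(MIN_DATA_PROCESSED // MIB, MIN_PARTITION_DATA_PROCESSED // MIB)
print(should_rebalance(30 * MIB, 16 * MIB), should_rebalance(30 * MIB, 10 * MIB))
```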

* [feature](profile) add transaction statistics for profile (#33488)

1. commit total time
2. fs operator total time
    - rename file count
    - rename dir count
    - delete dir count
3. add partition total time
    - add partition count
4. update partition total time
    - update partition count

For example:
```
      -  Transaction  Commit  Time:  906ms
          -  FileSystem  Operator  Time:  833ms
              -  Rename  File  Count:  4
              -  Rename  Dir  Count:  0
              -  Delete  Dir  Count:  0
          -  HMS  Add  Partition  Time:  0ms
              -  HMS  Add  Partition  Count:  0
          -  HMS  Update  Partition  Time:  68ms
              -  HMS  Update  Partition  Count:  4
```
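
For anyone who wants to compare these counters across runs, the profile lines above are regular enough to parse. This is a minimal sketch that assumes the exact `-  Name:  value` layout shown in the sample; `parse_profile` is a hypothetical helper, not a Doris tool.

```python
import re

# Sample copied from the profile output above.
PROFILE = """\
      -  Transaction  Commit  Time:  906ms
          -  FileSystem  Operator  Time:  833ms
              -  Rename  File  Count:  4
              -  Rename  Dir  Count:  0
              -  Delete  Dir  Count:  0
          -  HMS  Add  Partition  Time:  0ms
              -  HMS  Add  Partition  Count:  0
          -  HMS  Update  Partition  Time:  68ms
              -  HMS  Update  Partition  Count:  4
"""

LINE = re.compile(r"\s*-\s+(.+?):\s+(\d+)(ms)?\s*$")


def parse_profile(text: str) -> dict:
    """Map each counter name to its integer value (times are in milliseconds)."""
    stats = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip anything that is not a counter line
        name = " ".join(match.group(1).split())  # collapse the double spaces
        stats[name] = int(match.group(2))
    return stats


stats = parse_profile(PROFILE)
print(stats["Transaction Commit Time"], stats["HMS Update Partition Count"])
```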

* [feature](iceberg) add iceberg transaction implement (#33629)

Issue #31442

add iceberg transaction

* [feature](insert)support default value when create hive table (#33666)

Issue Number: #31442

Hive 3 supports creating tables with column default values.
When using Hive 3, default values can be written to the table.

* [refactor](filesystem)refactor `filesystem` interface (#33361)

1. Rename `list` to `globList`. The path passed to this method must contain a wildcard character, and the corresponding HDFS interface is `globStatus`, so the new name is `globList`.
2. If you only need to list files under a path, use the `listFiles` operation.
3. Merge the `listLocatedFiles` function into the `listFiles` function.
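
The difference between the two operations can be sketched over an in-memory file list. The functions `glob_list` and `list_files` below are hypothetical stand-ins that only mimic the intent described above (wildcard matching versus plain directory listing); they are not the Doris `FileSystem` interface, and the paths are made up.

```python
from fnmatch import fnmatchcase

# Illustrative paths only.
FILES = [
    "/warehouse/t/part-0.orc",
    "/warehouse/t/part-1.orc",
    "/warehouse/t/_SUCCESS",
    "/warehouse/u/part-0.orc",
]


def glob_list(pattern, files=FILES):
    """Return paths matching a wildcard pattern, one path segment at a time."""
    want = pattern.split("/")
    matched = []
    for path in files:
        have = path.split("/")
        # A wildcard never crosses a "/" boundary in this sketch.
        if len(have) == len(want) and all(
            fnmatchcase(h, w) for h, w in zip(have, want)
        ):
            matched.append(path)
    return matched


def list_files(directory, files=FILES):
    """Return the files directly under a directory; no wildcards involved."""
    prefix = directory.rstrip("/") + "/"
    return [p for p in files if p.startswith(prefix) and "/" not in p[len(prefix):]]


print(glob_list("/warehouse/*/part-0.orc"))
print(list_files("/warehouse/t"))
```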

* [opt](meta-cache) refine the meta cache (#33449)

1. Use `caffeine` instead of `guava cache` to get better performance
2. Add a new class `CacheFactory`

    All (Async)LoadingCache should be built from `CacheFactory`

3. Use separate executors for different caches

    1. rowCountRefreshExecutor
      For the row count cache.
      The row count cache is an async loading cache, and the result can be ignored
      if the cache misses or the thread pool is full,
      so it uses a separate executor.

    2.  commonRefreshExecutor
      For other caches. Other caches are sync loading caches,
      but commonRefreshExecutor is used for async refresh.
      That is, if a cache entry is missing, the value is loaded synchronously in the caller thread;
      if a cache entry needs a refresh, it is reloaded in commonRefreshExecutor.

    3. fileListingExecutor
      File listing is a heavy operation, so use a separate executor for it.
      For fileCache, the refresh operation will still use commonRefreshExecutor to trigger refresh.
      And fileListingExecutor will be used to list files.

4. Change the refresh and expire logic of caches

    For most of caches, set `refreshAfterWrite` strategy, so that
    even if the cache entry is expired, the old entry can still be
    used while new entry is being loaded.

5. Add new global variable `enable_get_row_count_from_file_list`

    Defaults to true; if set to false, getting the row count from the file list is disabled.
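
The "load on miss in the caller, refresh in the background, keep serving the old entry" behaviour described in items 3 and 4 can be sketched without any cache library. The class `RefreshingCache` below is a hypothetical, single-threaded illustration of that idea; it is neither Caffeine nor `CacheFactory`, and the `submit` callback merely stands in for a refresh executor.

```python
import time


class RefreshingCache:
    """Tiny refresh-after-write sketch (not thread-safe, illustration only)."""

    def __init__(self, loader, refresh_after, submit, clock=time.monotonic):
        self._loader = loader
        self._refresh_after = refresh_after
        self._submit = submit  # stand-in for a refresh executor
        self._clock = clock
        self._entries = {}  # key -> (value, written_at)
        self._refreshing = set()

    def get(self, key):
        now = self._clock()
        if key not in self._entries:
            # Cache miss: load synchronously in the caller.
            value = self._loader(key)
            self._entries[key] = (value, now)
            return value
        value, written_at = self._entries[key]
        if now - written_at >= self._refresh_after and key not in self._refreshing:
            # Stale entry: schedule a reload, but do not wait for it.
            self._refreshing.add(key)
            self._submit(lambda: self._reload(key))
        return value  # the old value stays usable while the reload is pending

    def _reload(self, key):
        self._entries[key] = (self._loader(key), self._clock())
        self._refreshing.discard(key)


# Usage with a fake clock and a queue that plays the role of the executor.
pending, now, loads = [], [0.0], []
cache = RefreshingCache(
    loader=lambda k: loads.append(k) or f"{k}-v{len(loads)}",
    refresh_after=10.0,
    submit=pending.append,
    clock=lambda: now[0],
)
print(cache.get("tbl"))   # loaded in the caller
now[0] = 11.0
print(cache.get("tbl"))   # still the old value; a reload is queued
pending.pop()()           # the "executor" runs the reload
print(cache.get("tbl"))   # new value
```

Passing the executor in as a plain callback keeps the example deterministic; a real implementation would hand the reload to a thread pool and guard the shared state.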

* [bugfix](hive)delete write path after hive insert (#33798)

Issue #31442

1. Delete files according to the query ID
2. Delete the write path after insert

* [Enhancement](multi-catalog) Rewrite `S3URI` to remove tricky virtual bucket mechanism and support different uri styles by flags. (#33858)

Many domestic cloud vendors are compatible with the S3 protocol. However, early versions of the S3 client only generate path-style HTTP requests (aws/aws-sdk-java-v2#763) when they encounter endpoints that do not start with s3, while some cloud vendors only support virtual-host-style HTTP requests.

Therefore, Doris used `forceVirtualHosted` in `S3URI` to convert the location into a virtual-hosted path and implemented the request through path style.
For example, the S3 URI `s3://my-bucket/data/file.txt` is eventually parsed into:
- virtualBucket: my-bucket
- Bucket: data (the bucket must be set, otherwise the S3 client reports an error). This step is particularly tricky because of the limitations of the S3 client.
- Key: file.txt

 The path style mode is used to generate an http request similar to the virtual host by setting the endpoint to virtualBucket + original endpoint, setting the bucket and key.
**However, the bucket and key here are inconsistent with the original concepts of s3, but the aws client happens to be able to generate an http request similar to the virtual host through the path style mode.**
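
To make the old trick concrete, the sketch below reproduces the parse from the example above and shows why a path-style request against the rewritten endpoint ends up looking like a virtual-host request. The function names, the `s3.example.com` endpoint, and the dot used to join the virtual bucket to the endpoint are assumptions for illustration; this is not the removed Doris code.

```python
from urllib.parse import urlsplit


def parse_legacy_virtual_bucket(uri: str, endpoint: str) -> dict:
    """Split an s3:// URI the way the old virtual bucket mechanism is described."""
    parts = urlsplit(uri)
    virtual_bucket = parts.netloc  # the real bucket name
    # The first path segment is passed to the client as the "bucket".
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return {
        "endpoint": f"{virtual_bucket}.{endpoint}",  # assumed join format
        "virtual_bucket": virtual_bucket,
        "bucket": bucket,
        "key": key,
    }


def path_style_url(parsed: dict) -> str:
    """A path-style request is endpoint/bucket/key."""
    return f"https://{parsed['endpoint']}/{parsed['bucket']}/{parsed['key']}"


parsed = parse_legacy_virtual_bucket("s3://my-bucket/data/file.txt", "s3.example.com")
print(parsed)
print(path_style_url(parsed))
```

The resulting URL has the real bucket in the host name and the full object path after it, which is the shape of a virtual-host request even though the client was configured for path style.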

However, after #30799 upgraded the AWS SDK version from 2.17.257 to 2.20.131, the AWS S3 client can generate virtual-host-style HTTP requests for third-party endpoints by default. As a result, #31111 had to set the path style option so that the S3 client could keep working with Doris' virtual bucket mechanism.

**Finally, the virtual bucket mechanism is too confusing and tricky, and we no longer need it with the new version of s3 client.**

### Resolution:

Rewrite `S3URI` to remove tricky virtual bucket mechanism and support different uri styles by flags.

This class represents a fully qualified location in S3 for input/output operations, expressed as a URI.
 #### For AWS S3, URI common styles:
  - AWS Client Style(Hadoop S3 Style): `s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
  - Virtual Host Style: `https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
  - Path Style: `https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
 
  Regarding the above-mentioned common styles, we can use <code>isPathStyle</code> to control whether to use path style
  or virtual host style.
  "Virtual host style" is the currently mainstream and recommended approach to use, so the default value of
  <code>isPathStyle</code> is false.
 
  #### Other Styles:
  - Virtual Host AWS Client (Hadoop S3) Mixed Style:
    `s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
  - Path AWS Client (Hadoop S3) Mixed Style:
     `s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
 
  For these two styles, we can use <code>isPathStyle</code> and <code>forceParsingByStandardUri</code>
  to control whether to use.
  Virtual Host AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = false && forceParsingByStandardUri = true</code>
  Path AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = true && forceParsingByStandardUri = true</code>
 
  When the incoming location is URL-encoded, the encoded string is returned:
  <code>getKey()</code> and <code>getQueryParams()</code> both return the encoded string.
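
The five URI styles and the two flags can be exercised with a small parser. This is a sketch under assumptions: `parse_s3_location` and its snake_case parameters are hypothetical Python stand-ins for the behaviour described above, not the `S3URI` class itself, and treating any `http`/`https` location as a standard URI is a guess at the intended rule.

```python
from urllib.parse import urlsplit


def parse_s3_location(location, is_path_style=False, force_parsing_by_standard_uri=False):
    """Return (bucket, key, query_params) without decoding anything."""
    parts = urlsplit(location)
    standard = force_parsing_by_standard_uri or parts.scheme in ("http", "https")
    path = parts.path.lstrip("/")
    if not standard:
        # AWS client (Hadoop S3) style: the authority is the bucket.
        bucket, key = parts.netloc, path
    elif is_path_style:
        # Path style: the first path segment is the bucket.
        bucket, _, key = path.partition("/")
    else:
        # Virtual host style: the first host label is the bucket.
        bucket, key = parts.netloc.split(".", 1)[0], path
    params = {}
    for pair in filter(None, parts.query.split("&")):
        name, _, value = pair.partition("=")
        params.setdefault(name, []).append(value)  # keep values as received
    return bucket, key, params


query = "?versionId=abc123&partNumber=77&partNumber=88"
print(parse_s3_location("s3://my-bucket/path/to/file" + query))
print(parse_s3_location(
    "https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt" + query,
    is_path_style=True,
))
print(parse_s3_location(
    "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt" + query,
    force_parsing_by_standard_uri=True,
))
```

Because neither the path nor the query values are decoded, a URL-encoded location comes back still encoded, which matches the note above.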

* [improvement](hive)add the `queryid` to the temporary file path (#34278)

Change the temporary path from `_temp_<table_name>` to `_temp_<queryid>_<table_name>`.
This avoids a clash when a user already has a table named `_temp_<table_name>`.

It also partitions the temp dir by query.
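
A minimal sketch of the new naming scheme follows. The helper `temp_write_path`, the `base_dir` argument, and the sample query IDs are hypothetical; only the `_temp_<queryid>_<table_name>` pattern comes from the description above.

```python
def temp_write_path(base_dir: str, query_id: str, table_name: str) -> str:
    """Build a per-query temporary directory name for a table."""
    return f"{base_dir.rstrip('/')}/_temp_{query_id}_{table_name}"


first = temp_write_path("/warehouse/db", "q1", "orders")
second = temp_write_path("/warehouse/db", "q2", "orders")
print(first)
print(second)
```

Two inserts into the same table get different directories, so cleaning up one query's temporary files does not touch the other's.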

* [feature](Cloud) Load index data into index cache when writing data (#34046)

* [Feature](hive-writer) Implements s3 file committer. (#33937)

Issue Number: #31442

[Feature] (hive-writer) Implements s3 file committer. 

The S3 committer starts multipart uploads for all files on the BE side and then completes those multipart uploads on the FE side. Until the multipart upload of a file is completed, the file is not visible, so the atomicity of a single file can be guaranteed. It still cannot guarantee atomicity across multiple files, but because Hive committers have best-effort semantics, this shortens the inconsistent time window.

## ChangeList:
- Add `used_by_s3_committer` in `FileWriterOptions` on BE side to start multi-part uploading files, then complete multi-part uploading files on FE side.
- `cosn://` uses the S3 client on the FE side, because it needs to complete multi-part uploads on the FE side.
-  Add `Status directoryExists(String dir)` and `Status deleteDirectory` in `FileSystem`.
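
The split between starting uploads on the BE and completing them on the FE can be sketched with an in-memory store. Everything here is hypothetical: `FakeObjectStore`, `backend_write`, and `frontend_commit` only mimic the multipart flow described above and do not call any real S3 client.

```python
class FakeObjectStore:
    """In-memory stand-in for an S3-like store (illustration only)."""

    def __init__(self):
        self.visible = {}   # key -> bytes that readers can see
        self._uploads = {}  # upload_id -> (key, {part_number: bytes})
        self._next_id = 0

    def create_multipart_upload(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self._uploads[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        self._uploads[upload_id][1][part_number] = data

    def complete_multipart_upload(self, upload_id):
        key, parts = self._uploads.pop(upload_id)
        # The object becomes visible only at this point.
        self.visible[key] = b"".join(parts[n] for n in sorted(parts))

    def abort_multipart_upload(self, upload_id):
        self._uploads.pop(upload_id, None)


def backend_write(store, key, chunks):
    """BE side: upload every part, but leave the upload uncompleted."""
    upload_id = store.create_multipart_upload(key)
    for number, chunk in enumerate(chunks, start=1):
        store.upload_part(upload_id, number, chunk)
    return upload_id  # reported to the FE for the commit step


def frontend_commit(store, upload_ids):
    """FE side: complete the uploads one file at a time."""
    for upload_id in upload_ids:
        store.complete_multipart_upload(upload_id)


store = FakeObjectStore()
ids = [
    backend_write(store, "t/part-0", [b"ab", b"cd"]),
    backend_write(store, "t/part-1", [b"ef"]),
]
print(sorted(store.visible))  # nothing is visible before the commit
frontend_commit(store, ids)
print(sorted(store.visible))
```

Each file appears atomically, but the loop in `frontend_commit` completes them one after another, which is the remaining multi-file window mentioned above.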

---------

Co-authored-by: slothever <18522955+wsjz@users.noreply.github.com>
Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com>
Co-authored-by: Qi Chen <kaka11.chen@gmail.com>
Co-authored-by: AlexYue <yj976240184@gmail.com>
@wsjz wsjz deleted the default_val branch July 4, 2024 08:35

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.3-merged meta-change reviewed

4 participants