@morningman morningman commented Feb 19, 2024

Proposed changes

After #30799, which upgraded the AWS SDK from 2.17.257 to 2.20.131, the default endpoint URL style logic seems to have changed.
So I set `pathStyleAccessEnabled(true)` to fix this.
Tested on S3, OSS, and COS.
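
For readers unfamiliar with the two addressing styles, here is a minimal sketch of where the bucket ends up in the request URL in each case. It is an illustration only, not the AWS SDK's code: the class `S3RequestUrlSketch`, the method `requestUrl`, and the sample bucket and key are hypothetical names chosen for this example. In the SDK itself the switch is the `pathStyleAccessEnabled(true)` option.

```java
import java.net.URI;

/** Illustrative only: shows where the bucket goes in each S3 addressing style. */
final class S3RequestUrlSketch {

    /**
     * Builds the request URL for an object.
     *
     * @param endpoint  service endpoint, e.g. "https://s3.us-west-1.amazonaws.com"
     * @param bucket    bucket name
     * @param key       object key without a leading slash
     * @param pathStyle true for path style, false for virtual host style
     */
    static String requestUrl(String endpoint, String bucket, String key, boolean pathStyle) {
        URI uri = URI.create(endpoint);
        String authority = uri.getAuthority(); // host plus optional port
        if (pathStyle) {
            // Path style: the bucket is the first path segment.
            return uri.getScheme() + "://" + authority + "/" + bucket + "/" + key;
        }
        // Virtual host style: the bucket becomes a subdomain of the endpoint host.
        return uri.getScheme() + "://" + bucket + "." + authority + "/" + key;
    }

    public static void main(String[] args) {
        String endpoint = "https://s3.us-west-1.amazonaws.com";
        System.out.println(requestUrl(endpoint, "my-bucket", "resources/doc.txt", true));
        System.out.println(requestUrl(endpoint, "my-bucket", "resources/doc.txt", false));
    }
}
```

A client that silently switches from the first form to the second sends requests to a different host, which is why a change in the SDK default can break endpoints that were set up for the other style.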


@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

@morningman morningman marked this pull request as ready for review February 19, 2024 09:09
@morningman
Contributor Author

run buildall

@morningman
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 41836 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 40ed5fcf2e4d6a2cf1b7feb4f8bee0d8ae17f0be, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17904	5199	5122	5122
q2	2665	145	139	139
q3	11847	1057	1010	1010
q4	4653	985	981	981
q5	7691	3266	3266	3266
q6	199	139	136	136
q7	1266	785	766	766
q8	9243	2084	2089	2084
q9	7633	6730	6715	6715
q10	8328	2645	2646	2645
q11	413	209	219	209
q12	736	335	329	329
q13	18005	3694	3664	3664
q14	295	261	257	257
q15	610	500	511	500
q16	488	403	414	403
q17	925	842	874	842
q18	7307	6647	6652	6647
q19	1684	1484	1483	1483
q20	631	349	361	349
q21	6310	3949	3994	3949
q22	880	349	340	340
Total cold run time: 109713 ms
Total hot run time: 41836 ms
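
A quick check of how the two totals relate to the rows: the cold total looks like the sum of the first timing column, and the hot total like the sum of the last column, which in every row equals the smaller of the two middle values. The sketch below recomputes both from the Round 1 rows. The class and method names are made up for this check and are not part of the tpch-tools scripts.

```java
/** Recomputes the Round 1 totals from the rows reported above (values in ms). */
final class TpchTotalsCheck {
    // Each row: {first (cold) run, hot run 1, hot run 2, last column}.
    static final int[][] ROUND1 = {
        {17904, 5199, 5122, 5122}, {2665, 145, 139, 139}, {11847, 1057, 1010, 1010},
        {4653, 985, 981, 981}, {7691, 3266, 3266, 3266}, {199, 139, 136, 136},
        {1266, 785, 766, 766}, {9243, 2084, 2089, 2084}, {7633, 6730, 6715, 6715},
        {8328, 2645, 2646, 2645}, {413, 209, 219, 209}, {736, 335, 329, 329},
        {18005, 3694, 3664, 3664}, {295, 261, 257, 257}, {610, 500, 511, 500},
        {488, 403, 414, 403}, {925, 842, 874, 842}, {7307, 6647, 6652, 6647},
        {1684, 1484, 1483, 1483}, {631, 349, 361, 349}, {6310, 3949, 3994, 3949},
        {880, 349, 340, 340},
    };

    /** Sums one column over all rows. */
    static int sumColumn(int[][] rows, int col) {
        int total = 0;
        for (int[] row : rows) {
            total += row[col];
        }
        return total;
    }

    /** True if the last column is the minimum of the two hot runs in every row. */
    static boolean lastColumnIsBestHot(int[][] rows) {
        for (int[] row : rows) {
            if (row[3] != Math.min(row[1], row[2])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("cold total: " + sumColumn(ROUND1, 0)); // 109713
        System.out.println("hot total:  " + sumColumn(ROUND1, 3)); // 41836
        System.out.println("last column is min(hot1, hot2): " + lastColumnIsBestHot(ROUND1));
    }
}
```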

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4859	4872	4875	4872
q2	293	192	179	179
q3	3586	3604	3591	3591
q4	2528	2550	2547	2547
q5	5756	5751	5918	5751
q6	209	126	130	126
q7	2269	1645	1673	1645
q8	3030	3073	3083	3073
q9	8671	8726	8693	8693
q10	6904	4254	4233	4233
q11	505	360	369	360
q12	765	533	543	533
q13	4211	3438	3382	3382
q14	265	237	231	231
q15	594	508	533	508
q16	473	441	432	432
q17	1665	1614	1620	1614
q18	8313	7706	7640	7640
q19	1639	1654	1635	1635
q20	2106	1825	1824	1824
q21	6590	6217	6142	6142
q22	570	500	513	500
Total cold run time: 65801 ms
Total hot run time: 59511 ms

@doris-robot

TPC-DS: Total hot run time: 177990 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 40ed5fcf2e4d6a2cf1b7feb4f8bee0d8ae17f0be, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	929	347	350	347
query2	6511	1775	1719	1719
query3	6695	219	203	203
query4	23104	21164	21163	21163
query5	4268	469	384	384
query6	257	164	168	164
query7	4604	309	301	301
query8	243	215	201	201
query9	8445	2859	2846	2846
query10	430	219	243	219
query11	15072	14457	14725	14457
query12	145	83	83	83
query13	1696	432	420	420
query14	9013	7803	7790	7790
query15	210	188	186	186
query16	7409	261	262	261
query17	1395	570	537	537
query18	1949	273	265	265
query19	191	156	147	147
query20	89	82	87	82
query21	189	123	124	123
query22	4896	4776	4719	4719
query23	32610	31609	31695	31609
query24	12889	3454	3432	3432
query25	660	366	357	357
query26	1897	155	163	155
query27	3031	326	320	320
query28	6491	1851	1826	1826
query29	1179	635	623	623
query30	284	135	149	135
query31	931	752	766	752
query32	99	60	60	60
query33	745	243	237	237
query34	1082	494	505	494
query35	982	865	834	834
query36	972	901	865	865
query37	266	62	64	62
query38	3318	3189	3222	3189
query39	1405	1348	1317	1317
query40	280	105	108	105
query41	36	34	35	34
query42	106	102	104	102
query43	479	453	425	425
query44	1065	695	701	695
query45	196	184	178	178
query46	1045	786	769	769
query47	1628	1526	1567	1526
query48	421	349	349	349
query49	1212	306	299	299
query50	790	374	378	374
query51	5422	5137	5121	5121
query52	110	95	98	95
query53	399	304	296	296
query54	291	221	227	221
query55	84	82	86	82
query56	235	210	201	201
query57	1042	909	975	909
query58	226	200	203	200
query59	2242	2024	2200	2024
query60	237	209	218	209
query61	90	83	81	81
query62	578	389	378	378
query63	334	298	282	282
query64	6292	3093	3181	3093
query65	3302	3248	3249	3248
query66	1333	330	322	322
query67	14690	14529	14328	14328
query68	5172	574	572	572
query69	502	363	365	363
query70	1271	1225	1206	1206
query71	367	275	258	258
query72	6310	2790	2657	2657
query73	692	321	311	311
query74	6752	6471	6405	6405
query75	3205	2586	2561	2561
query76	3101	1112	1214	1112
query77	351	249	250	249
query78	9411	8781	8749	8749
query79	979	520	511	511
query80	534	364	346	346
query81	435	207	200	200
query82	239	84	86	84
query83	142	120	125	120
query84	227	81	80	80
query85	1065	350	331	331
query86	294	302	304	302
query87	3425	3298	3262	3262
query88	2716	2291	2274	2274
query89	446	359	357	357
query90	2087	169	173	169
query91	161	127	127	127
query92	58	49	50	49
query93	971	503	528	503
query94	1241	189	185	185
query95	495	8780	384	384
query96	575	263	261	261
query97	4477	4260	4298	4260
query98	222	201	206	201
query99	1054	711	718	711
Total cold run time: 269723 ms
Total hot run time: 177990 ms

@doris-robot

ClickBench: Total hot run time: 31.09 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 40ed5fcf2e4d6a2cf1b7feb4f8bee0d8ae17f0be, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.03	0.02	0.03
query2	0.06	0.02	0.02
query3	0.23	0.07	0.08
query4	1.65	0.10	0.10
query5	0.48	0.48	0.49
query6	1.36	0.61	0.60
query7	0.02	0.02	0.01
query8	0.03	0.02	0.02
query9	0.51	0.45	0.46
query10	0.49	0.49	0.50
query11	0.12	0.09	0.09
query12	0.12	0.10	0.10
query13	0.60	0.59	0.58
query14	0.76	0.81	0.78
query15	0.83	0.81	0.79
query16	0.33	0.33	0.33
query17	0.93	0.91	0.92
query18	0.21	0.18	0.16
query19	1.71	1.70	1.74
query20	0.02	0.01	0.01
query21	15.43	0.61	0.55
query22	2.91	3.76	2.64
query23	17.45	1.08	0.92
query24	2.04	0.22	0.54
query25	0.64	0.06	0.08
query26	0.17	0.14	0.16
query27	0.06	0.06	0.05
query28	12.10	0.84	0.83
query29	12.49	3.37	3.32
query30	0.52	0.48	0.49
query31	2.78	0.37	0.37
query32	3.33	0.47	0.48
query33	3.13	3.17	3.10
query34	15.35	4.54	4.50
query35	4.52	4.50	4.51
query36	1.06	0.94	0.95
query37	0.07	0.05	0.05
query38	0.04	0.03	0.03
query39	0.02	0.02	0.02
query40	0.17	0.14	0.14
query41	0.07	0.02	0.02
query42	0.03	0.01	0.02
query43	0.03	0.02	0.02
Total cold run time: 104.9 s
Total hot run time: 31.09 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 40ed5fcf2e4d6a2cf1b7feb4f8bee0d8ae17f0be with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       13.5 seconds inserted 10000000 Rows, about 740K ops/s

@morningman
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 41318 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c6ee55a5c79f73c0f3eb7bcaae069c74e943fd1d, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	19480	5149	4756	4756
q2	2052	141	137	137
q3	10582	1023	1020	1020
q4	4647	980	990	980
q5	7700	3183	3283	3183
q6	194	130	127	127
q7	1240	778	767	767
q8	9228	2082	2060	2060
q9	7679	6699	6669	6669
q10	8325	2649	2642	2642
q11	418	200	197	197
q12	784	324	328	324
q13	17979	3686	3649	3649
q14	295	265	259	259
q15	604	506	508	506
q16	474	422	427	422
q17	933	877	879	877
q18	7290	6609	6702	6609
q19	1546	1489	1483	1483
q20	622	350	337	337
q21	7154	4028	3975	3975
q22	865	357	339	339
Total cold run time: 110091 ms
Total hot run time: 41318 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4786	4748	4745	4745
q2	298	188	182	182
q3	3589	3586	3581	3581
q4	2530	2531	2487	2487
q5	5724	5712	5767	5712
q6	208	128	123	123
q7	2241	1678	1625	1625
q8	3003	3047	3089	3047
q9	8700	8640	8642	8640
q10	6893	4250	4225	4225
q11	520	362	373	362
q12	775	534	538	534
q13	4324	3402	3380	3380
q14	271	239	224	224
q15	628	533	533	533
q16	476	472	442	442
q17	1654	1587	1584	1584
q18	8270	7683	7558	7558
q19	1634	1627	1624	1624
q20	2111	1848	1837	1837
q21	6534	6158	6101	6101
q22	562	498	497	497
Total cold run time: 65731 ms
Total hot run time: 59043 ms

@doris-robot

TPC-DS: Total hot run time: 185756 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c6ee55a5c79f73c0f3eb7bcaae069c74e943fd1d, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	926	349	350	349
query2	6508	1781	1852	1781
query3	6692	206	209	206
query4	23220	21159	21094	21094
query5	4348	455	464	455
query6	267	170	174	170
query7	4609	295	289	289
query8	245	204	194	194
query9	8436	2807	2779	2779
query10	421	217	224	217
query11	15115	14477	14445	14445
query12	142	82	80	80
query13	1693	411	413	411
query14	8910	7663	7494	7494
query15	217	190	189	189
query16	7126	254	250	250
query17	1426	554	551	551
query18	2089	268	259	259
query19	190	142	143	142
query20	84	85	88	85
query21	192	120	114	114
query22	4964	4764	4783	4764
query23	32626	31643	31461	31461
query24	12873	3441	3382	3382
query25	646	371	361	361
query26	1884	147	160	147
query27	3076	318	317	317
query28	6654	1841	1814	1814
query29	1239	616	608	608
query30	287	135	142	135
query31	935	754	753	753
query32	101	59	59	59
query33	726	240	236	236
query34	1049	499	501	499
query35	944	826	832	826
query36	992	880	884	880
query37	268	60	65	60
query38	3314	3165	3159	3159
query39	1363	1331	1328	1328
query40	290	108	104	104
query41	37	36	34	34
query42	111	101	95	95
query43	469	443	440	440
query44	1085	681	691	681
query45	194	187	176	176
query46	1047	761	765	761
query47	1605	1547	1564	1547
query48	419	345	347	345
query49	1247	301	304	301
query50	767	374	375	374
query51	5369	5176	5152	5152
query52	114	90	91	90
query53	391	308	288	288
query54	308	222	222	222
query55	99	83	90	83
query56	218	214	210	210
query57	1018	940	949	940
query58	221	200	196	196
query59	2198	2197	2149	2149
query60	259	221	224	221
query61	83	84	82	82
query62	577	376	362	362
query63	322	283	283	283
query64	6488	3074	3190	3074
query65	3304	3289	3246	3246
query66	1366	322	332	322
query67	14680	14330	14320	14320
query68	5009	552	543	543
query69	511	348	370	348
query70	1234	1210	1272	1210
query71	371	258	252	252
query72	6305	2739	2604	2604
query73	718	312	313	312
query74	6823	6465	6536	6465
query75	3237	2578	2548	2548
query76	3264	1137	1213	1137
query77	358	239	226	226
query78	9501	8823	8762	8762
query79	956	504	509	504
query80	524	358	339	339
query81	435	204	204	204
query82	237	87	93	87
query83	139	122	119	119
query84	236	79	78	78
query85	1022	342	339	339
query86	292	319	300	300
query87	3435	3312	3317	3312
query88	2752	2281	2280	2280
query89	437	358	355	355
query90	1925	165	164	164
query91	151	127	127	127
query92	57	50	46	46
query93	1010	518	473	473
query94	1089	182	182	182
query95	488	8643	8515	8515
query96	578	264	265	264
query97	4455	4306	4297	4297
query98	212	206	212	206
query99	1061	717	731	717
Total cold run time: 269905 ms
Total hot run time: 185756 ms

@doris-robot

ClickBench: Total hot run time: 31.26 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c6ee55a5c79f73c0f3eb7bcaae069c74e943fd1d, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.03	0.02	0.03
query2	0.07	0.03	0.03
query3	0.22	0.08	0.07
query4	1.63	0.09	0.07
query5	0.48	0.48	0.49
query6	1.35	0.61	0.62
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.53	0.44	0.44
query10	0.48	0.49	0.50
query11	0.12	0.09	0.10
query12	0.12	0.10	0.10
query13	0.59	0.60	0.58
query14	0.77	0.77	0.79
query15	0.81	0.80	0.82
query16	0.34	0.32	0.33
query17	0.90	0.92	0.92
query18	0.18	0.16	0.16
query19	1.74	1.70	1.67
query20	0.01	0.02	0.01
query21	15.42	0.65	0.57
query22	2.66	4.28	2.77
query23	17.15	1.10	0.99
query24	2.15	0.28	0.89
query25	0.57	0.10	0.07
query26	0.17	0.15	0.15
query27	0.07	0.06	0.05
query28	11.56	0.86	0.81
query29	12.52	3.30	3.35
query30	0.57	0.49	0.50
query31	2.78	0.36	0.38
query32	3.31	0.47	0.47
query33	3.17	3.11	3.13
query34	15.35	4.51	4.49
query35	4.49	4.50	4.48
query36	1.07	0.94	0.96
query37	0.07	0.04	0.05
query38	0.04	0.03	0.03
query39	0.02	0.02	0.01
query40	0.18	0.14	0.15
query41	0.08	0.02	0.01
query42	0.02	0.02	0.01
query43	0.02	0.02	0.02
Total cold run time: 103.87 s
Total hot run time: 31.26 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit c6ee55a5c79f73c0f3eb7bcaae069c74e943fd1d with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       13.5 seconds inserted 10000000 Rows, about 740K ops/s

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 19, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei merged commit 4e8470d into apache:master Feb 19, 2024
morningman pushed a commit that referenced this pull request Apr 22, 2024
… bucket mechanism and support different uri styles by flags. (#33858)

Many domestic cloud vendors are compatible with the S3 protocol. However, early versions of the S3 client generate only path-style HTTP requests (aws/aws-sdk-java-v2#763) for endpoints that do not start with `s3`, while some cloud vendors support only virtual-host-style HTTP requests.

Therefore, Doris used `forceVirtualHosted` in `S3URI` to convert the URI into a virtual-hosted form, implemented on top of path style.
For example, the S3 URI `s3://my-bucket/data/file.txt` is eventually parsed into:
- virtualBucket: my-bucket
- Bucket: data (the bucket must be set, otherwise the S3 client reports an error). This step is particularly tricky because of the limitations of the S3 client.
- Key: file.txt

Path-style mode then produces an HTTP request that resembles a virtual-host request: the endpoint is set to virtualBucket + original endpoint, and the bucket and key are set as above.
**However, the bucket and key here are inconsistent with the original S3 concepts; the AWS client merely happens to generate a virtual-host-like HTTP request through path-style mode.**
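
To make the trick concrete, here is a minimal sketch of the old mapping under stated assumptions. It is not the removed Doris code: the class `VirtualBucketSketch`, its methods, and the endpoint host `storage.example.com` are hypothetical, and joining the virtual bucket to the endpoint with a `.` is an assumption about what "virtualBucket + original endpoint" means.

```java
import java.net.URI;

/** Illustrative only: mimics the old "virtual bucket" parsing described above. */
final class VirtualBucketSketch {
    final String virtualBucket; // the real bucket, taken from the URI authority
    final String bucket;        // in fact the first path segment of the object key
    final String key;           // the remainder of the path

    private VirtualBucketSketch(String virtualBucket, String bucket, String key) {
        this.virtualBucket = virtualBucket;
        this.bucket = bucket;
        this.key = key;
    }

    static VirtualBucketSketch parse(String location) {
        URI uri = URI.create(location);
        String path = uri.getPath();
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("no object path in " + location);
        }
        String rest = path.substring(1); // drop the leading '/'
        int slash = rest.indexOf('/');
        if (slash < 0) {
            // The client insists on a bucket, so one path segment is not enough.
            throw new IllegalArgumentException("need at least two path segments: " + location);
        }
        return new VirtualBucketSketch(
                uri.getAuthority(), rest.substring(0, slash), rest.substring(slash + 1));
    }

    /** Path-style URL against the rewritten endpoint (virtual bucket prepended to the host). */
    String pathStyleUrl(String scheme, String endpointHost) {
        return scheme + "://" + virtualBucket + "." + endpointHost + "/" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        VirtualBucketSketch parsed = parse("s3://my-bucket/data/file.txt");
        System.out.println(parsed.pathStyleUrl("https", "storage.example.com"));
    }
}
```

The resulting URL has the real bucket in the host and `data/file.txt` in the path, so it looks like a virtual-host request even though the client believes `data` is the bucket.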

However, #30799 upgraded the AWS SDK from 2.17.257 to 2.20.131. The current AWS S3 client already generates virtual-host-style HTTP requests by default for third-party endpoints. So #31111 had to set the path-style option so that the S3 client could keep working with Doris's virtual bucket mechanism.

**Finally, the virtual bucket mechanism is too confusing and tricky, and we no longer need it with the new version of the S3 client.**

### Resolution:

Rewrite `S3URI` to remove the tricky virtual bucket mechanism and support different URI styles by flags.

This class represents a fully qualified location in S3 for input/output operations, expressed as a URI.
 #### Common URI styles for AWS S3:
  - AWS Client Style (Hadoop S3 Style): `s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
  - Virtual Host Style: `https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
  - Path Style: `https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
 
  For the common styles above, <code>isPathStyle</code> controls whether path style or virtual host style is used.
  Virtual host style is currently the mainstream and recommended approach, so <code>isPathStyle</code>
  defaults to false.
 
  #### Other Styles:
  - Virtual Host AWS Client (Hadoop S3) Mixed Style:
    `s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
  - Path AWS Client (Hadoop S3) Mixed Style:
     `s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
 
  For these two styles, <code>isPathStyle</code> and <code>forceParsingByStandardUri</code>
  together select the parsing mode:
  - Virtual Host AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = false && forceParsingByStandardUri = true</code>
  - Path AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = true && forceParsingByStandardUri = true</code>
 
  When the incoming location is URL-encoded, the encoded string is returned:
  <code>getKey()</code> and <code>getQueryParams()</code> return the encoded form.
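
The flag combinations above can be sketched as follows. This is a simplified illustration, not the `S3URI` class from the commit: the class `S3UriSketch` and its fields are hypothetical, and two details are assumptions made for the sketch, namely that `http`/`https` locations are always parsed as standard URIs and that in virtual host style the bucket is the host name up to the first dot.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a simplified parser for the URI styles listed above. */
final class S3UriSketch {
    final String bucket;
    final String key;                            // kept URL-encoded if the input was
    final Map<String, List<String>> queryParams; // repeated names keep all values

    private S3UriSketch(String bucket, String key, Map<String, List<String>> queryParams) {
        this.bucket = bucket;
        this.key = key;
        this.queryParams = queryParams;
    }

    static S3UriSketch parse(String location, boolean isPathStyle,
                             boolean forceParsingByStandardUri) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        // Raw accessors keep percent-encoding intact.
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        if (path.startsWith("/")) {
            path = path.substring(1);
        }
        String bucket;
        String key;
        if (!http && !forceParsingByStandardUri) {
            // AWS client (Hadoop S3) style: s3://bucket/key
            bucket = uri.getAuthority();
            key = path;
        } else if (isPathStyle) {
            // Path style: the bucket is the first path segment.
            int slash = path.indexOf('/');
            bucket = slash < 0 ? path : path.substring(0, slash);
            key = slash < 0 ? "" : path.substring(slash + 1);
        } else {
            // Virtual host style: the bucket is the first label of the host name.
            String host = uri.getHost();
            if (host == null) {
                throw new IllegalArgumentException("no host in " + location);
            }
            int dot = host.indexOf('.');
            bucket = dot < 0 ? host : host.substring(0, dot);
            key = path;
        }
        return new S3UriSketch(bucket, key, parseQuery(uri.getRawQuery()));
    }

    private static Map<String, List<String>> parseQuery(String rawQuery) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }

    public static void main(String[] args) {
        S3UriSketch u = parse(
                "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123",
                true, true);
        System.out.println(u.bucket + " | " + u.key + " | " + u.queryParams);
    }
}
```

Keeping the query values in a list per name is what lets a location with a repeated parameter such as `partNumber=77&partNumber=88` round-trip without losing a value.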
yiguolei pushed a commit that referenced this pull request Apr 22, 2024
… bucket mechanism and support different uri styles by flags. (#33858)

morningman pushed a commit to morningman/doris that referenced this pull request Apr 30, 2024
… bucket mechanism and support different uri styles by flags. (apache#33858)

dataroaring pushed a commit that referenced this pull request May 1, 2024
…4.0 (#34371)

* [feature](insert)use optional location and add hive regression test (#33153)

* [feature](iceberg)The new DDL syntax is added to create iceberg partitioned tables (#33338)

Support `partition by`:

```
create table tb1 (c1 string, ts datetime) engine = iceberg partition by (c1, day(ts)) () properties ("a"="b")
```

* [Enhancement](hive-writer) Adjust table sink exchange rebalancer params. (#33397)

Issue Number:  #31442

Change the table sink exchange rebalancer parameters to node level and tune them to improve write performance through better balancing.

rebalancer params:
```
DEFINE_mInt64(table_sink_partition_write_min_data_processed_rebalance_threshold,
              "26214400"); // 25MB
// Minimum partition data processed to rebalance writers in exchange when partition writing
DEFINE_mInt64(table_sink_partition_write_min_partition_data_processed_rebalance_threshold,
              "15728640"); // 15MB
```

* [feature](profile) add transaction statistics for profile (#33488)

1. commit total time
2. fs operator total time
     rename file count
     rename dir count
     delete dir count
3. add partition total time
    add partition count
4. update partition total time
    update partition count
like:
```
      -  Transaction  Commit  Time:  906ms
          -  FileSystem  Operator  Time:  833ms
              -  Rename  File  Count:  4
              -  Rename  Dir  Count:  0
              -  Delete  Dir  Count:  0
          -  HMS  Add  Partition  Time:  0ms
              -  HMS  Add  Partition  Count:  0
          -  HMS  Update  Partition  Time:  68ms
              -  HMS  Update  Partition  Count:  4
```

* [feature](iceberg) add iceberg transaction implement (#33629)

Issue #31442

add iceberg transaction

* [feature](insert)support default value when create hive table (#33666)

Issue Number: #31442

Hive 3 supports creating tables with column default values.
When using Hive 3, default values can be written to the table.

* [refactor](filesystem)refactor `filesystem` interface (#33361)

1. Rename `list` to `globList`. The path passed to this `list` needs to contain a wildcard character, and the corresponding HDFS interface is `globStatus`, so the new name is `globList`.
2. If you only need to list files under a path, use the `listFiles` operation.
3. Merge the `listLocatedFiles` function into the `listFiles` function.

* [opt](meta-cache) refine the meta cache (#33449)

1. Use `caffeine` instead of `guava cache` to get better performance
2. Add a new class `CacheFactory`

    All (Async)LoadingCache should be built from `CacheFactory`

3. Use separate executors for different caches

    1. rowCountRefreshExecutor
      For row count cache.
      Row count cache is an async loading cache, and we can ignore the result
      if cache missing or thread pool is full.
      So use a separate executor for this cache.

    2. commonRefreshExecutor
      For other caches. Other caches are sync loading caches,
      but commonRefreshExecutor is used for async refresh.
      That is, if a cache entry is missing, the value is loaded synchronously in the caller thread;
      if a cache entry needs a refresh, it is reloaded in commonRefreshExecutor.

    3. fileListingExecutor
      File listing is a heavy operation, so use a separate executor for it.
      For fileCache, the refresh operation will still use commonRefreshExecutor to trigger refresh.
      And fileListingExecutor will be used to list file.

4. Change the refresh and expire logic of caches

    For most of caches, set `refreshAfterWrite` strategy, so that
    even if the cache entry is expired, the old entry can still be
    used while new entry is being loaded.

5. Add new global variable `enable_get_row_count_from_file_list`

    Defaults to true; if false, getting the row count from the file list is disabled.
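
The load and refresh behaviour described in items 3 and 4 can be sketched with the standard library alone. This is a simplified model of refresh-after-write semantics, not Caffeine and not the `CacheFactory` from the commit: the class `RefreshAfterWriteSketch` is hypothetical, the clock is injected so the behaviour is easy to follow, and unlike a real cache it does not deduplicate concurrent refreshes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Illustrative only: sync load on miss, async reload of stale entries. */
final class RefreshAfterWriteSketch<K, V> {
    private static final class Entry<V> {
        final V value;
        final long writtenAt;

        Entry(V value, long writtenAt) {
            this.value = value;
            this.writtenAt = writtenAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final Executor refreshExecutor; // plays the role of a refresh thread pool
    private final long refreshAfter;        // in the same unit as the clock
    private final LongSupplier clock;

    RefreshAfterWriteSketch(Function<K, V> loader, Executor refreshExecutor,
                            long refreshAfter, LongSupplier clock) {
        this.loader = loader;
        this.refreshExecutor = refreshExecutor;
        this.refreshAfter = refreshAfter;
        this.clock = clock;
    }

    V get(K key) {
        Entry<V> current = entries.get(key);
        if (current == null) {
            // Cache miss: load synchronously in the caller thread.
            V loaded = loader.apply(key);
            entries.put(key, new Entry<>(loaded, clock.getAsLong()));
            return loaded;
        }
        if (clock.getAsLong() - current.writtenAt >= refreshAfter) {
            // Stale entry: hand the reload to the refresh executor ...
            refreshExecutor.execute(() ->
                    entries.put(key, new Entry<>(loader.apply(key), clock.getAsLong())));
        }
        // ... and keep serving the old value to this caller.
        return current.value;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        int[] loads = {0};
        RefreshAfterWriteSketch<String, Integer> cache = new RefreshAfterWriteSketch<String, Integer>(
                k -> ++loads[0], Runnable::run, 10L, () -> now[0]);
        System.out.println(cache.get("tbl")); // first load
        now[0] = 10L;
        System.out.println(cache.get("tbl")); // stale: old value served, reload triggered
        System.out.println(cache.get("tbl")); // reloaded value
    }
}
```

The point of the design is that a caller never blocks on a reload of an entry that already exists; only a true miss pays the loading cost in the calling thread.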

* [bugfix](hive)delete write path after hive insert (#33798)

Issue #31442

1. Delete files according to query id
2. Delete the write path after insert

* [Enhancement](multi-catalog) Rewrite `S3URI` to remove tricky virtual bucket mechanism and support different uri styles by flags. (#33858)

* [improvement](hive)add the `queryid` to the temporary file path (#34278)

Change the temporary path from `_temp_<table_name>` to `_temp_<queryid>_<table_name>`.
This guards against a user already having a table named `_temp_<table_name>`.

It also partitions the temp directories by query id.
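
A small sketch of the naming change, to show why the query id helps. The class `TempPathSketch`, its method names, and the sample query id are hypothetical; only the two name patterns come from the commit message.

```java
/** Illustrative only: the old and new temporary directory name patterns. */
final class TempPathSketch {

    /** Old pattern: _temp_<table_name>. */
    static String oldTempName(String tableName) {
        return "_temp_" + tableName;
    }

    /** New pattern: _temp_<queryid>_<table_name>. */
    static String newTempName(String queryId, String tableName) {
        return "_temp_" + queryId + "_" + tableName;
    }

    public static void main(String[] args) {
        // With the old pattern, the temp dir for table "orders" collides with
        // a user table that happens to be called "_temp_orders".
        System.out.println(oldTempName("orders"));
        // With the new pattern, two queries writing the same table get distinct dirs.
        System.out.println(newTempName("q1", "orders"));
        System.out.println(newTempName("q2", "orders"));
    }
}
```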

* [feature](Cloud) Load index data into index cache when writing data (#34046)

* [Feature](hive-writer) Implements s3 file committer. (#33937)

Issue Number: #31442

[Feature] (hive-writer) Implements s3 file committer. 

The S3 committer starts multipart uploads for all files on the BE side and then completes those multipart uploads on the FE side. A file is not visible until its multipart upload is completed, so the atomicity of a single file is guaranteed. Atomicity across multiple files is still not guaranteed, but because Hive committers have best-effort semantics, this approach shortens the window of inconsistency.
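
The visibility rule described above can be modeled with a small in-memory sketch. It does not use the AWS SDK or any Doris class: `MultipartCommitSketch` and its methods are hypothetical stand-ins. "Start" and "upload part" play the BE role and "complete" plays the FE role.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of start / upload part / complete.
final class MultipartCommitSketch {
    private final Map<String, List<String>> pendingParts = new HashMap<>();
    private final Map<String, String> pendingKeys = new HashMap<>();
    private final Map<String, String> visibleObjects = new HashMap<>();
    private int nextId = 0;

    // BE side: begin an upload; nothing becomes visible yet.
    String startUpload(String key) {
        String uploadId = "upload-" + (nextId++);
        pendingParts.put(uploadId, new ArrayList<>());
        pendingKeys.put(uploadId, key);
        return uploadId;
    }

    // BE side: add one part to a pending upload.
    void uploadPart(String uploadId, String data) {
        pendingParts.get(uploadId).add(data);
    }

    // FE side: completing the upload makes the whole file visible at once.
    void completeUpload(String uploadId) {
        List<String> parts = pendingParts.remove(uploadId);
        String key = pendingKeys.remove(uploadId);
        visibleObjects.put(key, String.join("", parts));
    }

    boolean isVisible(String key) {
        return visibleObjects.containsKey(key);
    }

    public static void main(String[] args) {
        MultipartCommitSketch store = new MultipartCommitSketch();
        String first = store.startUpload("part-0.orc");
        String second = store.startUpload("part-1.orc");
        store.uploadPart(first, "aaa");
        store.uploadPart(second, "bbb");
        store.completeUpload(first);
        // Between the two completes, only the first file is visible.
        System.out.println(store.isVisible("part-0.orc") + " " + store.isVisible("part-1.orc"));
        store.completeUpload(second);
    }
}
```

The gap between the two `completeUpload` calls is the remaining window in which multiple files are inconsistent.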

## ChangeList:
- Add `used_by_s3_committer` to `FileWriterOptions` on the BE side to start multipart uploads of files; the multipart uploads are then completed on the FE side.
- `cosn://` uses the s3 client on the FE side, because it needs to complete the multipart uploads on the FE side.
- Add `Status directoryExists(String dir)` and `Status deleteDirectory` to `FileSystem`.

---------

Co-authored-by: slothever <18522955+wsjz@users.noreply.github.com>
Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com>
Co-authored-by: Qi Chen <kaka11.chen@gmail.com>
Co-authored-by: AlexYue <yj976240184@gmail.com>
w41ter pushed a commit to w41ter/incubator-doris that referenced this pull request Jul 18, 2024
… bucket mechanism and support different uri styles by flags. (apache#33858)

Many domestic cloud vendors are compatible with the s3 protocol. However, early versions of the s3 client only generate path style http requests (aws/aws-sdk-java-v2#763) when they encounter endpoints that do not start with s3, while some cloud vendors only support virtual host style http requests.

Therefore, Doris used `forceVirtualHosted` in `S3URI` to convert the location into a virtual hosted path while still sending the request through path style.
For example, the s3 uri `s3://my-bucket/data/file.txt` is eventually parsed into:
- virtualBucket: my-bucket
- Bucket: data (the bucket must be set, otherwise the s3 client reports an error; this step is particularly tricky because of the limitations of the s3 client)
- Key: file.txt

Path style mode is then used to generate an http request that looks like a virtual host style request: the endpoint is set to virtualBucket + original endpoint, and the bucket and key are set as above.
**The bucket and key here are inconsistent with the original concepts of s3; the aws client just happens to generate an http request similar to the virtual host style through the path style mode.**
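
A small sketch shows why the trick works. Everything in it is an assumption for illustration: the class and method names are hypothetical, the endpoint `oss.example.com` is made up, and joining virtualBucket and endpoint with a `.` is assumed rather than taken from the Doris source.

```java
// Hypothetical sketch of the legacy virtual bucket trick described above.
final class VirtualBucketSketch {
    // Path style: https://<endpoint>/<bucket>/<key>
    static String pathStyleUrl(String endpoint, String bucket, String key) {
        return "https://" + endpoint + "/" + bucket + "/" + key;
    }

    // Virtual host style: https://<bucket>.<endpoint>/<key>
    static String virtualHostStyleUrl(String endpoint, String bucket, String key) {
        return "https://" + bucket + "." + endpoint + "/" + key;
    }

    // Legacy mechanism: split s3://<virtualBucket>/<bucket>/<key>, prepend the
    // virtual bucket to the endpoint, then build a path style URL.
    static String legacyVirtualBucketUrl(String endpoint, String location) {
        String rest = location.substring("s3://".length());
        String[] parts = rest.split("/", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("need at least two path segments: " + location);
        }
        String virtualBucket = parts[0];
        String bucket = parts[1]; // really the first segment of the object key
        String key = parts[2];
        return pathStyleUrl(virtualBucket + "." + endpoint, bucket, key);
    }

    public static void main(String[] args) {
        String endpoint = "oss.example.com"; // made-up endpoint
        System.out.println(legacyVirtualBucketUrl(endpoint, "s3://my-bucket/data/file.txt"));
        System.out.println(virtualHostStyleUrl(endpoint, "my-bucket", "data/file.txt"));
    }
}
```

Both calls in `main` build the same URL, even though the legacy path treats `data` as the bucket and `file.txt` as the key.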

However, apache#30799 upgraded the aws sdk version from 2.17.257 to 2.20.131. The newer aws s3 client already generates virtual host style http requests for third-party endpoints by default, so apache#31111 had to set the path style option to let the s3 client keep working with Doris' virtual bucket mechanism.

**Finally, the virtual bucket mechanism is too confusing and tricky, and we no longer need it with the new version of s3 client.**

Rewrite `S3URI` to remove the tricky virtual bucket mechanism and support different uri styles via flags.

This class represents a fully qualified location in S3 for input/output operations, expressed as a URI.
 #### For AWS S3, URI common styles:
  - AWS Client Style(Hadoop S3 Style): `s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
  - Virtual Host Style: `https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
  - Path Style: `https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`

  For the common styles above, <code>isPathStyle</code> selects between path style and virtual host style.
  Virtual host style is the mainstream, recommended approach, so <code>isPathStyle</code> defaults to false.

  #### Other Styles:
  - Virtual Host AWS Client (Hadoop S3) Mixed Style:
    `s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
  - Path AWS Client (Hadoop S3) Mixed Style:
     `s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`

  These two styles are selected by combining <code>isPathStyle</code> and <code>forceParsingByStandardUri</code>:
  - Virtual Host AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = false && forceParsingByStandardUri = true</code>
  - Path AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = true && forceParsingByStandardUri = true</code>

  When the incoming location is URL-encoded, it is not decoded:
  <code>getKey()</code> and <code>getQueryParams()</code> return the encoded strings.