
Conversation

@BePPPower (Contributor) commented Mar 13, 2025

Problem Summary:

The output formats of complex data types, such as array, map, and struct, differ between Hive and Doris.
When users migrate from Hive to Doris, they expect the same formats so that they don't need to modify their business code.

This PR mainly changes:

Add a new option to the session variable `serde_dialect`: if set to `hive`,
the output format of some data types returned to the MySQL client will be changed:

Array
Doris: ["abc", "def", "", null, 1]
Hive: ["abc","def","",null,true]

Map
Doris: {"k1":null, "k2":"v3"}
Hive: {"k1":null,"k2":"v3"}

Struct
Doris: {"s_id":100, "s_name":"abc , "", "s_address":null}
Hive: {"s_id":100,"s_name":"abc ,"","s_address":null}

Related #37039
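For illustration, the two dialects' rendering rules can be sketched in Python. This is a toy model based only on the examples above, not Doris's actual BE serde code: Hive-style output drops the space after separators and prints booleans as `true`/`false`, while Doris-style output uses `", "` and `1`/`0`.

```python
def render(value, dialect="doris"):
    """Render a value roughly the way each serde_dialect prints it.

    Toy sketch based on the examples in this PR description; the real
    conversion lives in the BE serde classes, not here.
    """
    sep = "," if dialect == "hive" else ", "
    if value is None:
        return "null"
    if isinstance(value, bool):
        # Hive prints booleans as true/false; Doris prints 1/0.
        return ("true" if value else "false") if dialect == "hive" else ("1" if value else "0")
    if isinstance(value, str):
        return '"%s"' % value
    if isinstance(value, list):
        return "[" + sep.join(render(v, dialect) for v in value) + "]"
    if isinstance(value, dict):
        return "{" + sep.join(
            "%s:%s" % (render(k, dialect), render(v, dialect))
            for k, v in value.items()) + "}"
    return str(value)

print(render(["abc", "def", "", None, True]))          # Doris-style output
print(render(["abc", "def", "", None, True], "hive"))  # Hive-style output
```

With the array from the description, the sketch reproduces both formats shown above: the Doris dialect yields `["abc", "def", "", null, 1]` and the hive dialect yields `["abc","def","",null,true]`.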

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull Request Overview

This PR adds Hive support for the serde dialect, allowing users to obtain a Hive-compatible output for complex data types. Key changes include:

  • Adding a new case for "hive" in the NereidsPlanner to set format options.
  • Extending the allowed values and return mapping for serde dialect in SessionVariable.

Reviewed Changes

Copilot reviewed 2 out of 9 changed files in this pull request and generated no comments.

File Description
fe/fe-core/src/main/java/org/apache/doris/nereids/NereidsPlanner.java Added hive case in switch block to set format options for hive.
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java Updated serde dialect validation to include hive and return enum.
Files not reviewed (7)
  • be/src/vec/data_types/serde/data_type_array_serde.cpp: Language not supported
  • be/src/vec/data_types/serde/data_type_map_serde.cpp: Language not supported
  • be/src/vec/data_types/serde/data_type_number_serde.cpp: Language not supported
  • be/src/vec/data_types/serde/data_type_serde.h: Language not supported
  • be/src/vec/data_types/serde/data_type_struct_serde.cpp: Language not supported
  • be/src/vec/sink/vmysql_result_writer.cpp: Language not supported
  • gensrc/thrift/PaloInternalService.thrift: Language not supported
Comments suppressed due to low confidence (2)

fe/fe-core/src/main/java/org/apache/doris/nereids/NereidsPlanner.java:804

  • Ensure that using FormatOptions.getDefault() for the hive dialect produces Hive-compatible output as expected; if additional formatting adjustments are needed for Hive, consider using or creating a dedicated FormatOptions method.
case "hive":

fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java:4519

  • The error message mistakenly refers to 'sqlDialect' instead of 'serdeDialect', which could confuse users; please update it for accuracy.
throw new UnsupportedOperationException("sqlDialect value is invalid, the invalid value is " + serdeDialect);

@BePPPower force-pushed the supportHiveComplexFormat branch from 81cedc1 to eb5ed02 on April 1, 2025 07:26
@morningman previously approved these changes Apr 2, 2025
Contributor

@morningman left a comment

LGTM

github-actions bot added the `approved` label (Indicates a PR has been approved by one committer.) Apr 2, 2025
@github-actions bot commented Apr 2, 2025

PR approved by at least one committer and no changes requested.

@github-actions bot commented Apr 2, 2025

PR approved by anyone and no changes requested.

github-actions bot removed the `approved` label (Indicates a PR has been approved by one committer.) Apr 2, 2025
@BePPPower (Contributor, Author)

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.05% (1088/1310)
Line Coverage: 66.22% (18146/27401)
Region Coverage: 65.52% (8919/13612)
Branch Coverage: 55.38% (4805/8676)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c3c12b10c79e9cce98b82a7b454beb82f56e7eb2_c3c12b10c79e9cce98b82a7b454beb82f56e7eb2_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 34258 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c3c12b10c79e9cce98b82a7b454beb82f56e7eb2, data reload: false

------ Round 1 ----------------------------------
q1	25998	5061	5023	5023
q2	2059	270	185	185
q3	10614	1260	696	696
q4	10245	1013	528	528
q5	7846	3001	2365	2365
q6	208	170	138	138
q7	918	760	637	637
q8	9337	1308	1079	1079
q9	6805	5135	5160	5135
q10	6831	2296	1876	1876
q11	503	302	271	271
q12	360	355	230	230
q13	17790	3683	3101	3101
q14	231	242	211	211
q15	527	480	474	474
q16	630	612	574	574
q17	619	850	388	388
q18	7526	7297	7099	7099
q19	1802	972	575	575
q20	328	334	228	228
q21	4201	3411	2504	2504
q22	1045	1001	941	941
Total cold run time: 116423 ms
Total hot run time: 34258 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5194	5131	5155	5131
q2	247	325	233	233
q3	2216	2667	2276	2276
q4	1453	1966	1559	1559
q5	4561	4375	4324	4324
q6	216	161	125	125
q7	2007	1926	1792	1792
q8	2627	2580	2601	2580
q9	7180	7236	7055	7055
q10	2954	3172	2769	2769
q11	560	499	500	499
q12	687	786	633	633
q13	3571	3882	3298	3298
q14	281	317	276	276
q15	519	473	476	473
q16	667	680	644	644
q17	1145	1560	1409	1409
q18	7805	7335	7523	7335
q19	856	829	931	829
q20	1976	1953	1854	1854
q21	5178	4776	4670	4670
q22	1087	1007	990	990
Total cold run time: 52987 ms
Total hot run time: 50754 ms

@doris-robot

TPC-DS: Total hot run time: 186192 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c3c12b10c79e9cce98b82a7b454beb82f56e7eb2, data reload: false

query1	1006	472	468	468
query2	6556	1902	1873	1873
query3	6744	228	220	220
query4	26145	23623	23162	23162
query5	4426	631	458	458
query6	299	219	210	210
query7	4620	506	285	285
query8	312	258	240	240
query9	8615	2579	2576	2576
query10	496	316	257	257
query11	15689	15050	14801	14801
query12	177	109	103	103
query13	1652	506	395	395
query14	9445	6126	6189	6126
query15	206	188	165	165
query16	7294	623	476	476
query17	1179	733	561	561
query18	1970	391	298	298
query19	190	187	158	158
query20	126	129	121	121
query21	209	120	104	104
query22	4149	4169	4045	4045
query23	33741	32888	33020	32888
query24	8500	2405	2403	2403
query25	549	452	393	393
query26	1227	265	147	147
query27	2765	503	329	329
query28	4374	2406	2390	2390
query29	761	561	442	442
query30	290	217	189	189
query31	946	891	814	814
query32	77	69	63	63
query33	573	377	324	324
query34	791	860	503	503
query35	814	895	747	747
query36	949	960	866	866
query37	115	101	75	75
query38	4154	4101	4134	4101
query39	1456	1398	1380	1380
query40	213	119	105	105
query41	67	55	68	55
query42	122	110	109	109
query43	479	504	482	482
query44	1290	802	797	797
query45	176	172	175	172
query46	930	1014	629	629
query47	1767	1816	1705	1705
query48	388	411	300	300
query49	792	525	417	417
query50	639	669	416	416
query51	4200	4159	4130	4130
query52	112	117	101	101
query53	233	255	184	184
query54	577	583	501	501
query55	86	82	77	77
query56	319	287	300	287
query57	1144	1159	1065	1065
query58	301	258	261	258
query59	2564	2659	2456	2456
query60	332	333	321	321
query61	166	129	128	128
query62	788	730	667	667
query63	232	189	182	182
query64	4373	1040	714	714
query65	4290	4246	4280	4246
query66	1128	427	317	317
query67	15782	15377	15361	15361
query68	5991	829	524	524
query69	478	309	254	254
query70	1208	1085	1126	1085
query71	414	325	299	299
query72	5964	4881	4958	4881
query73	654	649	369	369
query74	8845	8993	8715	8715
query75	3208	3207	2710	2710
query76	3240	1181	747	747
query77	476	388	290	290
query78	9999	10153	9310	9310
query79	1919	820	561	561
query80	660	512	537	512
query81	493	262	226	226
query82	371	128	95	95
query83	258	262	245	245
query84	254	111	86	86
query85	760	357	321	321
query86	381	292	303	292
query87	4398	4424	4295	4295
query88	2882	2276	2265	2265
query89	383	316	279	279
query90	1876	218	216	216
query91	147	145	116	116
query92	75	67	59	59
query93	1766	948	583	583
query94	671	423	305	305
query95	384	352	290	290
query96	489	560	282	282
query97	3127	3208	3119	3119
query98	229	217	207	207
query99	1340	1402	1316	1316
Total cold run time: 270417 ms
Total hot run time: 186192 ms

@doris-robot

ClickBench: Total hot run time: 31.1 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c3c12b10c79e9cce98b82a7b454beb82f56e7eb2, data reload: false

query1	0.04	0.03	0.03
query2	0.13	0.11	0.10
query3	0.25	0.19	0.19
query4	1.59	0.19	0.19
query5	0.59	0.57	0.59
query6	1.20	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.58	0.51	0.51
query10	0.58	0.58	0.57
query11	0.16	0.11	0.11
query12	0.15	0.11	0.11
query13	0.61	0.60	0.59
query14	2.73	2.81	2.68
query15	0.92	0.84	0.84
query16	0.37	0.38	0.39
query17	1.06	1.06	1.03
query18	0.21	0.20	0.20
query19	1.93	1.92	1.88
query20	0.01	0.01	0.01
query21	15.35	0.91	0.56
query22	0.74	1.16	0.63
query23	15.00	1.38	0.62
query24	7.12	0.78	1.22
query25	0.47	0.24	0.08
query26	0.59	0.16	0.14
query27	0.05	0.05	0.04
query28	9.41	0.87	0.43
query29	12.54	3.89	3.30
query30	0.24	0.09	0.07
query31	2.82	0.58	0.38
query32	3.23	0.54	0.48
query33	2.95	3.08	3.05
query34	15.92	5.11	4.49
query35	4.52	4.50	4.50
query36	0.66	0.50	0.48
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.16	0.13	0.12
query41	0.08	0.03	0.02
query42	0.04	0.03	0.03
query43	0.04	0.03	0.02
Total cold run time: 105.27 s
Total hot run time: 31.1 s

@BePPPower (Contributor, Author)

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.05% (1088/1310)
Line Coverage: 66.08% (18106/27401)
Region Coverage: 65.44% (8908/13612)
Branch Coverage: 55.23% (4792/8676)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c3c12b10c79e9cce98b82a7b454beb82f56e7eb2_c3c12b10c79e9cce98b82a7b454beb82f56e7eb2_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 35167 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c3c12b10c79e9cce98b82a7b454beb82f56e7eb2, data reload: false

------ Round 1 ----------------------------------
q1	26115	5081	5053	5053
q2	2063	270	186	186
q3	10594	1254	706	706
q4	10254	1013	557	557
q5	7872	2415	2361	2361
q6	197	163	134	134
q7	920	754	614	614
q8	9324	1359	1155	1155
q9	6857	5105	5071	5071
q10	6830	2351	1887	1887
q11	488	294	277	277
q12	349	360	226	226
q13	17818	3731	3060	3060
q14	239	223	218	218
q15	546	496	474	474
q16	620	616	572	572
q17	640	870	381	381
q18	7719	7102	7200	7102
q19	1708	951	563	563
q20	342	334	228	228
q21	4477	3460	3373	3373
q22	1033	1004	969	969
Total cold run time: 117005 ms
Total hot run time: 35167 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5188	5134	5113	5113
q2	241	328	228	228
q3	2152	2623	2286	2286
q4	1463	2027	1544	1544
q5	4515	4430	4417	4417
q6	220	166	128	128
q7	2010	1917	1820	1820
q8	2663	2551	2540	2540
q9	7322	6979	7239	6979
q10	3012	3152	2753	2753
q11	594	489	500	489
q12	689	800	661	661
q13	3507	3942	3390	3390
q14	279	285	285	285
q15	532	470	498	470
q16	645	678	647	647
q17	1168	1613	1384	1384
q18	7845	7615	7485	7485
q19	862	806	854	806
q20	1968	1984	1904	1904
q21	5285	5092	5019	5019
q22	1083	1095	1012	1012
Total cold run time: 53243 ms
Total hot run time: 51360 ms

@doris-robot

TPC-DS: Total hot run time: 193735 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c3c12b10c79e9cce98b82a7b454beb82f56e7eb2, data reload: false

query1	1432	1063	1074	1063
query2	6175	1934	1890	1890
query3	11079	4562	4688	4562
query4	52211	25003	23605	23605
query5	5192	568	458	458
query6	360	216	209	209
query7	4940	507	284	284
query8	324	255	249	249
query9	5948	2614	2612	2612
query10	422	314	263	263
query11	15431	15199	15047	15047
query12	161	114	102	102
query13	1081	534	399	399
query14	10162	6467	6356	6356
query15	198	200	186	186
query16	7071	652	506	506
query17	1092	698	558	558
query18	1519	409	312	312
query19	192	190	177	177
query20	130	129	125	125
query21	207	127	106	106
query22	4461	4490	4349	4349
query23	34038	33394	33476	33394
query24	6568	2491	2418	2418
query25	463	476	390	390
query26	710	278	147	147
query27	2423	497	340	340
query28	3136	2460	2472	2460
query29	593	564	425	425
query30	286	223	204	204
query31	874	894	766	766
query32	72	64	67	64
query33	456	346	323	323
query34	762	859	517	517
query35	814	829	765	765
query36	944	1013	919	919
query37	127	101	76	76
query38	4160	4295	4114	4114
query39	1484	1466	1432	1432
query40	222	126	112	112
query41	60	57	56	56
query42	132	113	107	107
query43	469	507	492	492
query44	1306	897	812	812
query45	183	179	166	166
query46	844	1034	639	639
query47	1821	1929	1802	1802
query48	404	424	299	299
query49	721	526	442	442
query50	672	703	411	411
query51	4175	4280	4389	4280
query52	111	105	99	99
query53	242	268	187	187
query54	576	625	500	500
query55	86	83	81	81
query56	313	294	291	291
query57	1177	1187	1136	1136
query58	260	269	281	269
query59	2769	2872	2624	2624
query60	334	336	313	313
query61	129	137	134	134
query62	715	747	674	674
query63	237	186	180	180
query64	1749	1040	716	716
query65	4450	4290	4377	4290
query66	729	393	307	307
query67	15954	15611	15432	15432
query68	6704	896	521	521
query69	524	301	257	257
query70	1226	1128	1155	1128
query71	503	309	281	281
query72	5791	4726	4840	4726
query73	1322	620	350	350
query74	9006	9444	8860	8860
query75	3935	3184	2739	2739
query76	4219	1190	753	753
query77	688	381	290	290
query78	10047	10085	9242	9242
query79	2607	793	568	568
query80	651	512	439	439
query81	490	269	227	227
query82	517	126	97	97
query83	255	246	307	246
query84	299	107	84	84
query85	762	343	309	309
query86	376	318	299	299
query87	4425	4401	4364	4364
query88	3500	2199	2203	2199
query89	412	311	276	276
query90	1807	209	207	207
query91	139	140	109	109
query92	80	58	61	58
query93	2162	950	574	574
query94	688	411	301	301
query95	364	291	288	288
query96	494	562	272	272
query97	3185	3283	3169	3169
query98	229	209	199	199
query99	1380	1424	1280	1280
Total cold run time: 297325 ms
Total hot run time: 193735 ms

@doris-robot

ClickBench: Total hot run time: 30.73 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c3c12b10c79e9cce98b82a7b454beb82f56e7eb2, data reload: false

query1	0.04	0.04	0.03
query2	0.13	0.10	0.10
query3	0.24	0.20	0.20
query4	1.59	0.20	0.11
query5	0.59	0.55	0.57
query6	1.19	0.73	0.73
query7	0.02	0.01	0.02
query8	0.04	0.03	0.04
query9	0.58	0.51	0.53
query10	0.58	0.62	0.58
query11	0.16	0.10	0.10
query12	0.15	0.11	0.11
query13	0.62	0.61	0.61
query14	2.87	2.70	2.69
query15	0.95	0.88	0.89
query16	0.39	0.40	0.40
query17	1.03	1.05	1.07
query18	0.19	0.18	0.19
query19	2.08	1.81	1.86
query20	0.02	0.02	0.01
query21	15.36	0.90	0.54
query22	0.75	1.16	0.63
query23	15.06	1.42	0.63
query24	7.45	1.20	0.32
query25	0.42	0.24	0.11
query26	0.62	0.16	0.14
query27	0.06	0.05	0.05
query28	9.16	0.92	0.43
query29	12.54	3.97	3.30
query30	0.26	0.08	0.06
query31	2.82	0.63	0.39
query32	3.24	0.59	0.49
query33	3.05	3.05	3.05
query34	15.67	5.29	4.55
query35	4.62	4.55	4.59
query36	0.66	0.51	0.49
query37	0.08	0.06	0.07
query38	0.06	0.04	0.04
query39	0.03	0.02	0.02
query40	0.18	0.13	0.13
query41	0.08	0.03	0.02
query42	0.03	0.03	0.02
query43	0.03	0.03	0.02
Total cold run time: 105.69 s
Total hot run time: 30.73 s

@morningman merged commit b43120b into apache:master Apr 7, 2025
32 of 34 checks passed
github-actions bot pushed a commit that referenced this pull request Apr 7, 2025
github-actions bot pushed a commit that referenced this pull request Apr 7, 2025
yiguolei pushed a commit that referenced this pull request Apr 11, 2025
…#49831)

Cherry-picked from #49036

Co-authored-by: Tiewei Fang <fangtiewei@selectdb.com>
morningman added a commit that referenced this pull request Apr 11, 2025
morningman added a commit that referenced this pull request Apr 16, 2025
### What problem does this PR solve?

1.
In #49036, we only supported the hive serde dialect on the BE side.
But some constant exprs are evaluated and output on the FE side, so they
need to be supported there too.

2.
Refactor the methods for getting the string format value of all types of
literals on the FE side.

There are two kinds of string format values for a literal: one for queries,
the other for Stream Load.
Here are some differences:

- NullLiteral
    For query, it should be `null`. For load, it should be `\N`.

- StructLiteral
For query, it should be `{"k1":"v1", "k2":null, "k3":"", "k4":"a"}`.
For load, it should be `{"v1", null, "", "a"}`.

So we need two different methods to distinguish them:
`getStringValueForQuery` and `getStringValueForStreamLoad`.
Some old and messy methods were also removed or renamed.

**Examples**

- `Doris/Hive/Presto` show the query result format for different column types
when `serde_dialect` is set to each of these values.
- `Stream Load` shows what the format should look like in CSV when loading
into the table.

| Type | Doris | Hive | Presto | Stream Load | Comment |
| --- | --- | --- | --- | --- | --- |
| Bool | `1`, `0` | `1`, `0` | `1`, `0` | `1\|0`, `true\|false` | |
| Integer | `1`, `1000` | `1`, `1000` | `1`, `1000` | `1\|1000` | |
| Float/Decimal | `1.2`, `3.00` | `1.2`, `3.00` | `1.2`, `3.00` | `1.2\|3.00` | |
| Date/Datetime | `2025-01-01`, `2025-01-01 10:11:11` | `2025-01-01`, `2025-01-01 10:11:11` | `2025-01-01`, `2025-01-01 10:11:11` | `2025-01-01\|2025-01-01 10:11:11` | |
| String | `abc`, `中国` | `abc`, `中国` | `abc`, `中国` | `abc,中国` | |
| Null | `null` | `null` | `NULL` | `\N` | |
| Array<bool> | `[1, 0]` | `[true,false]` | `[1, 0]` | `[1, 0]`, `[true, false]` | |
| Array<int> | `[1, 1000]` | `[1,1000]` | `[1, 1000]` | `[1, 1000]` | |
| Array<string> | `["abc", "中国"]` | `["abc","中国"]` | `["abc", "中国"]` | `["abc", "中国"]` | |
| Array<date/datetime> | `["2025-01-01", "2025-01-01 10:11:11"]` | `["2025-01-01","2025-01-01 10:11:11"]` | `["2025-01-01", "2025-01-01 10:11:11"]` | `["2025-01-01", "2025-01-01 10:11:11"]` | |
| Array<null> | `[null]` | `[null]` | `[NULL]` | `[null]` | |
| Map<int, string> | `{1:"abc", 2:"中国"}` | `{1:"abc",2:"中国"}` | `{1=abc, 2=中国}` | `{1:"abc", 2:"中国"}` | |
| Map<string, date/datetime> | `{"k1":"2022-10-01", "k2":"2022-10-01 10:10:10"}` | `{"k1":"2022-10-01","k2":"2022-10-01 10:10:10"}` | `{k1=2022-10-01, k2=2022-10-01 10:10:10}` | `{"k1":"2022-10-01", "k2":"2022-10-01 10:10:10"}` | |
| Map<int, null> | `{1:null, 2:null}` | `{1:null,2:null}` | `{1=NULL, 2=NULL}` | `{1:null, 2:null}` | |
| Struct<> | Same as map | Same as map | Same as map | Same as map | |

3.
Fix a bug: for batch insert transactions, `trim_double_quotas` should be
set to false.
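As a sketch of the query-vs-load distinction, the two Python helpers below mirror the behavior described for `getStringValueForQuery` and `getStringValueForStreamLoad` on NullLiteral and StructLiteral. The names and logic are illustrative assumptions only, not the actual FE Java code:

```python
def for_query(value):
    """Query output: NULL renders as 'null'; struct literals keep field names."""
    if value is None:
        return "null"
    if isinstance(value, dict):  # struct literal
        return "{" + ", ".join(
            '"%s":%s' % (k, for_query(v)) for k, v in value.items()) + "}"
    return '"%s"' % value


def for_stream_load(value):
    r"""CSV load output: a top-level NULL renders as \N; struct field names are dropped."""
    if value is None:
        return r"\N"
    if isinstance(value, dict):  # struct literal: values only, nested nulls stay "null"
        return "{" + ", ".join(
            "null" if v is None else '"%s"' % v for v in value.values()) + "}"
    return '"%s"' % value
```

Note the asymmetry the PR calls out: a top-level NULL becomes `\N` only in the load path, while nulls nested inside a struct stay `null` in both paths.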
seawinde pushed a commit to seawinde/doris that referenced this pull request Apr 17, 2025
BePPPower added a commit to BePPPower/doris that referenced this pull request Apr 27, 2025
@morningman added the `usercase` label (Important user case type label) Apr 28, 2025
dataroaring pushed a commit that referenced this pull request Apr 30, 2025
dataroaring pushed a commit that referenced this pull request May 6, 2025
morningman added a commit to morningman/doris that referenced this pull request May 14, 2025
BiteTheDDDDt pushed a commit that referenced this pull request May 21, 2025
…and use _nesting_level. (#50977)

We don't need to maintain a separate level; we can achieve the
functionality of #49036 by directly using `_nesting_level`.

```C++
    // This parameter indicates what level the serde belongs to and is mainly used for complex types
    // The default level is 1, and each time you nest, the level increases by 1,
    // for example: struct<string>
    // The _nesting_level of StructSerde is 1
    // The _nesting_level of StringSerde is 2
    int _nesting_level = 1;
```
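To illustrate how the level grows with nesting, here is a toy Python parser. It is an assumption for illustration only: Doris assigns `_nesting_level` during serde construction in C++, not like this. The sketch assigns a level to each type in a single-child type string:

```python
def nesting_levels(type_str, level=1, out=None):
    """Assign a nesting level to each type in a string like 'struct<array<string>>'.

    Handles only single-child nesting -- just enough to show that
    each '<' increases the level by 1, as the comment above describes.
    """
    if out is None:
        out = []
    name, _, inner = type_str.partition("<")  # outer type name before first '<'
    out.append((name, level))
    if inner:
        nesting_levels(inner.rstrip(">"), level + 1, out)  # recurse one level deeper
    return out

print(nesting_levels("struct<string>"))  # [('struct', 1), ('string', 2)]
```

For `struct<string>`, this matches the example in the comment: StructSerde at level 1, StringSerde at level 2.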
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…and use _nesting_level. (apache#50977)

Labels

`approved` (Indicates a PR has been approved by one committer.), `dev/3.0.6-merged`, `reviewed`, `usercase` (Important user case type label)

9 participants