[Fix](Serde) Support hive compatible output format #49036
Pull Request Overview
This PR adds Hive support for the serde dialect, allowing users to obtain a Hive-compatible output for complex data types. Key changes include:
- Adding a new case for "hive" in the NereidsPlanner to set format options.
- Extending the allowed values and return mapping for serde dialect in SessionVariable.
Reviewed Changes
Copilot reviewed 2 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/nereids/NereidsPlanner.java | Added hive case in switch block to set format options for hive. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Updated serde dialect validation to include hive and return enum. |
Files not reviewed (7)
- be/src/vec/data_types/serde/data_type_array_serde.cpp: Language not supported
- be/src/vec/data_types/serde/data_type_map_serde.cpp: Language not supported
- be/src/vec/data_types/serde/data_type_number_serde.cpp: Language not supported
- be/src/vec/data_types/serde/data_type_serde.h: Language not supported
- be/src/vec/data_types/serde/data_type_struct_serde.cpp: Language not supported
- be/src/vec/sink/vmysql_result_writer.cpp: Language not supported
- gensrc/thrift/PaloInternalService.thrift: Language not supported
Comments suppressed due to low confidence (2)
fe/fe-core/src/main/java/org/apache/doris/nereids/NereidsPlanner.java:804
- Ensure that using FormatOptions.getDefault() for the hive dialect produces Hive-compatible output as expected; if additional formatting adjustments are needed for Hive, consider using or creating a dedicated FormatOptions method.
case "hive":
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java:4519
- The error message mistakenly refers to 'sqlDialect' instead of 'serdeDialect', which could confuse users; please update it for accuracy.
throw new UnsupportedOperationException("sqlDialect value is invalid, the invalid value is " + serdeDialect);
81cedc1 to eb5ed02
morningman
left a comment
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
run buildall
TeamCity cloud ut coverage result:
TPC-H: Total hot run time: 34258 ms
TPC-DS: Total hot run time: 186192 ms
ClickBench: Total hot run time: 31.1 s
run buildall
TeamCity cloud ut coverage result:
TPC-H: Total hot run time: 35167 ms
TPC-DS: Total hot run time: 193735 ms
ClickBench: Total hot run time: 30.73 s
Problem Summary:
The output formats of complex data types such as array, map and struct
differ between Hive and Doris.
When users migrate from Hive to Doris, they expect the same format so
that they don't need to modify their business code.
This PR mainly changes:
Add a new option to the session variable `serde_dialect`: if set to `hive`,
the output format returned to the MySQL client changes for some data
types:
Array
Doris: ["abc", "def", "", null, 1]
Hive: ["abc","def","",null,true]
Map
Doris: {"k1":null, "k2":"v3"}
Hive: {"k1":null,"k2":"v3"}
Struct
Doris: {"s_id":100, "s_name":"abc , "", "s_address":null}
Hive: {"s_id":100,"s_name":"abc ,"","s_address":null}
Related #37039
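The difference in the array examples comes down to two things: the delimiter between elements (`, ` for Doris, `,` for Hive) and how booleans are printed (`1`/`0` versus `true`/`false`). A minimal sketch of that idea follows. The names `ArrayFormat`, `NullableBool`, `doris_format`, `hive_format` and `format_bool_array` are hypothetical and are not taken from the Doris code base.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical formatting options; illustrative only, not Doris's FormatOptions.
struct ArrayFormat {
    std::string delimiter;  // text placed between elements
    bool bool_as_word;      // true: print true/false, false: print 1/0
};

// Hypothetical nullable boolean; is_null marks SQL NULL.
struct NullableBool {
    bool is_null;
    bool value;
};

inline ArrayFormat doris_format() { return {", ", false}; }
inline ArrayFormat hive_format() { return {",", true}; }

// Renders an array<boolean> using the delimiter and boolean style in fmt.
inline std::string format_bool_array(const std::vector<NullableBool>& values,
                                     const ArrayFormat& fmt) {
    std::string out = "[";
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i > 0) out += fmt.delimiter;
        if (values[i].is_null) {
            out += "null";
        } else if (fmt.bool_as_word) {
            out += values[i].value ? "true" : "false";
        } else {
            out += values[i].value ? "1" : "0";
        }
    }
    out += "]";
    return out;
}
```

With the values true, false and NULL, the sketch yields `[1, 0, null]` for the Doris style and `[true,false,null]` for the Hive style.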
### What problem does this PR solve?

1. In #49036, the Hive serde dialect was supported only on the BE side. Some constant expressions are evaluated and output on the FE side, so the FE needs to support it too.
2. Refactor the method of getting the string-format value for all types of literals on the FE side. There are two kinds of string-format value for a literal: one for Query, the other for Stream Load. Some differences:
   - NullLiteral: for a query, it should be `null`. For a load, it should be `\N`.
   - StructLiteral: for a query, it should be `{"k1":"v1", "k2":null, "k3":"", "k4":"a"}`. For a load, it should be `{"v1", null, "", "a"}`.

   So we need two different methods to distinguish them: `getStringValueForQuery` and `getStringValueForStreamLoad`. I also removed or renamed some old and messy methods.

   **Examples**

   - `Doris/Hive/Presto` shows the format of the query result for each column type when `serde_dialect` is set to that value.
   - `Stream Load` shows what the value should look like in CSV format when loading into the table.

   | Type | Doris | Hive | Presto | Stream Load | Comment |
   | --- | --- | --- | --- | --- | --- |
   | Bool | `1`, `0` | `1`, `0` | `1`, `0` | `1\|0`, `true\|false` | |
   | Integer | `1`, `1000` | `1`, `1000` | `1`, `1000` | `1\|1000` | |
   | Float/Decimal | `1.2`, `3.00` | `1.2`, `3.00` | `1.2`, `3.00` | `1.2\|3.00` | |
   | Date/Datetime | `2025-01-01`, `2025-01-01 10:11:11` | `2025-01-01`, `2025-01-01 10:11:11` | `2025-01-01`, `2025-01-01 10:11:11` | `2025-01-01\|2025-01-01 10:11:11` | |
   | String | `abc`, `中国` | `abc`, `中国` | `abc`, `中国` | `abc,中国` | |
   | Null | `null` | `null` | `NULL` | `\N` | |
   | `Array<bool>` | `[1, 0]` | `[true,false]` | `[1, 0]` | `[1, 0]`, `[true, false]` | |
   | `Array<int>` | `[1, 1000]` | `[1,1000]` | `[1, 1000]` | `[1, 1000]` | |
   | `Array<string>` | `["abc", "中国"]` | `["abc","中国"]` | `["abc", "中国"]` | `["abc", "中国"]` | |
   | `Array<date/datetime>` | `["2025-01-01", "2025-01-01 10:11:11"]` | `["2025-01-01","2025-01-01 10:11:11"]` | `["2025-01-01", "2025-01-01 10:11:11"]` | `["2025-01-01", "2025-01-01 10:11:11"]` | |
   | `Array<null>` | `[null]` | `[null]` | `[NULL]` | `[null]` | |
   | `Map<int, string>` | `{1:"abc", 2:"中国"}` | `{1:"abc",2:"中国"}` | `{1=abc, 2=中国}` | `{1:"abc", 2:"中国"}` | |
   | `Map<string, date/datetime>` | `{"k1":"2022-10-01", "k2":"2022-10-01 10:10:10"}` | `{"k1":"2022-10-01","k2":"2022-10-01 10:10:10"}` | `{k1=2022-10-01, k2=2022-10-01 10:10:10}` | `{"k1":"2022-10-01", "k2":"2022-10-01 10:10:10"}` | |
   | `Map<int, null>` | `{1:null, 2:null}` | `{1:null,2:null}` | `{1=NULL, 2=NULL}` | `{1:null, 2:null}` | |
   | `Struct<>` | Same as map | Same as map | Same as map | Same as map | |

3. Fix a bug: for batch insert transactions, `trim_double_quotas` should be set to false.
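The split between the query format and the Stream Load format described in item 2 can be sketched as two separate rendering functions. This is only an illustration written in C++; the real methods (`getStringValueForQuery`, `getStringValueForStreamLoad`) live in the Java FE, and every name below (`Field`, `null_for_query`, `struct_for_query`, and so on) is hypothetical. The sketch handles string-typed fields only and does not escape embedded quotes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one string-typed struct field; is_null marks SQL NULL.
struct Field {
    std::string name;
    std::string value;
    bool is_null;
};

// Top-level NULL literal: a query prints null, Stream Load CSV expects \N.
inline std::string null_for_query() { return "null"; }
inline std::string null_for_stream_load() { return "\\N"; }

// A nested value is quoted, and a nested NULL is printed as null in both formats.
inline std::string quoted_or_null(const Field& f) {
    return f.is_null ? std::string("null") : "\"" + f.value + "\"";
}

// Query format keeps the field names: {"k1":"v1", "k2":null}
inline std::string struct_for_query(const std::vector<Field>& fields) {
    std::string out = "{";
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) out += ", ";
        out += "\"" + fields[i].name + "\":" + quoted_or_null(fields[i]);
    }
    return out + "}";
}

// Stream Load format drops the field names: {"v1", null}
inline std::string struct_for_stream_load(const std::vector<Field>& fields) {
    std::string out = "{";
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) out += ", ";
        out += quoted_or_null(fields[i]);
    }
    return out + "}";
}
```

Keeping the two renderers separate, rather than passing a flag through one function, mirrors the reasoning in the description: each call site states which format it needs.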
…and use _nesting_level. (#50977)

We don't need to maintain a separate level; we can achieve the functionality of #49036 by directly using _nesting_level.

```C++
// This parameter indicates what level the serde belongs to and is mainly used for complex types
// The default level is 1, and each time you nest, the level increases by 1,
// for example: struct<string>
// The _nesting_level of StructSerde is 1
// The _nesting_level of StringSerde is 2
int _nesting_level = 1;
```
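To illustrate how a per-serde nesting level can drive formatting, here is a small sketch in which a child serde is constructed with its parent's level plus one. The classes are hypothetical miniatures, not the Doris serde classes, and the rule that a string is quoted only when its level is greater than 1 is an assumption inferred from the example outputs (`abc` at top level, `["abc", ...]` inside an array).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical base class holding the nesting level; illustrative only.
class Serde {
public:
    explicit Serde(int nesting_level) : _nesting_level(nesting_level) {}
    virtual ~Serde() = default;
    int nesting_level() const { return _nesting_level; }

protected:
    int _nesting_level;
};

class StringSerde : public Serde {
public:
    using Serde::Serde;
    // Assumption: quote the value only when nested inside a complex type.
    std::string to_string(const std::string& v) const {
        return _nesting_level > 1 ? "\"" + v + "\"" : v;
    }
};

class ArraySerde : public Serde {
public:
    // The element serde sits one level deeper than the array itself.
    explicit ArraySerde(int nesting_level = 1)
        : Serde(nesting_level), _element(nesting_level + 1) {}

    std::string to_string(const std::vector<std::string>& values) const {
        std::string out = "[";
        for (std::size_t i = 0; i < values.size(); ++i) {
            if (i > 0) out += ", ";
            out += _element.to_string(values[i]);
        }
        return out + "]";
    }

    const StringSerde& element() const { return _element; }

private:
    StringSerde _element;
};
```

Because each serde already knows its own depth, no second counter has to be threaded through the write calls.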
Release note
None