[feature](schema change) unified schema change for parquet and orc reader #32873
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
clang-tidy made some suggestions
        const char* startptr, const int buffer_size,
        PrimitiveTypeTraits<TYPE_LARGEINT>::ColumnType::value_type* value) {
    int64 cast_to_int = 0;
    bool can_cast = safe_strto64(startptr, buffer_size, &cast_to_int);
still using 64 for largeint?
Use read_int_text_impl instead.
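The concern in this thread is that LARGEINT is a 128-bit integer, so a 64-bit parser rejects or truncates values outside the int64 range. The sketch below is not Doris's `read_int_text_impl`; it is a minimal illustration, assuming the GCC/Clang `__int128` extension and a hypothetical helper named `parse_int128`, of what parsing text directly into 128 bits could look like.

```cpp
#include <cstddef>
#include <optional>

// Assumption: __int128 (a GCC/Clang extension) stands in for LARGEINT storage.
using Int128 = __int128;
using UInt128 = unsigned __int128;

// Hypothetical helper, not a Doris API: parse a decimal integer from
// [ptr, ptr + len). Returns std::nullopt on empty input, on any non-digit
// character, or when the value does not fit in 128 signed bits.
std::optional<Int128> parse_int128(const char* ptr, std::size_t len) {
    if (len == 0) return std::nullopt;
    bool negative = false;
    std::size_t i = 0;
    if (ptr[0] == '-' || ptr[0] == '+') {
        negative = (ptr[0] == '-');
        i = 1;
        if (len == 1) return std::nullopt;  // a sign with no digits
    }
    // Accumulate the magnitude as unsigned so that |INT128_MIN| is representable.
    const UInt128 max_positive = (UInt128(1) << 127) - 1;
    const UInt128 limit = negative ? max_positive + 1 : max_positive;
    UInt128 acc = 0;
    for (; i < len; ++i) {
        if (ptr[i] < '0' || ptr[i] > '9') return std::nullopt;
        const UInt128 digit = static_cast<UInt128>(ptr[i] - '0');
        // acc * 10 + digit <= limit  <=>  acc <= (limit - digit) / 10
        if (acc > (limit - digit) / 10) return std::nullopt;
        acc = acc * 10 + digit;
    }
    // Unsigned negation wraps, so the cast yields the intended negative value.
    return negative ? static_cast<Int128>(UInt128(0) - acc) : static_cast<Int128>(acc);
}
```

With a helper of this shape, the text `9223372036854775808` (one past the int64 maximum) parses successfully, whereas a 64-bit parser would have to report failure.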
}

String DataTypeStruct::get_name_by_position(size_t i) const {
    if (i == 0 || i > names.size()) {
Why remove this check?
This function is wrong, and there's no check like Block::get_by_position
PrimitiveType src_type = OrcReader::convert_to_doris_type(type).type;
if (src_type != primitive_type) {
    if (!(is_string_type(src_type) && is_string_type(primitive_type))) {
why?
Altering column type from string to varchar can still use push-down predicates.
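The rule described in this reply can be sketched as a small predicate. The enum and both functions below are simplified stand-ins written for illustration (they reuse the names `PrimitiveType` and `is_string_type` from the snippet above but are not the Doris definitions): push-down stays enabled when the file type equals the table type, or when both are string-like.

```cpp
// Simplified stand-in for the real type enum; only a few values are listed.
enum class PrimitiveType { INT, BIGINT, CHAR, VARCHAR, STRING };

// Stand-in helper: true for the string-like types in this sketch.
bool is_string_type(PrimitiveType t) {
    return t == PrimitiveType::CHAR || t == PrimitiveType::VARCHAR ||
           t == PrimitiveType::STRING;
}

// Hypothetical predicate: may a filter on the table column be pushed down
// to a file column stored with type `file_type`?
bool can_push_down_predicate(PrimitiveType file_type, PrimitiveType table_type) {
    if (file_type == table_type) return true;
    // A string => varchar change keeps the same value representation,
    // so the predicate can still be evaluated against the file data.
    return is_string_type(file_type) && is_string_type(table_type);
}
```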
run buildall
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
morningman left a comment
LGTM
kaka11chen left a comment
LGTM
…ader (#32873)
Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well.
Unified schema change interface for all format readers:
- First, read the data according to the column type of the file into source column;
- Second, convert source column to the destination column with type planned by FE.
…pache#33546) Follow-up to apache#32873: CastStringConverter fails to compile with g++ because of an uninitialized value, although it compiles with clang:
…ader (#32873) (#38408) ## Proposed changes bp #32873 Scenario: Reading a hive table after adding fields to a struct column Since there are still problems with reading tables in parquet and text formats on the master in this scenario, only tables in orc format are picked here and some cases are added.
…ma change. (#47471)
### What problem does this PR solve?
Related PR: #32873
Problem Summary: Explicitly defines the behavior of column type conversions.
<img width="935" alt="image" src="https://github.com/user-attachments/assets/1e5afcf6-fbcf-4c36-b44e-82843feacb05" />
Special notes are as follows:
- `String => boolean`: In Parquet, only "false", "off", "no", "0", and an empty string ("") are considered false; otherwise, it is true. In Orc, a string can be parsed as a number, and if that number is 0, it is considered false; otherwise, it is true. If parsing the number fails, it results in null.
- Conversion between `Int/smallint/tinyint/bigint`: Unless the conversion can be perfectly represented, an error will be reported. For example, Bigint column => smallint column: Reason: [INTERNAL_ERROR] Failed to cast value '9223372036854775807' to Nullable(Int16) column.
- Conversion between `Decimal`: Unless the conversion can be perfectly done, an error will be reported.
- `String => Int/smallint/tinyint/bigint`: It can be successfully converted to a number, and the number can be correctly stored. Otherwise, the result is null.
- `Int/smallint/tinyint/bigint => float`: The conversion is successful only if abs(number type) < 2^23.
- `Int/smallint/tinyint/bigint => double`: The conversion is successful only if abs(number type) < 2^52.
- `Decimal => Int/smallint/tinyint/bigint`: If the integer part of the decimal can be perfectly stored, only the integer part will be shown; otherwise, it will result in null.
- `Float => double`: Refer to the C++ static_cast<double>(float).
- `Decimal => float/double`: Attempt to store the approximate value.
- `Boolean => string`: The conversion will result in “TRUE” or “FALSE”.

TODO: conversion to `char/varchar` type requires truncation.
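Two of the rules listed for #47471 are easy to illustrate in isolation: integer narrowing succeeds only when the value is exactly representable, and the Parquet `String => boolean` rule treats a fixed set of strings as false. The helpers below are hypothetical and written only for illustration; in particular, exact (case-sensitive) string matching is an assumption, and the narrowing helper merely signals failure so that a caller could report the error the notes describe.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>
#include <type_traits>

// Hypothetical helper: narrow a 64-bit value to a smaller signed integer type.
// Returns std::nullopt when the value cannot be represented exactly.
template <typename Dst>
std::optional<Dst> checked_narrow(std::int64_t value) {
    static_assert(std::is_integral_v<Dst> && std::is_signed_v<Dst>,
                  "sketch covers signed integer destinations only");
    if (value < std::numeric_limits<Dst>::min() ||
        value > std::numeric_limits<Dst>::max()) {
        return std::nullopt;  // e.g. 9223372036854775807 does not fit in Int16
    }
    return static_cast<Dst>(value);
}

// Hypothetical helper for the Parquet rule: only "false", "off", "no", "0"
// and the empty string are false. Assumption: matching is case-sensitive.
bool parquet_string_to_bool(std::string_view text) {
    return !(text.empty() || text == "false" || text == "off" ||
             text == "no" || text == "0");
}
```

Under these assumptions, `checked_narrow<std::int16_t>(9223372036854775807LL)` yields no value, which corresponds to the Bigint => smallint failure quoted in the notes.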
Proposed changes
Following #25138, this PR unifies the schema change interface for the parquet and orc readers; the same interface can be applied to other format readers as well.
Supported Type Changes
More type changes are supported:

ColumnTypeConverter
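Unified schema change interface for all format readers:
- First, read the data according to the column type of the file into source column;
- Second, convert source column to the destination column with type planned by FE.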
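The unified interface reads data with the file's own column type first and converts it to the planned type afterwards. A minimal sketch of that flow follows, assuming a file column stored as 32-bit integers and a destination column planned as string. Both function names are hypothetical, and plain `std::vector` stands in for the real column classes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Step 1 (hypothetical reader): produce the source column using the type
// recorded in the file, here 32-bit integers.
std::vector<std::int32_t> read_source_column() {
    return {1, -2, 30};
}

// Step 2 (hypothetical converter): convert the source column into the
// destination column whose type was planned by FE, here string.
std::vector<std::string> convert_to_destination(const std::vector<std::int32_t>& source) {
    std::vector<std::string> destination;
    destination.reserve(source.size());
    for (std::int32_t value : source) {
        destination.push_back(std::to_string(value));
    }
    return destination;
}
```

Keeping the two steps separate means the format reader only needs to know the file type, while the conversion logic can be shared across formats.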
PhysicalToLogicalConverter
Convert parquet physical column to logical column
In the parquet documentation (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), the logical or converted type is the data type of a column, and the physical type is the stored type of a column chunk. For example, a decimal type can be stored as INT32, INT64, BYTE_ARRAY, or FIXED_LEN_BYTE_ARRAY, so there is a conversion step from physical type to logical type. In addition, a schema change brings about a change in logical type.
In previous implementations, physical and logical conversion were mixed together, which made the code complex and bloated.
PhysicalToLogicalConverter strips away the conversion of logical type and reuses ColumnTypeConverter to resolve schema change, allowing the parquet reader to focus only on the conversion of physical types. Therefore, two layers of converters are designed: PhysicalToLogicalConverter for the physical-to-logical step and ColumnTypeConverter for the schema change step.
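The layering can be illustrated with a decimal stored as INT32, which per the Parquet specification holds the unscaled value. The sketch below is not the Doris implementation: the struct and both functions are hypothetical, and a `Decimal => double` schema change is chosen only as an example of the second layer.

```cpp
#include <cstdint>

// Hypothetical logical representation: unscaled integer plus decimal scale.
struct LogicalDecimal {
    std::int64_t unscaled;
    int scale;
};

// Layer 1 (physical => logical): an INT32 physical value is widened into the
// logical decimal; the scale comes from the column's logical type annotation.
LogicalDecimal physical_int32_to_decimal(std::int32_t physical, int scale) {
    return LogicalDecimal{physical, scale};
}

// Layer 2 (logical => destination): schema change from Decimal to double,
// storing an approximate value.
double decimal_to_double(const LogicalDecimal& decimal) {
    double value = static_cast<double>(decimal.unscaled);
    for (int i = 0; i < decimal.scale; ++i) {
        value /= 10.0;
    }
    return value;
}
```

Because the first layer knows nothing about the destination type, the same second-layer conversion could be reused by a reader for a different file format.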
Ultimate performance optimization:
Remaining Issues