Skip to content

Conversation

@AshinGau
Copy link
Member

@AshinGau AshinGau commented Mar 27, 2024

Proposed changes

Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well.

Supported Type Changes

More type changes are supported:
image

ColumnTypeConverter

Unified schema change interface for all format readers:

  • First, read the data according to the column type of the file into source column;
  • Second, convert source column to the destination column with type planned by FE

PhysicalToLogicalConverter

Convert parquet physical column to logical column
In parquet document(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md),
Logical or converted type is the data type of column, physical type is the stored type of column chunk. eg, decimal type can be stored as INT32, INT64, BYTE_ARRAY, FIXED_LENGTH_BYTE_ARRAY, so there is a convert process from physical type to logical type. In addition, schema change will bring about a change in logical type.

In previous implementations, physical and logical conversion were mixed together, resulting in severe code complexity and bloating.
PhysicalToLogicalConverter strips away the conversion of logical type, and reuse ColumnTypeConverter to resolve schema change, allowing parquet reader to only focus on the conversion of physical types.

Therefore, tow layers converters are designed:

  • First, read parquet data with the physical type
  • Second, convert physical type to logical type
  • Third, convert logical type to the final type planned by FE(schema change)

Ultimate performance optimization:

  1. If process of (First => Second) is consistent, eg. from BYTE_ARRAY to string, no additional copies and conversions will be introduced;
  2. If process of (Second => Third) is consistent, eg. from decimal(12, 4) to decimal(8, 2), no additional copies and conversions will be introduced;
  3. Null map is share among all processes, no additional copies and conversions will be introduced in null map;
  4. Only create one physical column in physical conversion, and reused in each loop;
  5. Only create one logical column in logical conversion, and reused in each loop;
  6. FIXED_LENGTH_BYTE_ARRAY is read as ColumnUInt8 instead of ColumnString, so the underlying decoder has no process to decode string and use memory copy to read the data as a whole, and the conversion has no need to resolve the Offsets in ColumnString.

Remaining Issues

  1. Not support paimon timestamp type, because it has no logical or converted type, and is conflict with column type change from bigint to timestamp. Reference paimon issue: [format]Fix orc and parquet writer about timestamp not contains [local timezone] and [is_adjust_to_utc] paimon#2739,

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

const char* startptr, const int buffer_size,
PrimitiveTypeTraits<TYPE_LARGEINT>::ColumnType::value_type* value) {
int64 cast_to_int = 0;
bool can_cast = safe_strto64(startptr, buffer_size, &cast_to_int);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still using 64 for largeint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use read_int_text_impl instead.

}

String DataTypeStruct::get_name_by_position(size_t i) const {
if (i == 0 || i > names.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is wrong, and there's no check like Block::get_by_position


PrimitiveType src_type = OrcReader::convert_to_doris_type(type).type;
if (src_type != primitive_type) {
if (!(is_string_type(src_type) && is_string_type(primitive_type))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Altering column type from string to varchar can still use push-down predicates.

@AshinGau
Copy link
Member Author

AshinGau commented Apr 8, 2024

run buildall

@AshinGau
Copy link
Member Author

AshinGau commented Apr 8, 2024

run buildall

morningman
morningman previously approved these changes Apr 8, 2024
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 8, 2024
@github-actions
Copy link
Contributor

github-actions bot commented Apr 8, 2024

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 8, 2024

PR approved by anyone and no changes requested.

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Apr 9, 2024
@AshinGau
Copy link
Member Author

AshinGau commented Apr 9, 2024

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.59% (8907/25028)
Line Coverage: 27.32% (73075/267483)
Region Coverage: 26.44% (37788/142898)
Branch Coverage: 23.21% (19256/82982)
Coverage Report: http://coverage.selectdb-in.cc/coverage/407efed4c318485f1b7a140a52be2f152f7fb32c_407efed4c318485f1b7a140a52be2f152f7fb32c/report/index.html

morningman
morningman previously approved these changes Apr 9, 2024
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

github-actions bot commented Apr 9, 2024

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 9, 2024
@morningman
Copy link
Contributor

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.63% (8911/25011)
Line Coverage: 27.33% (73148/267607)
Region Coverage: 26.46% (37828/142941)
Branch Coverage: 23.21% (19276/83038)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6a3e4cb0b8088af4aee75de27968a8a6156b4a16_6a3e4cb0b8088af4aee75de27968a8a6156b4a16/report/index.html

@AshinGau
Copy link
Member Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.62% (8910/25011)
Line Coverage: 27.33% (73134/267607)
Region Coverage: 26.45% (37811/142941)
Branch Coverage: 23.21% (19269/83038)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6c5b607f09b8e259b1650ee56aafb444f0360541_6c5b607f09b8e259b1650ee56aafb444f0360541/report/index.html

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 10, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

@kaka11chen kaka11chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AshinGau AshinGau merged commit 165b0a2 into apache:master Apr 10, 2024
yiguolei pushed a commit that referenced this pull request Apr 12, 2024
…ader (#32873)

Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well.
Unified schema change interface for all format readers:
- First, read the data according to the column type of the file into source column;
- Second, convert source column to the destination column with type planned by FE.
morningman pushed a commit that referenced this pull request Apr 12, 2024
…33546)

follow #32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
yiguolei pushed a commit that referenced this pull request Apr 12, 2024
…33546)

follow #32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
yiguolei pushed a commit that referenced this pull request Apr 13, 2024
…33546)

follow #32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
yiguolei pushed a commit that referenced this pull request Apr 13, 2024
…33546)

follow #32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
seawinde pushed a commit to seawinde/doris that referenced this pull request Apr 15, 2024
…pache#33546)

follow apache#32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
seawinde pushed a commit to seawinde/doris that referenced this pull request Apr 17, 2024
…pache#33546)

follow apache#32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
yiguolei pushed a commit that referenced this pull request Apr 17, 2024
…33546)

follow #32873, CastStringConverter is compiled failed in g++ for uninitialized value, which is ok in clang:
morningman pushed a commit that referenced this pull request Jul 30, 2024
…ader (#32873) (#38408)

## Proposed changes
bp #32873 

Scenario: Reading a hive table after adding fields to a struct column
Since there are still problems with reading tables in parquet and text
formats on the master in this scenario, only tables in orc format are
picked here and some cases are added.
mongo360 pushed a commit to mongo360/doris that referenced this pull request Dec 11, 2024
…ader (apache#32873) (apache#38408)

## Proposed changes
bp apache#32873 

Scenario: Reading a hive table after adding fields to a struct column
Since there are still problems with reading tables in parquet and text
formats on the master in this scenario, only tables in orc format are
picked here and some cases are added.
morningman pushed a commit that referenced this pull request Feb 28, 2025
…ma change. (#47471)

### What problem does this PR solve?
Related PR: #32873 

Problem Summary:

Explicitly defines the behavior of column type conversions.

<img width="935" alt="image"
src="https://github.com/user-attachments/assets/1e5afcf6-fbcf-4c36-b44e-82843feacb05"
/>

Special notes are as follows:

`String => boolean`: In Parquet, only "false", "off", "no", "0", and an
empty string ("") are considered false; otherwise, it is true. In Orc, a
string can be parsed as a number, and if that number is 0, it is
considered false; otherwise, it is true. If parsing the number fails, it
results in null.

Conversion between `Int/smallint/tinyint/bigint`:
Unless the conversion can be perfectly represented, an error will be
reported.
For example:
Bigint column => smallint column
Reason: [INTERNAL_ERROR] Failed to cast value '9223372036854775807' to
Nullable(Int16) column.

Conversion between `Decimal`:
Unless the conversion can be perfectly done, an error will be reported.

`String => Int/smallint/tinyint/bigint`:
It can be successfully converted to a number, and the number can be
correctly stored. Otherwise, the result is null.

`Int/smallint/tinyint/bigint => float`:
The conversion is successful only if abs(number type) < 2^23.

`Int/smallint/tinyint/bigint => double`:
The conversion is successful only if abs(number type) < 2^52.

`Decimal => Int/smallint/tinyint/bigint`:
If the integer part of the decimal can be perfectly stored, only the
integer part will be shown; otherwise, it will result in null.

`Float => double`:
Refer to the C++ static_cast<double>(float).

`Decimal => float/double`:
Attempt to store the approximate value.

`Boolean => string`:
The conversion will result in “TRUE” or “FALSE”.

TODO:  conversion to `char/varchar` type requires truncation.
seawinde pushed a commit to seawinde/doris that referenced this pull request Feb 28, 2025
…ma change. (apache#47471)

### What problem does this PR solve?
Related PR: apache#32873 

Problem Summary:

Explicitly defines the behavior of column type conversions.

<img width="935" alt="image"
src="https://github.com/user-attachments/assets/1e5afcf6-fbcf-4c36-b44e-82843feacb05"
/>

Special notes are as follows:

`String => boolean`: In Parquet, only "false", "off", "no", "0", and an
empty string ("") are considered false; otherwise, it is true. In Orc, a
string can be parsed as a number, and if that number is 0, it is
considered false; otherwise, it is true. If parsing the number fails, it
results in null.

Conversion between `Int/smallint/tinyint/bigint`:
Unless the conversion can be perfectly represented, an error will be
reported.
For example:
Bigint column => smallint column
Reason: [INTERNAL_ERROR] Failed to cast value '9223372036854775807' to
Nullable(Int16) column.

Conversion between `Decimal`:
Unless the conversion can be perfectly done, an error will be reported.

`String => Int/smallint/tinyint/bigint`:
It can be successfully converted to a number, and the number can be
correctly stored. Otherwise, the result is null.

`Int/smallint/tinyint/bigint => float`:
The conversion is successful only if abs(number type) < 2^23.

`Int/smallint/tinyint/bigint => double`:
The conversion is successful only if abs(number type) < 2^52.

`Decimal => Int/smallint/tinyint/bigint`:
If the integer part of the decimal can be perfectly stored, only the
integer part will be shown; otherwise, it will result in null.

`Float => double`:
Refer to the C++ static_cast<double>(float).

`Decimal => float/double`:
Attempt to store the approximate value.

`Boolean => string`:
The conversion will result in “TRUE” or “FALSE”.

TODO:  conversion to `char/varchar` type requires truncation.
mymeiyi pushed a commit to mymeiyi/doris that referenced this pull request Mar 4, 2025
…ma change. (apache#47471)

### What problem does this PR solve?
Related PR: apache#32873 

Problem Summary:

Explicitly defines the behavior of column type conversions.

<img width="935" alt="image"
src="https://github.com/user-attachments/assets/1e5afcf6-fbcf-4c36-b44e-82843feacb05"
/>

Special notes are as follows:

`String => boolean`: In Parquet, only "false", "off", "no", "0", and an
empty string ("") are considered false; otherwise, it is true. In Orc, a
string can be parsed as a number, and if that number is 0, it is
considered false; otherwise, it is true. If parsing the number fails, it
results in null.

Conversion between `Int/smallint/tinyint/bigint`:
Unless the conversion can be perfectly represented, an error will be
reported.
For example:
Bigint column => smallint column
Reason: [INTERNAL_ERROR] Failed to cast value '9223372036854775807' to
Nullable(Int16) column.

Conversion between `Decimal`:
Unless the conversion can be perfectly done, an error will be reported.

`String => Int/smallint/tinyint/bigint`:
It can be successfully converted to a number, and the number can be
correctly stored. Otherwise, the result is null.

`Int/smallint/tinyint/bigint => float`:
The conversion is successful only if abs(number type) < 2^23.

`Int/smallint/tinyint/bigint => double`:
The conversion is successful only if abs(number type) < 2^52.

`Decimal => Int/smallint/tinyint/bigint`:
If the integer part of the decimal can be perfectly stored, only the
integer part will be shown; otherwise, it will result in null.

`Float => double`:
Refer to the C++ static_cast<double>(float).

`Decimal => float/double`:
Attempt to store the approximate value.

`Boolean => string`:
The conversion will result in “TRUE” or “FALSE”.

TODO:  conversion to `char/varchar` type requires truncation.
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…ma change. (apache#47471)

### What problem does this PR solve?
Related PR: apache#32873 

Problem Summary:

Explicitly defines the behavior of column type conversions.

<img width="935" alt="image"
src="https://github.com/user-attachments/assets/1e5afcf6-fbcf-4c36-b44e-82843feacb05"
/>

Special notes are as follows:

`String => boolean`: In Parquet, only "false", "off", "no", "0", and an
empty string ("") are considered false; otherwise, it is true. In Orc, a
string can be parsed as a number, and if that number is 0, it is
considered false; otherwise, it is true. If parsing the number fails, it
results in null.

Conversion between `Int/smallint/tinyint/bigint`:
Unless the conversion can be perfectly represented, an error will be
reported.
For example:
Bigint column => smallint column
Reason: [INTERNAL_ERROR] Failed to cast value '9223372036854775807' to
Nullable(Int16) column.

Conversion between `Decimal`:
Unless the conversion can be perfectly done, an error will be reported.

`String => Int/smallint/tinyint/bigint`:
It can be successfully converted to a number, and the number can be
correctly stored. Otherwise, the result is null.

`Int/smallint/tinyint/bigint => float`:
The conversion is successful only if abs(number type) < 2^23.

`Int/smallint/tinyint/bigint => double`:
The conversion is successful only if abs(number type) < 2^52.

`Decimal => Int/smallint/tinyint/bigint`:
If the integer part of the decimal can be perfectly stored, only the
integer part will be shown; otherwise, it will result in null.

`Float => double`:
Refer to the C++ static_cast<double>(float).

`Decimal => float/double`:
Attempt to store the approximate value.

`Boolean => string`:
The conversion will result in “TRUE” or “FALSE”.

TODO:  conversion to `char/varchar` type requires truncation.
hubgeter added a commit to hubgeter/doris that referenced this pull request Jun 23, 2025
…ma change. (apache#47471)

Related PR: apache#32873

Problem Summary:

Explicitly defines the behavior of column type conversions.

<img width="935" alt="image"
src="https://github.com/user-attachments/assets/1e5afcf6-fbcf-4c36-b44e-82843feacb05"
/>

Special notes are as follows:

`String => boolean`: In Parquet, only "false", "off", "no", "0", and an
empty string ("") are considered false; otherwise, it is true. In Orc, a
string can be parsed as a number, and if that number is 0, it is
considered false; otherwise, it is true. If parsing the number fails, it
results in null.

Conversion between `Int/smallint/tinyint/bigint`:
Unless the conversion can be perfectly represented, an error will be
reported.
For example:
Bigint column => smallint column
Reason: [INTERNAL_ERROR] Failed to cast value '9223372036854775807' to
Nullable(Int16) column.

Conversion between `Decimal`:
Unless the conversion can be perfectly done, an error will be reported.

`String => Int/smallint/tinyint/bigint`:
It can be successfully converted to a number, and the number can be
correctly stored. Otherwise, the result is null.

`Int/smallint/tinyint/bigint => float`:
The conversion is successful only if abs(number type) < 2^23.

`Int/smallint/tinyint/bigint => double`:
The conversion is successful only if abs(number type) < 2^52.

`Decimal => Int/smallint/tinyint/bigint`:
If the integer part of the decimal can be perfectly stored, only the
integer part will be shown; otherwise, it will result in null.

`Float => double`:
Refer to the C++ static_cast<double>(float).

`Decimal => float/double`:
Attempt to store the approximate value.

`Boolean => string`:
The conversion will result in “TRUE” or “FALSE”.

TODO:  conversion to `char/varchar` type requires truncation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.14-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants