[Fix](ORC) Not push down fixed char type in orc reader #45484
The code of the current 2.1 branch also has this problem. I will submit a separate PR to fix it.
PR approved by at least one committer and no changes requested.
### What problem does this PR solve?

Problem Summary:

In Hive, the ORC file format supports fixed-length CHAR types (CHAR(n)) by padding strings with spaces up to the fixed length. When data is written into ORC tables, the stored value includes the trailing spaces needed to reach the declared length. These padded spaces are also taken into account when the file statistics are computed.

However, in Doris, fixed-length CHAR types (CHAR(n)) and variable-length VARCHAR types are internally represented as the same type. Doris does not pad CHAR values with spaces and treats them as regular strings. As a result, when Doris reads ORC files generated by Hive and parses the statistics, the two systems' different handling of CHAR types makes the statistics inconsistent with the predicate values, so a pushed-down predicate can filter out matching rows.

```sql
create table fixed_char_table (
  i int,
  c char(2)
) stored as orc;

insert into fixed_char_table values(1,'a'),(2,'b '), (3,'cd');

select * from fixed_char_table where c = 'a';
```

before

```text
empty
```

after

```text
1 a
```

If a Hive table undergoes a schema change, such as a column's type being modified from INT to STRING, predicate pushdown should be disabled for that column. Pushing the predicate down in this situation may filter incorrectly, because the type recorded in the file no longer matches the type in the table schema.

```sql
create table type_changed_table (
  id int,
  name string
) stored as orc;

insert into type_changed_table values (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');

ALTER TABLE type_changed_table CHANGE COLUMN id id STRING;

select * from type_changed_table where id = '1';
```

before

```text
empty
```

after

```text
1 Alice
```

### Release note

[fix](orc) Not push down fixed char type in orc reader #45484
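To make the two cases in the problem summary concrete, here is a minimal Python sketch of min/max pruning on a single stripe. It is not the Doris ORC reader code: `StripeStats`, `hive_pad`, `stripe_may_match_eq`, `can_push_down`, and `scan_eq` are hypothetical names, and the type check only assumes that pushdown is skipped for CHAR columns and for columns whose file type differs from the table type.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Min/max string statistics of one column in one stripe (hypothetical)."""
    minimum: str
    maximum: str


def hive_pad(value: str, n: int) -> str:
    """Pad a value with trailing spaces to length n, as Hive does for CHAR(n)."""
    return value.ljust(n)[:n]


def stripe_may_match_eq(stats: StripeStats, literal: str) -> bool:
    """An equality predicate can only match if the literal lies within [min, max]."""
    return stats.minimum <= literal <= stats.maximum


def can_push_down(file_type: str, table_type: str) -> bool:
    """Assumed guard: skip pushdown for CHAR columns and for changed column types."""
    if file_type.startswith("char"):
        return False
    return file_type == table_type


def scan_eq(rows, stats, literal, push_down):
    """Return (i, c) rows with c == literal, optionally pruning the stripe first."""
    if push_down and not stripe_may_match_eq(stats, literal):
        return []  # whole stripe skipped based on statistics alone
    # The reader compares unpadded values, so trailing spaces are stripped here.
    return [(i, c.rstrip(" ")) for i, c in rows if c.rstrip(" ") == literal]


# Values as stored by Hive for c char(2): 'a' becomes 'a '.
rows = [(1, hive_pad("a", 2)), (2, hive_pad("b ", 2)), (3, hive_pad("cd", 2))]
stats = StripeStats(min(c for _, c in rows), max(c for _, c in rows))

# Before: the unpadded literal 'a' sorts below the padded minimum 'a ', so the stripe is pruned.
before = scan_eq(rows, stats, "a", push_down=True)

# After: pushdown is skipped for the CHAR column, so the row is read and matched.
after = scan_eq(rows, stats, "a", push_down=can_push_down("char(2)", "char(2)"))
```

In this sketch `before` is empty and `after` contains the row with `i = 1`, mirroring the first example. The same guard returns `False` for `can_push_down("int", "string")`, which corresponds to the INT to STRING schema change case.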
…eader apache#45484 (apache#45776)" This reverts commit d94ff8f.
revert:
branch-3.0: [fix](orc) ignore null values when the literals of in_predicate contains #45104 (#45586)
[fix](orc) check all the cases before build_search_argument (#44615) (#44802)
branch-3.0: [enhance](orc) Optimize ORC Predicate Pushdown for OR-connected Predicate #43255 (#44436)

re-pick:
branch-3.0: [Fix](ORC) Not push down fixed char type in orc reader #45484 (#45525)

---------

Co-authored-by: Socrates <suyiteng@selectdb.com>