Skip to content

Conversation

@suxiaogang223
Copy link
Contributor

@suxiaogang223 suxiaogang223 commented Dec 16, 2024

What problem does this PR solve?

Problem Summary:
In Hive, the ORC file format supports fixed-length CHAR types (CHAR(n)) by padding strings with spaces to ensure the fixed length. When data is written into ORC tables, the actual stored value includes additional trailing spaces to meet the defined length. These padded spaces are also considered during the computation of statistics.

However, in Doris, fixed-length CHAR types (CHAR(n)) and variable-length VARCHAR types are internally represented as the same type. Doris does not pad CHAR values with spaces and treats them as regular strings. As a result, when Doris reads ORC files generated by Hive and parses the statistics, the differences in the handling of CHAR types between the two systems can lead to inconsistencies or incorrect statistics.

create table fixed_char_table (
  i int,
  c char(2)
) stored as orc;

insert into fixed_char_table values(1,'a'),(2,'b '), (3,'cd');
select * from fixed_char_table where c = 'a';

before

empty

after

1	a

If a Hive table undergoes a schema change, such as a column’s type being modified from INT to STRING, predicate pushdown should be disabled in such cases. Performing predicate pushdown under these circumstances may lead to incorrect filtering, as the type mismatch can cause errors or unexpected behavior during query execution.

create table type_changed_table (
  id int,
  name string 
) stored as orc;
insert into type_changed_table values (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
ALTER TABLE type_changed_table CHANGE COLUMN id id STRING;
select * from type_changed_table where id = '1';
select

before

empty

after

1	a

Release note

  1. Don't push down the fixed char type to orc reader
  2. If the column type is schema changed to string, it will not be pushed down.

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 marked this pull request as draft December 16, 2024 11:09
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@suxiaogang223 suxiaogang223 marked this pull request as ready for review December 16, 2024 12:01
@suxiaogang223
Copy link
Contributor Author

run buildall

@suxiaogang223
Copy link
Contributor Author

The code of the current 2.1 branch also has this problem. I will submit a separate PR to fix it.

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@suxiaogang223
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.86% (10129/26064)
Line Coverage: 29.79% (85171/285917)
Region Coverage: 28.83% (43678/151496)
Branch Coverage: 25.39% (22208/87456)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6f4ea1571e04b6eddcec584cd7bcbc5a46ce4e19_6f4ea1571e04b6eddcec584cd7bcbc5a46ce4e19/report/index.html

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@suxiaogang223
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.87% (10131/26066)
Line Coverage: 29.78% (85166/285943)
Region Coverage: 28.83% (43675/151510)
Branch Coverage: 25.38% (22196/87466)
Coverage Report: http://coverage.selectdb-in.cc/coverage/76cc07ad17b2041fc1687552ffe6318228a74ac8_76cc07ad17b2041fc1687552ffe6318228a74ac8/report/index.html

morningman
morningman previously approved these changes Dec 18, 2024
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Dec 18, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@suxiaogang223 suxiaogang223 marked this pull request as ready for review December 18, 2024 09:33
@suxiaogang223
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.88% (10135/26066)
Line Coverage: 29.79% (85196/285965)
Region Coverage: 28.85% (43721/151521)
Branch Coverage: 25.40% (22220/87480)
Coverage Report: http://coverage.selectdb-in.cc/coverage/7955b9af7ff08c160c7202ad8365b6585279e7fb_7955b9af7ff08c160c7202ad8365b6585279e7fb/report/index.html

@suxiaogang223
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.88% (10135/26066)
Line Coverage: 29.80% (85225/285965)
Region Coverage: 28.85% (43708/151521)
Branch Coverage: 25.41% (22229/87480)
Coverage Report: http://coverage.selectdb-in.cc/coverage/f9ed3054939ffe7ba2c85c0cecfb148ef7a826c4_f9ed3054939ffe7ba2c85c0cecfb148ef7a826c4/report/index.html

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Dec 20, 2024
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Dec 23, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@morningman morningman merged commit f01f759 into apache:master Dec 23, 2024
26 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 23, 2024
### What problem does this PR solve?
Problem Summary:
In Hive, the ORC file format supports fixed-length CHAR types (CHAR(n))
by padding strings with spaces to ensure the fixed length. When data is
written into ORC tables, the actual stored value includes additional
trailing spaces to meet the defined length. These padded spaces are also
considered during the computation of statistics.

However, in Doris, fixed-length CHAR types (CHAR(n)) and variable-length
VARCHAR types are internally represented as the same type. Doris does
not pad CHAR values with spaces and treats them as regular strings. As a
result, when Doris reads ORC files generated by Hive and parses the
statistics, the differences in the handling of CHAR types between the
two systems can lead to inconsistencies or incorrect statistics.
```sql
create table fixed_char_table (
  i int,
  c char(2)
) stored as orc;

insert into fixed_char_table values(1,'a'),(2,'b '), (3,'cd');
select * from fixed_char_table where c = 'a';
```
before
```text
empty
```
after
```text
1	a
```

If a Hive table undergoes a schema change, such as a column’s type being
modified from INT to STRING, predicate pushdown should be disabled in
such cases. Performing predicate pushdown under these circumstances may
lead to incorrect filtering, as the type mismatch can cause errors or
unexpected behavior during query execution.
```sql
create table type_changed_table (
  id int,
  name string 
) stored as orc;
insert into type_changed_table values (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
ALTER TABLE type_changed_table CHANGE COLUMN id id STRING;
select * from type_changed_table where id = '1';
select
```
before
```text
empty
```
after
```text
1	a
```
### Release note
[fix](orc) Not push down fixed char type in orc reader #45484
morningman pushed a commit that referenced this pull request Dec 25, 2024
…5484 (#45776)

Cherry-picked from #45484

Co-authored-by: Socrates <suyiteng@selectdb.com>
@yiguolei yiguolei mentioned this pull request Jan 19, 2025
morningman added a commit to morningman/doris that referenced this pull request Feb 8, 2025
morningman added a commit that referenced this pull request Feb 9, 2025
revert:
branch-3.0: [fix](orc) ignore null values when the literals of
in_predicate contains #45104 (#45586)
[fix](orc) check all the cases before build_search_argument (#44615)
(#44802)
branch-3.0: [enhance](orc) Optimize ORC Predicate Pushdown for
OR-connected Predicate #43255 (#44436)

re-pick:
branch-3.0: [Fix](ORC) Not push down fixed char type in orc reader
#45484 (#45525)

---------

Co-authored-by: Socrates <suyiteng@selectdb.com>
morningman added a commit that referenced this pull request Feb 17, 2025
@gavinchou gavinchou mentioned this pull request Feb 18, 2025
@suxiaogang223 suxiaogang223 deleted the fix_orc_push_down branch March 11, 2025 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.x-lh dev/2.1.8-merged dev/3.0.4-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants