[fix](orc-reader) Fixed issue with top level struct column having present stream failing to access repeatedly when late materialization occurs. #50358
b867139 to 6099567
run buildall
TPC-H: Total hot run time: 33766 ms
TPC-DS: Total hot run time: 191304 ms
ClickBench: Total hot run time: 29.89 s
BE UT Coverage Report: Increment line coverage; Increment coverage report
BE Regression P0 && UT Coverage Report: Increment line coverage; Increment coverage report
6099567 to 8a3f306
run buildall
run buildall
8a3f306 to 327c7df
run buildall
TPC-H: Total hot run time: 33969 ms
TPC-DS: Total hot run time: 192460 ms
ClickBench: Total hot run time: 29.68 s
BE UT Coverage Report: Increment line coverage; Increment coverage report
BE Regression P0 && UT Coverage Report: Increment line coverage; Increment coverage report
327c7df to f4cb1d4
run buildall
TPC-H: Total hot run time: 34261 ms
TPC-DS: Total hot run time: 186247 ms
ClickBench: Total hot run time: 30.03 s
BE UT Coverage Report: Increment line coverage; Increment coverage report
BE Regression P0 && UT Coverage Report: Increment line coverage; Increment coverage report
f4cb1d4 to 05e5511
run buildall
TPC-H: Total hot run time: 34301 ms
TPC-DS: Total hot run time: 192765 ms
TPC-H: Total hot run time: 33706 ms
TPC-DS: Total hot run time: 192638 ms
ClickBench: Total hot run time: 29.83 s
BE UT Coverage Report: Increment line coverage; Increment coverage report
BE Regression P0 && UT Coverage Report: Increment line coverage; Increment coverage report
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…sent stream failing to access repeatedly when late materialization occurs. (#50651)
Cherry-pick #50358
…sent stream failing to access repeatedly when late materialization occurs. (apache#50358)
What problem does this PR solve?
Related PR: apache/doris-thirdparty#309 apache/doris-thirdparty#310
Problem Summary:
When an ORC file is written with an older version of pyorc (e.g., pyorc-0.3.0) and the data contains null values, a present stream is generated for the top-level struct column.
This does not happen with newer versions of pyorc (e.g., pyorc-0.10.0) or with ORC files generated by tools such as Hive or Spark.
As a result, the present stream generated by the older version is read twice during late materialization, and the second read fails with the error 'bad read in next buffer'. The current solution is to skip reading the present stream when it belongs to the top-level struct column.
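The failure mode and the workaround can be illustrated with a small, self-contained Python sketch. This is a toy model and not the actual Doris or ORC reader code: `OneShotStream`, `read_root_struct`, `late_materialization`, and the `skip_top_level_present` flag are hypothetical names introduced here, and the sketch assumes that late materialization makes two passes over the same top-level struct column and that the present stream can only be consumed once.

```python
class OneShotStream:
    """Hypothetical stand-in for a stream that can be consumed only once."""

    def __init__(self, payload):
        self._payload = payload
        self._consumed = False

    def read(self):
        # A second read models the failure described in the problem summary.
        if self._consumed:
            raise IOError("bad read in next buffer")
        self._consumed = True
        return self._payload


def read_root_struct(streams, skip_top_level_present):
    """One pass over the top-level struct column (toy model)."""
    present = streams.get("present")
    # Workaround: do not touch the present stream of the top-level struct.
    if present is not None and not skip_top_level_present:
        present.read()
    return "ok"


def late_materialization(streams, skip_top_level_present):
    """Assumed two-pass read: predicate columns first, remaining columns second."""
    first = read_root_struct(streams, skip_top_level_present)
    second = read_root_struct(streams, skip_top_level_present)
    return [first, second]
```

In this model, calling `late_materialization` with `skip_top_level_present=False` on a file that carries a top-level present stream raises on the second pass, while `skip_top_level_present=True` lets both passes complete. Files without such a stream are unaffected either way.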
Release note
Fixed an issue where repeated access to the present stream within a top-level struct column would fail during late materialization. This was addressed by avoiding the unnecessary reading of the present stream when it is part of the top-level struct column.