@kaka11chen kaka11chen commented Apr 23, 2025

With an older version of pyorc (e.g., pyorc-0.3.0), if the data contains null values, a present stream is generated for the top-level struct column.
Newer versions of pyorc (e.g., pyorc-0.10.0) and tools such as Hive or Spark do not generate this stream.
Because of this extra present stream, the ORC file is read twice during late materialization, and the second read fails with the error 'bad read in next buffer'. The current fix is to skip reading the present stream when it belongs to the top-level struct column.
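
The rule described above can be illustrated with a minimal Python sketch. This is a toy model, not the code in this patch: `StreamInfo` and `streams_to_read` are hypothetical names, and the only fact taken from the ORC format is that the top-level struct is column 0.

```python
from dataclasses import dataclass

ROOT_COLUMN_ID = 0  # ORC numbers the top-level struct as column 0


@dataclass(frozen=True)
class StreamInfo:
    """Hypothetical stand-in for one stream entry in an ORC stripe footer."""
    column_id: int
    kind: str  # e.g. "PRESENT", "DATA", "LENGTH"


def streams_to_read(streams):
    """Return the streams a reader should open, skipping the root PRESENT stream."""
    return [
        s for s in streams
        if not (s.kind == "PRESENT" and s.column_id == ROOT_COLUMN_ID)
    ]


# Stripe as an old pyorc writer might lay it out (assumed layout): the root
# struct (column 0) carries a PRESENT stream next to the child columns' streams.
stripe = [
    StreamInfo(0, "PRESENT"),
    StreamInfo(1, "PRESENT"),
    StreamInfo(1, "DATA"),
    StreamInfo(2, "DATA"),
]
kept = streams_to_read(stripe)
```

In this sketch only the root column's present stream is dropped; present streams of child columns are still read, so their null information is preserved.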

…failing to access repeatedly when deferred materialization occurs.
@kaka11chen kaka11chen changed the title [Fix] Fixed issue with top level struct column having present stream failing to access repeatedly when deferred materialization occurs. [Fix] Fixed issue with top level struct column having present stream failing to access repeatedly when late materialization occurs. Apr 23, 2025
@morningman morningman merged commit 4512393 into apache:orc Apr 23, 2025
kaka11chen added a commit to kaka11chen/doris-thirdparty that referenced this pull request Apr 29, 2025
…failing to access repeatedly when deferred materialization occurs. (apache#309)
morningman pushed a commit to apache/doris that referenced this pull request May 7, 2025
…sent stream failing to access repeatedly when late materialization occurs. (#50358)

### What problem does this PR solve?

Related PR: apache/doris-thirdparty#309
apache/doris-thirdparty#310

Problem Summary:
With an older version of pyorc (e.g., pyorc-0.3.0), if the data contains
null values, a present stream is generated for the top-level struct
column.
Newer versions of pyorc (e.g., pyorc-0.10.0) and tools such as Hive or
Spark do not generate this stream.
Because of this extra present stream, the ORC file is read twice during
late materialization, and the second read fails with the error 'bad read
in next buffer'. The current fix is to skip reading the present stream
when it belongs to the top-level struct column.
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…sent stream failing to access repeatedly when late materialization occurs. (apache#50358)