branch-3.1:[feature](external) Support reading Hudi/Paimon/Iceberg tables after schema changes. (#51341) #53170
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall

Cloud UT Coverage Report: increment line coverage (increment coverage report)

TPC-H: Total hot run time: 40014 ms
TPC-DS: Total hot run time: 196469 ms
ClickBench: Total hot run time: 31.63 s

run buildall

Cloud UT Coverage Report: increment line coverage (increment coverage report)

BE UT Coverage Report: increment line coverage (increment coverage report)

TPC-H: Total hot run time: 39991 ms
TPC-DS: Total hot run time: 197099 ms
ClickBench: Total hot run time: 31.82 s
…schema changes. (apache#51341)

Related PR: apache#49051

Problem Summary: Support reading Hudi, Paimon, and Iceberg tables after the internal schema of a struct has changed.

1. Introduce `hive_reader` to avoid confusion between the `hive` reader and the `parquet/orc` readers.
2. Previously, support for reading tables after schema changes to ordinary columns relied on renaming columns in the block, so that the parquet/orc reader could locate the corresponding file columns in `get_next_block`; as a result, the hudi/iceberg/paimon readers mixed file column names with table column names when using the parquet/orc reader. This PR clarifies that all calls to the `parquet/orc reader` operate on table column names, and introduces `TableSchemaChangeHelper::Node` to help the `parquet/orc reader` find the specific file columns to read.
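The table-name-to-file-name resolution described above can be sketched as a small tree of mapping nodes. This is a minimal illustration, not Doris' actual `TableSchemaChangeHelper::Node` implementation; the names `SchemaChangeNode`, `resolve`, and `make_example_tree` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch: each node records the column name as written in the
// data file; children map *table* field names of a struct to child nodes,
// so renamed top-level columns and renamed struct fields both resolve.
struct SchemaChangeNode {
    std::string file_column_name;
    std::map<std::string, std::shared_ptr<SchemaChangeNode>> children;

    // Resolve a dotted table column path (e.g. "addr.city") to the
    // corresponding file column path; returns "" when unknown.
    std::string resolve(const std::string& table_path) const {
        auto dot = table_path.find('.');
        if (dot == std::string::npos) {
            auto it = children.find(table_path);
            return it == children.end() ? "" : it->second->file_column_name;
        }
        auto it = children.find(table_path.substr(0, dot));
        if (it == children.end()) return "";
        std::string rest = it->second->resolve(table_path.substr(dot + 1));
        return rest.empty() ? "" : it->second->file_column_name + "." + rest;
    }
};

// Build a sample tree: table column "id" is stored in the file as "col0";
// table struct "addr" is file column "col2", whose field "city" was
// renamed from file field "f1".
inline std::shared_ptr<SchemaChangeNode> make_example_tree() {
    auto root = std::make_shared<SchemaChangeNode>();
    auto id = std::make_shared<SchemaChangeNode>();
    id->file_column_name = "col0";
    auto addr = std::make_shared<SchemaChangeNode>();
    addr->file_column_name = "col2";
    auto city = std::make_shared<SchemaChangeNode>();
    city->file_column_name = "f1";
    addr->children["city"] = city;
    root->children["id"] = id;
    root->children["addr"] = addr;
    return root;
}
```

With such a tree, the parquet/orc reader can be driven entirely by table column names and still read the right physical columns after renames, including nested struct fields.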
…e#52964)

Related PR: apache#51341

Problem Summary: In PR apache#51341, `hudiOrcReader` was deleted; this PR reintroduces it to read Hudi ORC tables. Although I encountered the following error when testing spark-hudi reading ORC, the ORC file was indeed generated by spark-hudi:

```
java.lang.UnsupportedOperationException: Base file format is not currently supported (ORC)
	at org.apache.hudi.HoodieBaseRelation.createBaseFileReader(HoodieBaseRelation.scala:574) ~[hudi-spark3.4-bundle_2.12-0.14.0-1.jar:0.14.0-1]
	at org.apache.hudi.BaseFileOnlyRelation.composeRDD(BaseFileOnlyRelation.scala:96) ~[hudi-spark3.4-bundle_2.12-0.14.0-1.jar:0.14.0-1]
	at org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:381) ~[hudi-spark3.4-bundle_2.12-0.14.0-1.jar:0.14.0-1]
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:329) ~[spark-sql_2.12-3.4.2.jar:0.14.0-1]
```
… to bigint. (apache#52954)

Related PR: apache#47471

Problem Summary: This PR supplements apache#47471. It adds support for reading Hive tables (parquet/orc) whose timestamp columns were converted to bigint columns, displaying the values with `ms` precision.
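The conversion above amounts to treating the stored bigint as epoch milliseconds and rendering it with a millisecond fraction. The sketch below is an illustration of that idea only, not Doris' reader code; `bigint_ms_to_timestamp` is a hypothetical helper, and it formats in UTC where a real reader would honor the session time zone.

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper: a Hive column converted from TIMESTAMP to BIGINT
// stores epoch milliseconds; split it into whole seconds plus a
// millisecond fraction to display with `ms` precision.
std::string bigint_ms_to_timestamp(int64_t ms_since_epoch) {
    time_t secs = static_cast<time_t>(ms_since_epoch / 1000);
    int millis = static_cast<int>(ms_since_epoch % 1000);
    std::tm tm_utc{};
    gmtime_r(&secs, &tm_utc);  // UTC for simplicity in this sketch
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%04d-%02d-%02d %02d:%02d:%02d.%03d",
                  tm_utc.tm_year + 1900, tm_utc.tm_mon + 1, tm_utc.tm_mday,
                  tm_utc.tm_hour, tm_utc.tm_min, tm_utc.tm_sec, millis);
    return buf;
}
```

For example, a stored value of `1500` would render as `1970-01-01 00:00:01.500` under these assumptions.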
…on version. (apache#53055)

### What problem does this PR solve?

Related PR: apache#51341

Problem Summary: In PR apache#51341, the Paimon version used in the Docker environment was upgraded from 0.8 to 1.0.1. Since the required JAR files are pulled from a Maven repository, some machines may not be able to reach that repository. To fix this, the JAR file has been uploaded to object storage, ensuring it can be reliably accessed across environments.
run buildall

Cloud UT Coverage Report: increment line coverage (increment coverage report)

BE UT Coverage Report: increment line coverage (increment coverage report)

TPC-H: Total hot run time: 40013 ms
TPC-DS: Total hot run time: 196703 ms
ClickBench: Total hot run time: 31.07 s
What problem does this PR solve?
- bp #51341: support reading Hudi/Paimon/Iceberg tables after schema changes
- bp #52964: add the Hudi ORC reader
- bp #52954: support reading timestamp columns converted to bigint
- bp #53055: fix the Paimon Docker version
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)