|
@snazy @nastra @rdblue Could you please take a look at this PR? If there is anything I can do, please let me know (cc @zzzzming95 ) |
| old: "method void org.apache.iceberg.PositionDeletesTable.PositionDeletesBatchScan::<init>(org.apache.iceberg.Table,\ | ||
| \ org.apache.iceberg.Schema, org.apache.iceberg.TableScanContext)" | ||
| justification: "Removing deprecated code" | ||
| org.apache.iceberg:iceberg-orc: |
We can't break existing APIs, but you could add an overloaded version of that method that takes an additional parameter. That way the existing API keeps working. See also https://iceberg.apache.org/contribute/#adding-new-functionality-without-breaking-apis
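A minimal sketch of the overload pattern suggested here. The class and method names (ProjectionUtil, describe) are hypothetical and are not Iceberg's real API; only the extra convertTimestampTZ parameter and its default of false are taken from the diff in this PR.

```java
// Hypothetical stand-in for a schema utility; names are illustrative only.
final class ProjectionUtil {
  private ProjectionUtil() {}

  // Existing signature: left unchanged so current callers keep compiling and linking.
  static String describe(String column, boolean isRequired) {
    // Delegate with the default (false) so the old behaviour is preserved.
    return describe(column, isRequired, false);
  }

  // New overload: the additional parameter carries the new behaviour.
  static String describe(String column, boolean isRequired, boolean convertTimestampTZ) {
    return column
        + (isRequired ? " required" : " optional")
        + (convertTimestampTZ ? " convert-timestamptz" : "");
  }

  public static void main(String[] args) {
    System.out.println(describe("ts", true));       // old call site, unchanged
    System.out.println(describe("ts", true, true)); // new call site opts in
  }
}
```

Because the old method only delegates, there is a single implementation to maintain and the API compatibility check has nothing to flag.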
|
Can you please rebase the PR and resolve the conflicts? I'll try to take a closer look this week. |
snazy left a comment
Can't say much about the general approach wrt type mapping, but left some comments.
| import java.util.Objects; | ||
| import java.util.Optional; | ||
| import java.util.stream.Collectors; | ||
| import org.apache.hadoop.conf.Configuration; |
Wonder whether the "whole Configuration spiel" is necessary here. Hadoop dependencies are already compileOnly to eventually get rid of those entirely. Wouldn't a simple boolean flag do the same thing in this class?
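A rough sketch of what "a simple boolean flag" could look like: the key is read once at the boundary and only a plain boolean travels further. OrcReadOptions and fromProperties are hypothetical names, and a java.util.Map stands in for the Hadoop Configuration; the property key is the one introduced by this PR.

```java
import java.util.Map;

// Hypothetical holder class; not part of Iceberg.
final class OrcReadOptions {
  static final String ORC_CONVERT_TIMESTAMPTZ = "iceberg.orc.convert.timestamptz";

  private final boolean convertTimestampTZ;

  OrcReadOptions(boolean convertTimestampTZ) {
    this.convertTimestampTZ = convertTimestampTZ;
  }

  // Boundary code: the only place that knows about the key/value config source.
  static OrcReadOptions fromProperties(Map<String, String> props) {
    String raw = props.getOrDefault(ORC_CONVERT_TIMESTAMPTZ, "false");
    return new OrcReadOptions(Boolean.parseBoolean(raw));
  }

  // Everything downstream sees only this boolean.
  boolean convertTimestampTZ() {
    return convertTimestampTZ;
  }

  public static void main(String[] args) {
    OrcReadOptions defaults = fromProperties(Map.of());
    OrcReadOptions optedIn = fromProperties(Map.of(ORC_CONVERT_TIMESTAMPTZ, "true"));
    System.out.println(defaults.convertTimestampTZ());
    System.out.println(optedIn.convertTimestampTZ());
  }
}
```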
| config.getBoolean(ConfigProperties.ORC_CONVERT_TIMESTAMPTZ, false); | ||
|
|
||
| if (convertTimestampTZ | ||
| && type.typeId() == Type.TypeID.TIMESTAMP |
Guess this deserves a new case TIMESTAMP, not an if here.
Some of the processing logic here involves more than just the timestamp type, so an if may be more appropriate.
Why is this logic not part of the getPromotedType method?
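For illustration, one way the check could sit next to the other promotion rules as its own case. This is a simplified model, not the PR's code: OrcKind and Promotion.promotedType are hypothetical, and the real getPromotedType works on Iceberg and ORC type objects rather than an enum.

```java
import java.util.Optional;

// Simplified, hypothetical model of ORC primitive kinds.
enum OrcKind { INT, LONG, FLOAT, DOUBLE, TIMESTAMP, TIMESTAMP_INSTANT }

final class Promotion {
  private Promotion() {}

  // Returns the type to read as, or empty when the file type cannot be read as the expected type.
  static Optional<OrcKind> promotedType(
      OrcKind fileType, OrcKind expected, boolean convertTimestampTZ) {
    if (fileType == expected) {
      return Optional.of(expected);
    }
    switch (expected) {
      case LONG:
        return fileType == OrcKind.INT ? Optional.of(OrcKind.LONG) : Optional.empty();
      case DOUBLE:
        return fileType == OrcKind.FLOAT ? Optional.of(OrcKind.DOUBLE) : Optional.empty();
      case TIMESTAMP_INSTANT:
        // Opt-in rule discussed in this thread: read ORC TIMESTAMP as TIMESTAMP_INSTANT.
        return convertTimestampTZ && fileType == OrcKind.TIMESTAMP
            ? Optional.of(OrcKind.TIMESTAMP_INSTANT)
            : Optional.empty();
      default:
        return Optional.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(promotedType(OrcKind.TIMESTAMP, OrcKind.TIMESTAMP_INSTANT, false));
    System.out.println(promotedType(OrcKind.TIMESTAMP, OrcKind.TIMESTAMP_INSTANT, true));
  }
}
```

Keeping the rule inside the promotion helper means the caller still has a single "can this be promoted?" question to ask, and the flag only widens the answer for one case.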
|
@zzzzming95 |
|
@tanvn Yes, I will continue working on this issue this week. |
fa7d6a6 to 77d0b9b
|
@nastra @snazy |
snazy left a comment
On #9784 you mentioned:
I think this is because hive and spark treat timestamp data type as timestamp with time zone and the orc file format is also stored as orc timestamp type. But in fact the hive timestamp data type should be stored as timestamp_instant in the orc file.
This sounds like Hive and Spark treat the timestamp type in a "less flexible way". IIRC Spark 3.4 introduced timestamp_ntz, whereas Hive uses no timezone at all.
The iceberg.orc.convert.timestamptz option introduced with this PR seems to be a global setting, so it affects all tables. I wonder whether this should rather be an ORC type property.
| import static org.assertj.core.api.Assertions.assertThat; | ||
| import static org.assertj.core.api.Assertions.assertThatThrownBy; | ||
|
|
||
| import org.apache.hadoop.conf.Configuration; |
| Type type, | ||
| boolean isRequired, | ||
| Map<Integer, OrcField> mapping, | ||
| Boolean convertTimestampTZ) { |
| config.getBoolean(ConfigProperties.ORC_CONVERT_TIMESTAMPTZ, false); | ||
|
|
||
| if (convertTimestampTZ | ||
| && type.typeId() == Type.TypeID.TIMESTAMP |
Although the The purpose of I am not quite sure what |
|
@deniskuzZ: How is this solved in Hive? |
|
@raunaqmorarka please take a look |
i think it's done in |
#1897 does not have any ORC-specific part. I am still missing a piece of the puzzle... @zzzzming95: What happens if someone writes new rows to this ORC table after migration? |
|
@zzzzming95 |
Sorry, I recently changed jobs, so I don't have an environment to test this case right now. But as I understand it, there shouldn't be a problem, because 'shouldAdjustToUTC()' will find the correct timestamp type. |
|
|
||
| @Test | ||
| public void testOriginalSchemaNameMapping() { | ||
| Configuration config = new Configuration(); |
|
|
||
| @Test | ||
| public void testModifiedSimpleSchemaNameMapping() { | ||
| Configuration config = new Configuration(); |
|
|
||
| @Test | ||
| public void testModifiedComplexSchemaNameMapping() { | ||
| Configuration config = new Configuration(); |
Could we add a test with actual data files in the new, old, and mixed formats? |
Yes, I also think such a unit test is needed. I will try to add it later. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
@zzzzming95 |
|
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
@zzzzming95 I was facing this original issue and tried to solve it the same way you did, but it didn't work out for me: it produced incorrect timestamps, shifted into the future by +5:30 (IST). |
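For reference, a shift of exactly +5:30 is what appears when a wall-clock value written in IST is reinterpreted as if it were UTC. Whether that is what happened in this case is an assumption; the comment above does not say. The sketch below uses only java.time, and TimestampShiftDemo is a hypothetical name.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

final class TimestampShiftDemo {
  private TimestampShiftDemo() {}

  // How far the instant moves when a zone-less wall-clock value is treated as UTC
  // instead of being resolved in the zone where it was written.
  static Duration shift(LocalDateTime wallClock, ZoneId writerZone) {
    Instant treatedAsUtc = wallClock.toInstant(ZoneOffset.UTC);
    Instant resolvedInWriterZone = wallClock.atZone(writerZone).toInstant();
    return Duration.between(resolvedInWriterZone, treatedAsUtc);
  }

  public static void main(String[] args) {
    LocalDateTime wallClock = LocalDateTime.of(2024, 1, 1, 10, 0);
    // Prints PT5H30M: the value lands 5 hours 30 minutes in the future.
    System.out.println(shift(wallClock, ZoneId.of("Asia/Kolkata")));
  }
}
```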
What changes were proposed in this pull request?
A user was attempting to convert an ORC-backed external table in Hive to an Iceberg table using the migrate command, but was immediately met with a "Can not promote TIMESTAMP to TIMESTAMP" error. This occurs because our Spark -> Iceberg conversion code always converts to Timestamp.withZone.
Related issue: #9784
This PR adds a config, iceberg.orc.convert.timestamptz. When it is set to true, ORC TIMESTAMP is automatically converted to ORC TIMESTAMP_INSTANT, which fixes this issue.
Why are the changes needed?
To fix the issue above.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test.