
Support convert orc timestamptz #9905

Closed
zzzzming95 wants to merge 6 commits into apache:main from zzzzming95:support_convert_orc_timestamptz

Conversation

@zzzzming95

What changes were proposed in this pull request?

A user was attempting to convert an ORC-backed external table in Hive to an Iceberg table using the migrate command, but was immediately met with a "Can not promote TIMESTAMP to TIMESTAMP" error. This occurs because our Spark -> Iceberg conversion code always converts to Timestamp.withZone.

Related issue: #9784

spark.sql("CREATE EXTERNAL TABLE mytable (foo timestamp) STORED AS orc LOCATION '/Users/russellspitzer/Temp/foo'")

spark.sql("INSERT INTO mytable VALUES (now())")

spark.sql("CALL spark_catalog.system.migrate('mytable')")


spark.sql("SELECT * FROM mytable")
java.lang.IllegalArgumentException: Can not promote TIMESTAMP type to TIMESTAMP
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:441)
	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:301)
	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:275)
	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:258)

This PR adds a config, iceberg.orc.convert.timestamptz. When it is set to true, the ORC TIMESTAMP type is automatically converted to ORC TIMESTAMP_INSTANT, which fixes this issue.
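The intended mapping can be sketched as follows. This is a simplified illustration with stand-in enums, not the actual org.apache.iceberg or ORC API; the names IcebergTypeId, OrcTypeKind, and toOrcType are hypothetical:

```java
// Simplified stand-ins for the real Iceberg and ORC type enums (hypothetical names).
enum IcebergTypeId { TIMESTAMP, TIMESTAMP_TZ }
enum OrcTypeKind { TIMESTAMP, TIMESTAMP_INSTANT }

class OrcTimestampMapping {
    // With convertTimestampTZ enabled, a plain ORC TIMESTAMP column is mapped to
    // TIMESTAMP_INSTANT so it lines up with Iceberg's timestamptz after migration.
    static OrcTypeKind toOrcType(IcebergTypeId id, boolean convertTimestampTZ) {
        if (id == IcebergTypeId.TIMESTAMP && convertTimestampTZ) {
            return OrcTypeKind.TIMESTAMP_INSTANT;
        }
        return id == IcebergTypeId.TIMESTAMP_TZ
            ? OrcTypeKind.TIMESTAMP_INSTANT
            : OrcTypeKind.TIMESTAMP;
    }
}
```

With the flag off, nothing changes; with it on, legacy ORC TIMESTAMP columns satisfy the Iceberg schema's timestamptz expectation instead of failing the promotion check.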

Why are the changes needed?

It fixes the migration failure described above (see #9784).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

@tanvn

tanvn commented Jul 9, 2024

@snazy @nastra @rdblue
I confirmed that this issue happens in my environment (Spark 3.4, Iceberg 1.3.1) as well, and it is blocking my team from migrating our Hive tables to Iceberg.

Could you please take a look at this PR? If there is anything I can do, please let me know (cc @zzzzming95 )

Comment thread: .palantir/revapi.yml (Outdated)
old: "method void org.apache.iceberg.PositionDeletesTable.PositionDeletesBatchScan::<init>(org.apache.iceberg.Table,\
\ org.apache.iceberg.Schema, org.apache.iceberg.TableScanContext)"
justification: "Removing deprecated code"
org.apache.iceberg:iceberg-orc:
Contributor

we can't break existing APIs but you could add an overloaded version of that method that takes an additional parameter. That way it won't break the existing API. See also https://iceberg.apache.org/contribute/#adding-new-functionality-without-breaking-apis
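The overload pattern suggested here can be sketched like this. The class and method names below are hypothetical, not the actual ORCSchemaUtil signatures; the point is that the old signature delegates to a new one carrying the extra parameter:

```java
class SchemaConverter {
    // Existing public method: signature unchanged, so current callers keep working.
    static String toOrcTimestampType(String column) {
        // Delegate to the new overload with the old default behavior.
        return toOrcTimestampType(column, false);
    }

    // New overload carries the extra flag without breaking the existing API.
    static String toOrcTimestampType(String column, boolean convertTimestampTZ) {
        return column + (convertTimestampTZ ? ":timestamp_instant" : ":timestamp");
    }
}
```

Existing call sites compile and behave exactly as before, while new code can opt into the flag explicitly.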

@nastra
Contributor

nastra commented Jul 9, 2024

can you please rebase the PR and resolve conflicts? I'll try to take a closer look this week

Member

@snazy snazy left a comment

Can't say much about the general approach wrt type mapping, but left some comments.

import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
Member

Wonder whether the "whole Configuration spiel" is necessary here. Hadoop dependencies are already compileOnly to eventually get rid of those entirely. Wouldn't a simple boolean flag do the same thing in this class?

Author

Fixed: changed it to a plain boolean~

config.getBoolean(ConfigProperties.ORC_CONVERT_TIMESTAMPTZ, false);

if (convertTimestampTZ
&& type.typeId() == Type.TypeID.TIMESTAMP
Member

Guess this deserves a new case TIMESTAMP, not an if here.

Author

Some of the processing logic here involves more than just the timestamp type, so maybe an if is more appropriate.

Member

Fair enough

Contributor

Why is this logic not part of the getPromotedType method?

@tanvn

tanvn commented Jul 10, 2024

@zzzzming95
Do you have time to continue working on this issue? (I would appreciate it if you could.)
If not, I might create a new PR based from this.

@zzzzming95
Author

@tanvn yeah, I will continue this issue this week~

@zzzzming95 zzzzming95 force-pushed the support_convert_orc_timestamptz branch from fa7d6a6 to 77d0b9b on July 14, 2024 11:51
@tanvn

tanvn commented Jul 17, 2024

@nastra @snazy
I think your comments have been addressed by @zzzzming95
Could you take another look please?

Member

@snazy snazy left a comment

On #9784 you mentioned:

I think this is because hive and spark treat timestamp data type as timestamp with time zone and the orc file format is also stored as orc timestamp type. But in fact the hive timestamp data type should be stored as timestamp_instant in the orc file.

This sounds like Hive and Spark treat the timestamp type in a "less flexible way". IIRC Spark 3.4 introduced timestamp_ntz, whereas Hive uses no timezone at all.

The iceberg.orc.convert.timestamptz option introduced with this PR seems to be a global setting, so it affects all tables. I wonder whether this should rather be an ORC type property.

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.hadoop.conf.Configuration;
Member

Please clean those up

Type type,
boolean isRequired,
Map<Integer, OrcField> mapping,
Boolean convertTimestampTZ) {
Member

This shouldn't be a boxed type

config.getBoolean(ConfigProperties.ORC_CONVERT_TIMESTAMPTZ, false);

if (convertTimestampTZ
&& type.typeId() == Type.TypeID.TIMESTAMP
Member

Fair enough

@zzzzming95
Author

On #9784 you mentioned:

I think this is because hive and spark treat timestamp data type as timestamp with time zone and the orc file format is also stored as orc timestamp type. But in fact the hive timestamp data type should be stored as timestamp_instant in the orc file.

This sounds like Hive and Spark treat the timestamp type in a "less flexible way". IIRC Spark 3.4 introduced timestamp_ntz, whereas Hive uses no timezone at all.

The iceberg.orc.convert.timestamptz option introduced with this PR seems to be a global setting, so it affects all tables. I wonder whether this should rather be an ORC type property.

Although the timestamp_ntz type was introduced in Spark 3.4+, that type is actually stored as the array<bigint> data type in the ORC file.

The purpose of iceberg.orc.convert.timestamptz is to give Hive and Spark a compatible way to access the ORC timestamp data type.

I am not quite sure what "I wonder whether this should rather be an ORC type property" means. If it means adding a new type to ORC, such as timestamptz, that still would not solve the incompatible access to the historical ORC timestamp type.

@zzzzming95 zzzzming95 requested review from nastra and snazy July 24, 2024 02:33
@nastra nastra requested review from findepi and pvary and removed request for nastra and snazy July 25, 2024 15:46
@pvary
Contributor

pvary commented Jul 26, 2024

@deniskuzZ: How this is solved in Hive?

@findepi
Member

findepi commented Jul 26, 2024

@raunaqmorarka please take a look

@deniskuzZ
Member

@deniskuzZ: How this is solved in Hive?

I think it's done in Hive: Support timestamp with local zone in Hive3 (#1897)

@pvary
Contributor

pvary commented Jul 29, 2024

@deniskuzZ: How this is solved in Hive?

i think it's done in Hive: Support timestamp with local zone in Hive3 (#1897)

#1897 does not have any ORC-specific part. I'm still missing some part of the puzzle...

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?

@tanvn

tanvn commented Aug 10, 2024

@zzzzming95
Could you take another look at the comments when you have time? 🙇

@zzzzming95
Author

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?
@pvary

Sorry, I've had a recent job change, so I don't have an environment to test this case right now. But I understand that there shouldn't be a problem, because shouldAdjustToUTC() will find the correct timestamp type.
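The argument here can be illustrated with a small sketch. This is a hypothetical helper, not Iceberg's actual shouldAdjustToUTC() implementation: the idea is that the decision is made per data file from that file's own ORC type, so old TIMESTAMP files and newly written TIMESTAMP_INSTANT files can coexist in one migrated table:

```java
// Hypothetical stand-in for the ORC type kinds (not the real ORC API).
enum OrcKind { TIMESTAMP, TIMESTAMP_INSTANT }

class TimestampReadPath {
    // Per-file decision: TIMESTAMP_INSTANT values are already UTC instants;
    // plain TIMESTAMP values only get treated as instants when the convert flag is on.
    static boolean readAsInstant(OrcKind fileType, boolean convertTimestampTZ) {
        return fileType == OrcKind.TIMESTAMP_INSTANT || convertTimestampTZ;
    }
}
```

Under this reasoning, rows written after migration carry TIMESTAMP_INSTANT and are read correctly regardless of the flag, while pre-migration TIMESTAMP files rely on the flag.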


@Test
public void testOriginalSchemaNameMapping() {
Configuration config = new Configuration();
Contributor

Is this used anywhere?


@Test
public void testModifiedSimpleSchemaNameMapping() {
Configuration config = new Configuration();
Contributor

Is this used anywhere?


@Test
public void testModifiedComplexSchemaNameMapping() {
Configuration config = new Configuration();
Contributor

Same as above

@pvary
Contributor

pvary commented Aug 13, 2024

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?
@pvary

Sorry I've had a recent job change, so I don't have an environment to test this case right now. But I understand that there shouldn't be a problem, because 'shouldAdjustToUTC()' will find the correct timestamp type.

Could we add a test with actual data files with the new, old, mixed format?

@zzzzming95
Author

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?
@pvary

Sorry I've had a recent job change, so I don't have an environment to test this case right now. But I understand that there shouldn't be a problem, because 'shouldAdjustToUTC()' will find the correct timestamp type.

Could we add a test with actual data files with the new, old, mixed format?

Yes, I also think such a UT is needed; I will try to add it later.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Oct 22, 2024
@tanvn

tanvn commented Oct 22, 2024

@zzzzming95
Sorry for bothering you again, may I ask if you could dedicate some effort to this?

@github-actions github-actions Bot removed the stale label Oct 23, 2024
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 22, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Nov 30, 2024
@pravin1406

@zzzzming95 I was facing this original issue and went about solving it the way you did, but it didn't work out for me; it was giving an incorrect timestamp (shifted +5:30 into the future, IST).
I see my solution is almost exactly the same as yours. Can you confirm that you tested your PR? Did it work?

