
Support convert orc timestamptz #9905

Closed
zzzzming95 wants to merge 6 commits into apache:main from zzzzming95:support_convert_orc_timestamptz

Conversation

@zzzzming95

What changes were proposed in this pull request?

A user was attempting to convert an ORC-backed external table in Hive to an Iceberg table using the migrate command, but was immediately met with a "Can not promote TIMESTAMP to TIMESTAMP" error. This occurs because our Spark -> Iceberg conversion code always converts to Timestamp.withZone.

Related issue: #9784

spark.sql("CREATE EXTERNAL TABLE mytable (foo timestamp) STORED AS orc LOCATION '/Users/russellspitzer/Temp/foo'")

spark.sql("INSERT INTO mytable VALUES (now())")

spark.sql("CALL spark_catalog.system.migrate('mytable')")


spark.sql("SELECT * FROM mytable")
java.lang.IllegalArgumentException: Can not promote TIMESTAMP type to TIMESTAMP
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:441)
	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:301)
	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:275)
	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:258)

This PR adds a config, iceberg.orc.convert.timestamptz. When it is set to true, the ORC TIMESTAMP type is automatically converted to ORC TIMESTAMP_INSTANT, which fixes this issue.
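The intended mapping can be sketched as follows. This is a simplified illustration with stand-in enums, not the actual org.apache.iceberg or ORC API; the names IcebergTypeId, OrcTypeKind, and toOrcType are hypothetical:

```java
// Simplified stand-ins for the real Iceberg and ORC type enums (hypothetical names).
enum IcebergTypeId { TIMESTAMP, TIMESTAMP_TZ }
enum OrcTypeKind { TIMESTAMP, TIMESTAMP_INSTANT }

class OrcTimestampMapping {
    // With convertTimestampTZ enabled, a plain ORC TIMESTAMP column is mapped to
    // TIMESTAMP_INSTANT so it lines up with Iceberg's timestamptz after migration.
    static OrcTypeKind toOrcType(IcebergTypeId id, boolean convertTimestampTZ) {
        if (id == IcebergTypeId.TIMESTAMP && convertTimestampTZ) {
            return OrcTypeKind.TIMESTAMP_INSTANT;
        }
        return id == IcebergTypeId.TIMESTAMP_TZ
            ? OrcTypeKind.TIMESTAMP_INSTANT
            : OrcTypeKind.TIMESTAMP;
    }
}
```

With the flag off, nothing changes; with it on, legacy ORC TIMESTAMP columns satisfy the Iceberg schema's timestamptz expectation instead of failing the promotion check.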

Why are the changes needed?

It fixes the migration failure described above (see #9784).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

@tanvn

tanvn commented Jul 9, 2024

@snazy @nastra @rdblue
I confirmed that this issue happens in my environment (Spark 3.4, Iceberg 1.3.1) as well, and it is blocking my team from migrating our Hive tables to Iceberg.

Could you please take a look at this PR? If there is anything I can do, please let me know (cc @zzzzming95 )

Comment thread: .palantir/revapi.yml (Outdated)
old: "method void org.apache.iceberg.PositionDeletesTable.PositionDeletesBatchScan::<init>(org.apache.iceberg.Table,\
\ org.apache.iceberg.Schema, org.apache.iceberg.TableScanContext)"
justification: "Removing deprecated code"
org.apache.iceberg:iceberg-orc:
Contributor

we can't break existing APIs but you could add an overloaded version of that method that takes an additional parameter. That way it won't break the existing API. See also https://iceberg.apache.org/contribute/#adding-new-functionality-without-breaking-apis
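The overload pattern suggested here can be sketched like this. The class and method names below are hypothetical, not the actual ORCSchemaUtil signatures; the point is that the old signature delegates to a new one carrying the extra parameter:

```java
class SchemaConverter {
    // Existing public method: signature unchanged, so current callers keep working.
    static String toOrcTimestampType(String column) {
        // Delegate to the new overload with the old default behavior.
        return toOrcTimestampType(column, false);
    }

    // New overload carries the extra flag without breaking the existing API.
    static String toOrcTimestampType(String column, boolean convertTimestampTZ) {
        return column + (convertTimestampTZ ? ":timestamp_instant" : ":timestamp");
    }
}
```

Existing call sites compile and behave exactly as before, while new code can opt into the flag explicitly.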

@nastra
Contributor

nastra commented Jul 9, 2024

can you please rebase the PR and resolve conflicts? I'll try to take a closer look this week

Member

@snazy snazy left a comment

Can't say much about the general approach wrt type mapping, but left some comments.

import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
Member

Wonder whether the "whole Configuration spiel" is necessary here. Hadoop dependencies are already compileOnly to eventually get rid of those entirely. Wouldn't a simple boolean flag do the same thing in this class?

Author

Fixed: changed it to a plain boolean~

config.getBoolean(ConfigProperties.ORC_CONVERT_TIMESTAMPTZ, false);

if (convertTimestampTZ
&& type.typeId() == Type.TypeID.TIMESTAMP
Member

Guess this deserves a new case TIMESTAMP, not an if here.

Author

Some of the processing logic here involves more than just the timestamp type, so maybe an if is more appropriate.

Member

Fair enough

Contributor

Why is this logic not part of the getPromotedType method?

@tanvn

tanvn commented Jul 10, 2024

@zzzzming95
Do you have time to continue working on this issue? (I would appreciate it if you could.)
If not, I might create a new PR based from this.

@zzzzming95
Author

@tanvn yeah, I will continue this issue this week~

@zzzzming95 zzzzming95 force-pushed the support_convert_orc_timestamptz branch from fa7d6a6 to 77d0b9b on July 14, 2024 11:51
@tanvn

tanvn commented Jul 17, 2024

@nastra @snazy
I think your comments have been addressed by @zzzzming95
Could you take another look please?

Member

@snazy snazy left a comment

On #9784 you mentioned:

I think this is because hive and spark treat timestamp data type as timestamp with time zone and the orc file format is also stored as orc timestamp type. But in fact the hive timestamp data type should be stored as timestamp_instant in the orc file.

This sounds like Hive and Spark treat the timestamp type in a "less flexible way". IIRC Spark 3.4 introduced timestamp_ntz, whereas Hive uses no timezone at all.

The iceberg.orc.convert.timestamptz option introduced with this PR seems to be a global setting, so it affects all tables. I wonder whether this should rather be an ORC type property.

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.hadoop.conf.Configuration;
Member

Please clean those up

Type type,
boolean isRequired,
Map<Integer, OrcField> mapping,
Boolean convertTimestampTZ) {
Member

This shouldn't be a boxed type

config.getBoolean(ConfigProperties.ORC_CONVERT_TIMESTAMPTZ, false);

if (convertTimestampTZ
&& type.typeId() == Type.TypeID.TIMESTAMP
Member

Fair enough

@zzzzming95
Author

On #9784 you mentioned:

I think this is because hive and spark treat timestamp data type as timestamp with time zone and the orc file format is also stored as orc timestamp type. But in fact the hive timestamp data type should be stored as timestamp_instant in the orc file.

This sounds like Hive and Spark treat the timestamp type in a "less flexible way". IIRC Spark 3.4 introduced timestamp_ntz, whereas Hive uses no timezone at all.

The iceberg.orc.convert.timestamptz option introduced with this PR seems to be a global setting, so it affects all tables. I wonder whether this should rather be an ORC type property.

Although the timestamp_ntz type was introduced in Spark 3.4+, that type is actually stored as the array<bigint> data type in the ORC file.

The purpose of iceberg.orc.convert.timestamptz is to give Hive and Spark a compatible way to access the ORC timestamp data type.

I am not quite sure what "I wonder whether this should rather be an ORC type property" means. If it means adding a new type to ORC, such as timestamptz, that still would not solve the incompatible access to the historical ORC timestamp type.

@zzzzming95 zzzzming95 requested review from nastra and snazy July 24, 2024 02:33
@nastra nastra requested review from findepi and pvary and removed request for nastra and snazy July 25, 2024 15:46
@pvary
Contributor

pvary commented Jul 26, 2024

@deniskuzZ: How this is solved in Hive?

@findepi
Member

findepi commented Jul 26, 2024

@raunaqmorarka please take a look

@deniskuzZ
Member

@deniskuzZ: How this is solved in Hive?

I think it's done in Hive: Support timestamp with local zone in Hive3 (#1897)

@pvary
Contributor

pvary commented Jul 29, 2024

@deniskuzZ: How this is solved in Hive?

i think it's done in Hive: Support timestamp with local zone in Hive3 (#1897)

#1897 does not have any ORC-specific part. I'm still missing some part of the puzzle...

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?

@tanvn

tanvn commented Aug 10, 2024

@zzzzming95
Could you take another look at the comments when you have time? 🙇

@zzzzming95
Author

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?
@pvary

Sorry, I've had a recent job change, so I don't have an environment to test this case right now. But I understand that there shouldn't be a problem, because shouldAdjustToUTC() will find the correct timestamp type.
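The argument here can be illustrated with a small sketch. This is a hypothetical helper, not Iceberg's actual shouldAdjustToUTC() implementation: the idea is that the decision is made per data file from that file's own ORC type, so old TIMESTAMP files and newly written TIMESTAMP_INSTANT files can coexist in one migrated table:

```java
// Hypothetical stand-in for the ORC type kinds (not the real ORC API).
enum OrcKind { TIMESTAMP, TIMESTAMP_INSTANT }

class TimestampReadPath {
    // Per-file decision: TIMESTAMP_INSTANT values are already UTC instants;
    // plain TIMESTAMP values only get treated as instants when the convert flag is on.
    static boolean readAsInstant(OrcKind fileType, boolean convertTimestampTZ) {
        return fileType == OrcKind.TIMESTAMP_INSTANT || convertTimestampTZ;
    }
}
```

Under this reasoning, rows written after migration carry TIMESTAMP_INSTANT and are read correctly regardless of the flag, while pre-migration TIMESTAMP files rely on the flag.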


@Test
public void testOriginalSchemaNameMapping() {
Configuration config = new Configuration();
Contributor

Is this used anywhere?


@Test
public void testModifiedSimpleSchemaNameMapping() {
Configuration config = new Configuration();
Contributor

Is this used anywhere?


@Test
public void testModifiedComplexSchemaNameMapping() {
Configuration config = new Configuration();
Contributor

Same as above

@pvary
Contributor

pvary commented Aug 13, 2024

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?
@pvary

Sorry I've had a recent job change, so I don't have an environment to test this case right now. But I understand that there shouldn't be a problem, because 'shouldAdjustToUTC()' will find the correct timestamp type.

Could we add a test with actual data files with the new, old, mixed format?

@zzzzming95
Author

@zzzzming95: What happens if someone writes to this ORC table some new rows after migration?
@pvary

Sorry I've had a recent job change, so I don't have an environment to test this case right now. But I understand that there shouldn't be a problem, because 'shouldAdjustToUTC()' will find the correct timestamp type.

Could we add a test with actual data files with the new, old, mixed format?

Yes, I also think such a UT is needed; I will try to add it later.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Oct 22, 2024
@tanvn

tanvn commented Oct 22, 2024

@zzzzming95
Sorry for bothering you again, may I ask if you could dedicate some effort to this?

@github-actions github-actions Bot removed the stale label Oct 23, 2024
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 22, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Nov 30, 2024
@pravin1406

@zzzzming95 I was facing this original issue and went about solving it the way you did, but it didn't work out for me; it was giving an incorrect timestamp (shifted +5:30 into the future, IST).
I see my solution is almost exactly the same as yours. Can you confirm that you tested your PR? Did it work?

