Hive: Using Hive schema to create tables and partition specification #1612
Conversation
@rdblue, @pvary, @marton-bod: If you have time, could you please review this one?
I'm planning on spending most of Friday reviewing. Sorry for the delay!
@rdblue: I will be on PTO next week and I will do my best not to open my laptop during that time 😄. So it is absolutely ok if you review the patch only sometime next week. It would be good to have the main points identified here and in #1407 as well, so after the PTO I can start fixing those with fresh energy 😄 I am starting to feel more and more convinced that we should not reuse the Hive PARTITIONED BY syntax.
If you agree, I would throw an error in the next iteration of this PR when the user tries to create a partitioned table with the PARTITIONED BY clause.
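For illustration, a minimal sketch of what such a guard could look like, assuming access to the Hive metastore Table object in the meta hook (the method name here is hypothetical, not code from this PR):

```java
import org.apache.hadoop.hive.metastore.api.Table;

// Hypothetical guard: reject Hive-style PARTITIONED BY at table creation
static void validateNoHivePartitioning(Table hmsTable) {
  if (hmsTable.getPartitionKeysSize() > 0) {
    throw new UnsupportedOperationException(
        "Creating partitioned Iceberg tables with PARTITIONED BY is not supported");
  }
}
```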
I'm not quite convinced that not supporting the Hive PARTITIONED BY syntax is the right approach. In the long term, I think we do want Iceberg partitioning to be exposed in the normal way for Hive, because it would be confusing for a partitioned Iceberg table to show up as unpartitioned. That said, there are significant differences between the two partitioning approaches.
The differences may be significant enough that it would cause problems to expose even Iceberg identity partitions to Hive. For example, if Hive expects to get a partition key and fill in data values, then that would be a problem. What are the chances of integrating Iceberg into Hive itself and solving some of these limitations?
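For context, a small illustration (hypothetical schema, not from this PR) of Iceberg partition transforms that have no direct Hive PARTITIONED BY equivalent:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// Hypothetical schema, just to show transform-based partitioning
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.required(2, "ts", Types.TimestampType.withZone()));

// day() and bucket() transforms cannot be expressed as Hive partition
// columns; only identity partitions map naturally to Hive's model
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .day("ts")
    .bucket("id", 16)
    .build();
```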
This PR throws an exception when the scale specified in a filter differs from the Iceberg scale. Even if 100.1 is cast as decimal(7, 2), Hive still throws the same exception.
Thanks for letting us know, @qphien. Would you like to create an issue for that problem? It doesn't look related to this PR.
Updated the patch to throw an exception if PARTITIONED BY is used.
If we want to change the Hive syntax, we would need Hive to depend on Iceberg. Iceberg already has two dependencies on Hive, so that would mean we have a cyclic dependency, which could be problematic.
@rdblue: The tests added by @lcspinter in #1740 highlighted issues with the timestamp handling.
I was thinking about separating the schema creation and the timestamp fix (since they are only related because of the additional check I added for table creation), but my first attempt ended up in a mess of half-fixed tests and such. Thanks!
+1
… specific catalog test classes
- CHAR - TypeInfo generation instead of NestedField in conversion
Also 2 small fixes
Addressed review comments
… Iceberg interpretation.
shardulm94 left a comment
Thanks @pvary for your patience through this. We should be very close to getting this in!
```java
  tableSchema = Catalogs.loadTable(configuration, serDeProperties).schema();
} catch (NoSuchTableException nte) {
  throw new SerDeException("Please provide an existing table or a valid schema", nte);
// ...
if (Catalogs.hiveCatalog(configuration)) {
```
What is the reason behind handling HiveCatalogs separately?
HiveTableOperations converts the table schema to Hive columns / StorageDescriptor when any change is committed to the table. This means that the Iceberg schema and the Hive schema are always synchronized.
Because of this synchronization, I think it is better to use the "cached" schema instead of loading the table again and again. This might change when we clean up Timestamps / UUIDs, since the mapping is not one-to-one there, but I would leave something for that new PR too 😄
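A rough sketch of the synchronization described above, assuming a hypothetical HiveTypeUtil.convert helper for the Iceberg-to-Hive type conversion:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.hive.metastore.api.FieldSchema;

// On commit, the Iceberg schema is written back as Hive columns on the
// table's StorageDescriptor, keeping the two schemas in sync.
// HiveTypeUtil.convert is a hypothetical Iceberg-to-Hive type converter.
List<FieldSchema> hiveColumns = icebergSchema.columns().stream()
    .map(field -> new FieldSchema(field.name(),
        HiveTypeUtil.convert(field.type()), field.doc()))
    .collect(Collectors.toList());
storageDescriptor.setCols(hiveColumns);
```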
Thought better of it.
Keep it simple and general, and branch only if there is a clear benefit in the final solution.
…imal, but later we need that anyway
Also reverting some formatting-only changes
shardulm94 left a comment
Thanks @pvary for working on this. Looks good to me! I understand that some places where we determine schema may be suboptimal given the lifecycle of Hive SerDe objects, but I think this code is much simpler to reason about. We can probably do future updates to make the code more efficient, but we can also take this opportunity to see if Hive's lifecycle for SerDe objects can be improved.
```java
case VARCHAR:
  throw new IllegalArgumentException("Unsupported Hive type (" +
      ((PrimitiveTypeInfo) typeInfo).getPrimitiveCategory() +
      ") for Iceberg tables. Consider using STRING type instead.");
```
I would be fine mapping these to string.
Let's discuss this on the dev list, where we are talking about the schema mapping.
If we decide that the Iceberg schema is the master and we always convert from there to the Hive schema, then we can relax the one-to-one mapping restriction and convert multiple Iceberg types to a single Hive type.
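To make that concrete, a sketch (not the behavior merged here) of how the relaxed, many-to-one mapping could look on the Hive-to-Iceberg side:

```java
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

// Sketch of a relaxed mapping: CHAR, VARCHAR and STRING would all
// convert to Iceberg's single string type
static Type convertPrimitive(PrimitiveTypeInfo typeInfo) {
  switch (typeInfo.getPrimitiveCategory()) {
    case CHAR:
    case VARCHAR:
    case STRING:
      return Types.StringType.get();
    default:
      throw new IllegalArgumentException("Unsupported Hive type: " + typeInfo);
  }
}
```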
Thanks for all your work on this, @pvary! I'll merge it. Thanks for reviewing, @shardulm94!
Big thanks @rdblue and @shardulm94 for following this through!
As discussed on #1495, we should create the table specification from the columns in the table creation command.
This PR does this.
Here are the changes:
- The table schema and the partition specification are now created from the columns provided in the CREATE TABLE command instead of the serDeProperties.
Changes which are worth double-checking:
- If the user creates a table with a CRATE TABLE ... PARTITIONED BY command, then the resulting Iceberg table will be partitioned with identity partitions, but the Hive table itself will not be partitioned (see the sketch below). This was needed since the read path is working with partitioned tables, and I do not see any good way to solve this, since Hive wants to read the partitions one-by-one.
- withLocation is called even if the provided location is set to defaultLocation, so I do not have to branch the code in Catalogs.
- HiveCatalog will be using the default table location in Hive. When using other Catalogs, the LOCATION should be provided in the CREATE TABLE command.
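As a sketch of the identity-partitioning behavior in the first point (the method name is illustrative, not the PR's actual code):

```java
import java.util.List;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;

// Each PARTITIONED BY column becomes an identity partition in the
// Iceberg partition spec, while the Hive table stays unpartitioned
static PartitionSpec identitySpec(Schema schema, List<FieldSchema> partitionKeys) {
  PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
  for (FieldSchema partitionKey : partitionKeys) {
    builder = builder.identity(partitionKey.getName());
  }
  return builder.build();
}
```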