HiveMetaHook implementation to enable CREATE TABLE and DROP TABLE from Hive queries #1495
Conversation
If you have time, could you please review and/or try it:

@pvary, why does this set the Iceberg schema using a table property?

I think I understand why the schema is used, but I'd like to use the types from the Hive DDL if possible. I'm not sure whether that would need to change in this PR or in #1481.
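For context, a minimal sketch of how a schema ends up in a table property: the schema is built programmatically and serialized with `SchemaParser.toJson`, and the resulting JSON string is what the DDL carries. The `iceberg.mr.table.schema` property key named in the comment below is an assumption based on the MR module's config keys, not something confirmed in this thread.

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.types.Types;

public class SchemaPropertySketch {
  public static void main(String[] args) {
    // Build the Iceberg schema programmatically instead of deriving it
    // from the Hive DDL column list.
    Schema schema = new Schema(
        Types.NestedField.optional(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "first_name", Types.StringType.get()));

    // The JSON string is what a CREATE TABLE statement would pass in
    // TBLPROPERTIES, e.g. 'iceberg.mr.table.schema'='<json>' (key assumed).
    System.out.println(SchemaParser.toJson(schema));
  }
}
```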
It will be quite convoluted to access the Hive DDL columns and types. It will add another layer of complexity which I would like to address in another PR to keep the changes more reviewer friendly 😄
I did a little bit of testing with this on a distributed Hive cluster. Results below.

It might be good to have a HiveRunner test that creates a table, does an insert and then reads the values back to check that all of that works end to end?
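A HiveRunner test along those lines might look like the sketch below. The class name, table name, and the expected tab-delimited row format are assumptions, and HiveRunner's `StandaloneHiveRunner`/`HiveShell` are expected to be on the test classpath.

```java
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;

@RunWith(StandaloneHiveRunner.class)
public class TestCreateInsertReadEndToEnd {

  @HiveSQL(files = {})
  private HiveShell shell;

  @Test
  public void testCreateInsertRead() {
    // Create an Iceberg-backed table through the storage handler.
    shell.execute("CREATE TABLE customers (id bigint, first_name string) " +
        "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'");

    // Insert a row, then read it back to verify the full round trip.
    shell.execute("INSERT INTO customers VALUES (999, 'some first name')");
    List<String> rows = shell.executeQuery("SELECT id, first_name FROM customers");

    Assert.assertEquals(1, rows.size());
    // HiveRunner joins columns with a tab by default (assumed here).
    Assert.assertEquals("999\tsome first name", rows.get(0));
  }
}
```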
Really appreciate your help here! Thanks!
I would expect this to work (with #1407 also applied): `insert into foo.iceberg_customers select 999, "some first name";` or `insert into foo.iceberg_customers values (999, "some first name");` To be honest, I have never used "named_struct" before 😄
That's the plan, but we need the writer code to get in first 😄

Ah yes, of course, I'm losing track of the different pull requests ;) The reason I tried it: I then set the execution engine to MR and got …
I think it is necessary to add a converter to transform the Hive DDL schema to an Iceberg schema instead of specifying the Iceberg schema in TBLPROPERTIES. Maybe we can use

```sql
-- Managed Table
CREATE TABLE icebergTable (id int, day string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES('iceberg.mr.table.partition.spec'='day:day')

-- External Table
CREATE EXTERNAL TABLE icebergTable (id int, day string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://path/to/table'
TBLPROPERTIES('iceberg.mr.table.partition.spec'='id:identity')
```

to create managed/external tables.
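A minimal sketch of what such a converter could look like, using a hypothetical helper that maps a few Hive primitive type names onto Iceberg types (the real change would have to cover the full type system, including nested types):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class HiveDdlSchemaConverter {

  // Convert the Hive DDL columns (as stored in the HMS) into an Iceberg schema.
  public static Schema convert(List<FieldSchema> columns) {
    List<Types.NestedField> fields = new ArrayList<>();
    int id = 1;
    for (FieldSchema column : columns) {
      fields.add(Types.NestedField.optional(id++, column.getName(),
          primitiveType(column.getType())));
    }
    return new Schema(fields);
  }

  // Only a handful of primitive types are handled in this sketch.
  private static Type primitiveType(String hiveTypeName) {
    switch (hiveTypeName.toLowerCase()) {
      case "int": return Types.IntegerType.get();
      case "bigint": return Types.LongType.get();
      case "string": return Types.StringType.get();
      case "double": return Types.DoubleType.get();
      default:
        throw new UnsupportedOperationException("Unsupported Hive type: " + hiveTypeName);
    }
  }
}
```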
I agree that we want this to be the behavior. @pvary, what is the plan for using the DDL schema instead of a table property?

I agree with you on this. It will be quite convoluted to access the Hive DDL columns and types. It will add another layer of complexity which I would like to address in another PR to keep the changes more reviewer friendly 😄
```java
// is created yet.
// - When we are compiling the Hive query on HiveServer2 side - We only have table information (location/name),
//   and we have to read the schema using the table data. This is called multiple times so there is room for
//   optimizing here.
```
SchemaParser has a cache, so that should help some.
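For illustration, repeated parsing of the same serialized schema should mostly hit that cache; a small sketch (the class and method names here are hypothetical):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;

public class SchemaCacheSketch {
  // Parse the serialized schema twice; SchemaParser caches parsed schemas
  // keyed by the JSON string, so repeated compile-time lookups stay cheap.
  public static Schema parseTwice(String schemaJson) {
    Schema first = SchemaParser.fromJson(schemaJson);
    Schema second = SchemaParser.fromJson(schemaJson); // expected cache hit
    return second != null ? second : first;
  }
}
```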
…m Hive queries (cherry picked from commit c242616)
(cherry picked from commit bc13a60)
- Production code change: for HiveCatalog, do not try to load the table in preCreateTable - it should not exist anyway
- Test code:
  - Remove stat related table properties when checking
  - Remove needToCheckSnapshotLocation(), use Catalogs.hiveCatalog instead
  - Add new test to check Hive table creation above an existing Iceberg table
  - locationForCreateTable for HadoopTable should use the same location as the createIcebergTable method
…e HMS after every test
Merged! Thanks @pvary for all your work on this one!
```java
public void commitCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable) {
  if (icebergTable == null) {
    if (Catalogs.hiveCatalog(conf)) {
      catalogProperties.put(TableProperties.ENGINE_HIVE_ENABLED, true);
```
@pvary @rdblue @massdosage

In the Iceberg documentation I could see that:

"To enable Hive support globally for an application, set iceberg.engine.hive.enabled=true in its Hadoop configuration."

The value of iceberg.engine.hive.enabled must be true in order to enable Hive support. However, I could still use Iceberg's Hive support even if I set iceberg.engine.hive.enabled to false. If I understand correctly, the value of "iceberg.engine.hive.enabled" is irrelevant, because if it is a Hive catalog we are setting engine.hive.enabled to true.

In hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java (line 486 in 01bc864):

`if (metadata.properties().get(TableProperties.ENGINE_HIVE_ENABLED) != null) {`

we are initially evaluating the 'engine.hive.enabled' value. In cases where it's a Hive catalog, this value is consistently 'true', rendering 'iceberg.engine.hive.enabled' unnecessary.

Let me know if I missed anything in my understanding.
@shivjha30: The code line you have highlighted is for overriding the global settings at the table level. So you have 2 levels to enable the Hive engine (see the sketch below):
- Catalog level (ConfigProperties.ENGINE_HIVE_ENABLED) - you can set the config in the Hadoop configuration of the HiveCatalog
- Table level (TableProperties.ENGINE_HIVE_ENABLED) - you can set the config in the table properties
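A small sketch of setting both levels (the wrapper class is illustrative; `ConfigProperties` lives in the hive-metastore module and `TableProperties` in core):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.hive.ConfigProperties;

public class EngineHiveEnabledSketch {
  public static void main(String[] args) {
    // Catalog level: set in the Hadoop configuration used by the HiveCatalog;
    // applies to every table the catalog creates unless overridden.
    Configuration conf = new Configuration();
    conf.setBoolean(ConfigProperties.ENGINE_HIVE_ENABLED, true);

    // Table level: set as a table property; overrides the catalog level
    // setting for this one table.
    Map<String, String> tableProps = new HashMap<>();
    tableProps.put(TableProperties.ENGINE_HIVE_ENABLED, "true");
  }
}
```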
Thanks @pvary for getting back.

I have understood the functionality of the two levels concerning the Hive engine activation. If I'm not mistaken, even when we set ConfigProperties.ENGINE_HIVE_ENABLED to false in the Hadoop configuration, we are passing TableProperties.ENGINE_HIVE_ENABLED as true in HiveIcebergMetaHook#commitCreateTable if it's of HiveCatalog type. This renders the catalog level configuration irrelevant.
You are using the metahook if you are using Hive to create the table. We expect that you want to be able to read the table from Hive in this case, so you need the storage handlers to be set. In this case, you need the storage handlers on the classpath of the other readers as well.
This patch enables CREATE TABLE (managed or external) and DROP TABLE commands from Hive for tables stored by the Iceberg storage handler.
The backing Iceberg table is created/dropped automatically using the new Catalogs interface methods introduced by #1481. With the help of #1407, this patch will enable the CREATE/INSERT/DROP path using Hive queries backed by Iceberg tables.
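For orientation, the Hive-side contract is the `HiveMetaHook` interface; a skeletal hook in the spirit of this patch might look like the sketch below (class name and method bodies are illustrative only, not the code in this PR):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.metastore.api.Table;

// Hive calls these methods around CREATE TABLE / DROP TABLE, which lets the
// backing Iceberg table be created and dropped in lock step with the HMS table.
public class IcebergMetaHookSketch implements HiveMetaHook {
  private final Configuration conf;

  public IcebergMetaHookSketch(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public void preCreateTable(Table hmsTable) {
    // Validate the request and collect catalog properties; for a HiveCatalog
    // the backing Iceberg table must not exist yet.
  }

  @Override
  public void rollbackCreateTable(Table hmsTable) {
    // Clean up any Iceberg metadata written by a failed create.
  }

  @Override
  public void commitCreateTable(Table hmsTable) {
    // Create the backing Iceberg table once the HMS table is committed.
  }

  @Override
  public void preDropTable(Table hmsTable) {
    // Load the Iceberg table so it can be dropped after the HMS drop.
  }

  @Override
  public void rollbackDropTable(Table hmsTable) {
    // Nothing to undo in this sketch.
  }

  @Override
  public void commitDropTable(Table hmsTable, boolean deleteData) {
    // Drop the backing Iceberg table, purging data files if requested.
  }
}
```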
The patch consists of the following main changes: