
Conversation

@pvary (Contributor) commented Oct 14, 2020

As discussed in #1495, we should create the table specification from the columns in the table creation command. This PR implements that.

Here are the changes:

  • Create the Iceberg schema using the serDeProperties
  • Create the Iceberg partitioning specification from the partition columns defined in the CREATE TABLE command (see the sketch after this list)
  • Add tests that read the tables after creating them.
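
For illustration, a minimal sketch (not the PR's actual code) of what the conversion amounts to, using Iceberg's public Schema and PartitionSpec APIs; the column names and field IDs are made up:

import java.util.Arrays;
import java.util.List;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class SchemaFromDdlSketch {
  public static void main(String[] args) {
    // Columns as they would appear in CREATE TABLE (id BIGINT, data STRING),
    // plus the PARTITIONED BY (part_col STRING) column, which also becomes
    // a regular column of the Iceberg schema
    Schema schema = new Schema(
        Types.NestedField.optional(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()),
        Types.NestedField.optional(3, "part_col", Types.StringType.get()));

    // Every PARTITIONED BY column is mapped to an identity partition field
    List<String> partitionColumns = Arrays.asList("part_col");
    PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
    partitionColumns.forEach(builder::identity);
    PartitionSpec spec = builder.build();

    System.out.println(spec);
  }
}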

Changes worth double-checking:

  • If we create a Hive table with a CREATE TABLE ... PARTITIONED BY command, the resulting Iceberg table will be partitioned with identity partitions, but the Hive table itself will not be partitioned. This was needed because of how the read path works with partitioned tables: Hive wants to read the partitions one by one, and I do not see a good way to support that.
  • HadoopCatalog previously prevented setting the location when creating a new Iceberg table. I changed it to allow calling withLocation when the provided location matches the default location, so I do not have to branch the code in Catalogs.
  • Only HiveCatalog will use the default table location from Hive. With other catalogs, the LOCATION should be provided in the CREATE TABLE command.

@pvary changed the title from "Using Hive schema to create tables and partition specification" to "Hive: Using Hive schema to create tables and partition specification" on Oct 14, 2020
@pvary (Contributor, Author) commented Oct 20, 2020

@rdblue, @pvary, @marton-bod: If you have time, could you please review this one?

@rdblue (Contributor) commented Oct 21, 2020

I'm planning on spending most of Friday reviewing. Sorry for the delay!

@pvary (Contributor, Author) commented Oct 22, 2020

@rdblue: I will be on PTO next week and I will do my best not to open my laptop during that time 😄. So it is absolutely OK if you only get to the review sometime next week. It would be good to have the main points identified here and in #1407 as well, so that after the PTO I can start fixing them with fresh energy 😄

I am becoming more and more convinced that we should not reuse the Hive PARTITIONED BY clause to create partitioned Iceberg tables, for the following reasons:

  • The resulting table will/should be non-partitioned in Hive, which is unintuitive for users (they create a partitioned table and get a non-partitioned one).
  • We can only create identity partition specifications this way, which is very limited compared to the rich partitioning feature set Iceberg already provides.

If you agree, in the next iteration of this PR I would throw an error when the user tries to create a partitioned table with the PARTITIONED BY clause. TBLPROPERTIES could still be used to create the full range of Iceberg partition specs (see the sketch below). In one of my next PRs I will try to find a better way to create partitioned Iceberg tables with a plain SQL statement, without JSON.
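
For reference, a rough sketch of the TBLPROPERTIES route: the spec is built with the Iceberg API and serialized to JSON with PartitionSpecParser. The schema and transforms below are made-up examples, and the exact table property key that would carry the JSON is not shown here:

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.PartitionSpecParser;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class PartitionSpecJsonSketch {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.optional(1, "event_ts", Types.TimestampType.withZone()),
        Types.NestedField.optional(2, "customer_id", Types.LongType.get()));

    // Unlike PARTITIONED BY, this exposes the full Iceberg feature set,
    // e.g. hidden day and bucket transforms instead of identity only
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .day("event_ts")
        .bucket("customer_id", 16)
        .build();

    // This JSON string is what a TBLPROPERTIES entry would carry
    System.out.println(PartitionSpecParser.toJson(spec));
  }
}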

@rdblue (Contributor) commented Oct 23, 2020

I'm not quite convinced that not supporting the Hive PARTITIONED BY clause is the right way to go, but I think it is a reasonable step to get this patch done. We don't need to support it to support the Schema DDL, so it would be fine with me to throw an exception and reject its use for now.

In the long term, I think we do want Iceberg partitioning to be exposed in the normal way for Hive because it would be confusing for a partitioned Iceberg table to show up as unpartitioned. That said, there are significant differences between the two partitioning approaches:

  1. Iceberg partitioning never changes the table schema, but Hive partition columns are always appended at the end of the schema
  2. Hive partition columns can't be changed
  3. Iceberg supports hidden partitions that can't be shown in Hive

The differences may be significant enough that it would cause problems to expose even Iceberg identity partitions to Hive. For example, if Hive expects to get a partition key and fill in data values, then that would be a problem.

What are the chances of integrating Iceberg into Hive itself and solving some of these limitations?

@qphien (Contributor) commented Oct 30, 2020

hive (iceberg)> desc decimal_table;
OK
col_name	data_type	comment
val             decimal(7,2)    from deserializer

hive (iceberg)> select * from decimal_table where val > 100.1;
OK
decimal_table.val
Failed with exception java.io.IOException:org.apache.iceberg.exceptions.ValidationException: Invalid value for conversion to type decimal(7, 2): 100.1 (java.math.BigDecimal)

With this PR, an exception is thrown when the scale specified in the filter differs from the scale of the Iceberg type.

hive (iceberg)> select * from decimal_table where val > cast(100.1 as decimal(7,2));
OK
decimal_table.val
Failed with exception java.io.IOException:org.apache.iceberg.exceptions.ValidationException: Invalid value for conversion to type decimal(7, 2): 100.1 (java.math.BigDecimal)

Even if 100.1 is cast to decimal(7,2), Hive still throws the same exception.
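
Assuming the error comes from Iceberg's literal conversion (which matches the message above), the root cause can be reproduced without Hive: Iceberg only converts a BigDecimal to a decimal type when the scales match exactly, and 100.1 has scale 1 while decimal(7,2) requires scale 2:

import java.math.BigDecimal;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

public class DecimalScaleSketch {
  public static void main(String[] args) {
    // 100.1 has scale 1, but the column type decimal(7,2) has scale 2,
    // so the conversion returns null; predicate binding then reports the
    // "Invalid value for conversion" ValidationException seen above
    System.out.println(Literal.of(new BigDecimal("100.1"))
        .to(Types.DecimalType.of(7, 2)));  // null

    // With a matching scale the conversion succeeds
    System.out.println(Literal.of(new BigDecimal("100.10"))
        .to(Types.DecimalType.of(7, 2)));  // 100.10
  }
}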

@rdblue (Contributor) commented Oct 30, 2020

Thanks for letting us know, @qphien. Would you like to create an issue for that problem? It doesn't look related to this PR.

@qphien (Contributor) commented Nov 2, 2020

OK, I have created another issue: #1699

@pvary (Contributor, Author) commented Nov 2, 2020

> I'm not quite convinced that not supporting the Hive PARTITIONED BY clause is the right way to go, but I think it is a reasonable step to get this patch done. We don't need to support it to support the Schema DDL, so it would be fine with me to throw an exception and reject its use for now.

Updated the patch to throw an exception if PARTITIONED BY is used.

> What are the chances of integrating Iceberg into Hive itself and solving some of these limitations?

If we want to change the Hive syntax, we would need Hive to depend on Iceberg.

Iceberg currently has two dependencies on Hive:

  • the HMS API used to store the snapshot
  • the SerDe implementation

That would mean a cyclic dependency, which could be problematic.
(Edit) Maybe moving the SerDe implementation to the Hive repo could narrow Iceberg's dependency on Hive enough to avoid serious headaches.

@pvary (Contributor, Author) commented Nov 12, 2020

@rdblue: The tests added by @lcspinter in #1740 highlighted issues with the timestamp handling.
I ended up with the following solution:

  • Hive 2: there is only TIMESTAMP in Hive, so both Iceberg timestamp types (with and without timezone) are converted to Hive TIMESTAMP
  • Hive 3: there are TIMESTAMP and TIMESTAMP WITH LOCAL TIMEZONE, so an Iceberg timestamp without timezone is converted to Hive TIMESTAMP, and an Iceberg timestamp with timezone is converted to TIMESTAMP WITH LOCAL TIMEZONE (see the sketch after this list)
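
A minimal sketch of that mapping (the helper method and the returned Hive type names are illustrative, not the PR's actual code):

import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class TimestampMappingSketch {
  // Maps an Iceberg timestamp type to a Hive type name, following the rules above
  static String hiveTypeName(Type icebergType, boolean hive3) {
    if (icebergType instanceof Types.TimestampType) {
      boolean withZone = ((Types.TimestampType) icebergType).shouldAdjustToUTC();
      if (withZone && hive3) {
        return "timestamp with local time zone";  // only available in Hive 3
      }
      return "timestamp";  // Hive 2 target for both variants; Hive 3 for no-TZ
    }
    throw new IllegalArgumentException("Not a timestamp type: " + icebergType);
  }

  public static void main(String[] args) {
    System.out.println(hiveTypeName(Types.TimestampType.withZone(), false));   // timestamp
    System.out.println(hiveTypeName(Types.TimestampType.withZone(), true));    // timestamp with local time zone
    System.out.println(hiveTypeName(Types.TimestampType.withoutZone(), true)); // timestamp
  }
}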

I was thinking about separating the schema creation and the timestamp fix (they are only related through the additional check I added for table creation), but my first attempt ended up in a mess of half-fixed tests.
Could you review the patch as it is, or should I spend more cycles separating the two?

Thanks,
Peter

@marton-bod (Collaborator) commented:

+1

@shardulm94 (Contributor) left a comment:

Thanks @pvary for your patience through this. We should be very close to getting this in!

  tableSchema = Catalogs.loadTable(configuration, serDeProperties).schema();
} catch (NoSuchTableException nte) {
  throw new SerDeException("Please provide an existing table or a valid schema", nte);
}
if (Catalogs.hiveCatalog(configuration)) {
A reviewer (Contributor) commented:

What is the reason behind handling HiveCatalogs separately?

@pvary (Contributor, Author) replied:

HiveTableOperations converts the table schema to Hive columns / StorageDescriptor whenever a change is committed to the table, which means the Iceberg schema and the Hive schema are always synchronized.
Given that synchronization, I think it is better to use the "cached" schema instead of loading the table again and again. This might change when we clean up timestamps / UUIDs, since the mapping is not one-to-one there, but I would leave something for that new PR too 😄
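
A rough sketch of the trade-off being described, assuming the commit path serializes the schema into the SerDe properties; the InputFormatConfig.TABLE_SCHEMA property key is my assumption, not necessarily what this PR uses:

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.mr.Catalogs;
import org.apache.iceberg.mr.InputFormatConfig;

public class CachedSchemaSketch {
  // Prefer the schema already pushed into the SerDe properties ("cached"),
  // and only fall back to loading the table, which may hit the catalog on
  // every SerDe initialization
  static Schema resolveSchema(Configuration configuration, Properties serDeProperties) {
    String schemaJson = serDeProperties.getProperty(InputFormatConfig.TABLE_SCHEMA);
    if (schemaJson != null) {
      return SchemaParser.fromJson(schemaJson);  // no catalog round-trip
    }
    return Catalogs.loadTable(configuration, serDeProperties).schema();
  }
}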

@pvary (Contributor, Author) replied:

I thought better of it.
Let's keep it simple and general, and only branch if there is a clear benefit in the final solution.

@shardulm94 (Contributor) left a comment:

Thanks @pvary for working on this. Looks good to me! I understand that some places where we determine schema may be suboptimal given the lifecycle of Hive SerDe objects, but I think this code is much simpler to reason about. We can probably do future updates to make the code more efficient, but we can also take this opportunity to see if Hive's lifecycle for SerDe objects can be improved.

case VARCHAR:
  throw new IllegalArgumentException("Unsupported Hive type (" +
      ((PrimitiveTypeInfo) typeInfo).getPrimitiveCategory() +
      ") for Iceberg tables. Consider using STRING type instead.");
A reviewer (Contributor) commented:

I would be fine mapping these to string.

@pvary (Contributor, Author) replied:

Let's discuss this on the dev list, where we are talking about the schema mapping.
If we decide that the Iceberg schema is the master and we always convert from it to the Hive schema, then we can relax the one-to-one mapping restriction and convert multiple Iceberg types to a single Hive type.

@rdblue merged commit e69e521 into apache:master on Nov 26, 2020
@rdblue (Contributor) commented Nov 26, 2020

Thanks for all your work on this, @pvary! I'll merge it.

Thanks for reviewing, @shardulm94!

@pvary deleted the hiveschema branch on November 26, 2020 at 07:50
@pvary (Contributor, Author) commented Nov 26, 2020

Big thanks @rdblue and @shardulm94 for following this through!

anuragmantri added a commit to anuragmantri/iceberg that referenced this pull request Jul 25, 2025