
Conversation

@pvary (Contributor) commented Oct 14, 2020

As discussed in #1495, we should create the table specification from the columns in the table creation command. This PR implements that.

Here are the changes:

  • Create the Iceberg schema using the serDeProperties
  • Create the Iceberg partitioning specification from the partition columns defined in the CREATE TABLE command (see the sketch after this list)
  • Add tests that read the tables after creating them.
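
For illustration, a minimal sketch (not the PR's actual code) of what the conversion amounts to, using Iceberg's public Schema and PartitionSpec APIs; the column names and field IDs are made up:

import java.util.Arrays;
import java.util.List;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class SchemaFromDdlSketch {
  public static void main(String[] args) {
    // Columns as they would appear in CREATE TABLE (id BIGINT, data STRING),
    // plus the PARTITIONED BY (part_col STRING) column, which also becomes
    // a regular column of the Iceberg schema
    Schema schema = new Schema(
        Types.NestedField.optional(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()),
        Types.NestedField.optional(3, "part_col", Types.StringType.get()));

    // Every PARTITIONED BY column is mapped to an identity partition field
    List<String> partitionColumns = Arrays.asList("part_col");
    PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
    partitionColumns.forEach(builder::identity);
    PartitionSpec spec = builder.build();

    System.out.println(spec);
  }
}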

Changes worth double-checking:

  • If we create a Hive table with a CREATE TABLE ... PARTITIONED BY command, the resulting Iceberg table will be partitioned with identity partitions, but the Hive table itself will not be partitioned. This was needed because of how the read path works with partitioned tables: Hive wants to read the partitions one by one, and I do not see a good way to support that.
  • HadoopCatalog previously prevented setting the location when creating a new Iceberg table. I changed it to allow calling withLocation when the provided location matches the default location, so I do not have to branch the code in Catalogs.
  • Only HiveCatalog will use the default table location from Hive. With other catalogs, the LOCATION should be provided in the CREATE TABLE command.

@pvary changed the title from "Using Hive schema to create tables and partition specification" to "Hive: Using Hive schema to create tables and partition specification" on Oct 14, 2020
@pvary (Contributor, Author) commented Oct 20, 2020

@rdblue, @pvary, @marton-bod: If you have time, could you please review this one?

@rdblue (Contributor) commented Oct 21, 2020

I'm planning on spending most of Friday reviewing. Sorry for the delay!

@pvary (Contributor, Author) commented Oct 22, 2020

@rdblue: I will be on PTO next week and I will do my best not to open my laptop during that time 😄. So it is absolutely OK if you only get to the review sometime next week. It would be good to have the main points identified here and in #1407 as well, so that after the PTO I can start fixing them with fresh energy 😄

I am becoming more and more convinced that we should not reuse the Hive PARTITIONED BY clause to create partitioned Iceberg tables, for the following reasons:

  • The resulting table will/should be non-partitioned in Hive, which is unintuitive for users (they create a partitioned table and get a non-partitioned one).
  • We can only create identity partition specifications this way, which is very limited compared to the rich partitioning feature set Iceberg already provides.

If you agree, in the next iteration of this PR I would throw an error when the user tries to create a partitioned table with the PARTITIONED BY clause. TBLPROPERTIES could still be used to create the full range of Iceberg partition specs (see the sketch below). In one of my next PRs I will try to find a better way to create partitioned Iceberg tables with a plain SQL statement, without JSON.
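
For reference, a rough sketch of the TBLPROPERTIES route: the spec is built with the Iceberg API and serialized to JSON with PartitionSpecParser. The schema and transforms below are made-up examples, and the exact table property key that would carry the JSON is not shown here:

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.PartitionSpecParser;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class PartitionSpecJsonSketch {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.optional(1, "event_ts", Types.TimestampType.withZone()),
        Types.NestedField.optional(2, "customer_id", Types.LongType.get()));

    // Unlike PARTITIONED BY, this exposes the full Iceberg feature set,
    // e.g. hidden day and bucket transforms instead of identity only
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .day("event_ts")
        .bucket("customer_id", 16)
        .build();

    // This JSON string is what a TBLPROPERTIES entry would carry
    System.out.println(PartitionSpecParser.toJson(spec));
  }
}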

@rdblue (Contributor) commented Oct 23, 2020

I'm not quite convinced that not supporting the Hive PARTITIONED BY clause is the right way to go, but I think it is a reasonable step to get this patch done. We don't need to support it to support the Schema DDL, so it would be fine with me to throw an exception and reject its use for now.

In the long term, I think we do want Iceberg partitioning to be exposed in the normal way for Hive because it would be confusing for a partitioned Iceberg table to show up as unpartitioned. That said, there are significant differences between the two partitioning approaches:

  1. Iceberg partitioning never changes the table schema, but Hive partition columns are always appended at the end of the schema
  2. Hive partition columns can't be changed
  3. Iceberg supports hidden partitions that can't be shown in Hive

The differences may be significant enough that it would cause problems to expose even Iceberg identity partitions to Hive. For example, if Hive expects to get a partition key and fill in data values, then that would be a problem.

What are the chances of integrating Iceberg into Hive itself and solving some of these limitations?

@qphien (Contributor) commented Oct 30, 2020

hive (iceberg)> desc decimal_table;
OK
col_name	data_type	comment
val             decimal(7,2)    from deserializer

hive (iceberg)> select * from decimal_table where val > 100.1;
OK
decimal_table.val
Failed with exception java.io.IOException:org.apache.iceberg.exceptions.ValidationException: Invalid value for conversion to type decimal(7, 2): 100.1 (java.math.BigDecimal)

With this PR, an exception is thrown when the scale specified in the filter differs from the scale of the Iceberg type.

hive (iceberg)> select * from decimal_table where val > cast(100.1 as decimal(7,2));
OK
decimal_table.val
Failed with exception java.io.IOException:org.apache.iceberg.exceptions.ValidationException: Invalid value for conversion to type decimal(7, 2): 100.1 (java.math.BigDecimal)

Even if 100.1 is cast to decimal(7,2), Hive still throws the same exception.
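
Assuming the error comes from Iceberg's literal conversion (which matches the message above), the root cause can be reproduced without Hive: Iceberg only converts a BigDecimal to a decimal type when the scales match exactly, and 100.1 has scale 1 while decimal(7,2) requires scale 2:

import java.math.BigDecimal;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

public class DecimalScaleSketch {
  public static void main(String[] args) {
    // 100.1 has scale 1, but the column type decimal(7,2) has scale 2,
    // so the conversion returns null; predicate binding then reports the
    // "Invalid value for conversion" ValidationException seen above
    System.out.println(Literal.of(new BigDecimal("100.1"))
        .to(Types.DecimalType.of(7, 2)));  // null

    // With a matching scale the conversion succeeds
    System.out.println(Literal.of(new BigDecimal("100.10"))
        .to(Types.DecimalType.of(7, 2)));  // 100.10
  }
}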

@rdblue (Contributor) commented Oct 30, 2020

Thanks for letting us know, @qphien. Would you like to create an issue for that problem? It doesn't look related to this PR.

@qphien (Contributor) commented Nov 2, 2020

OK, I have created another issue: #1699

@pvary (Contributor, Author) commented Nov 2, 2020

> I'm not quite convinced that not supporting the Hive PARTITIONED BY clause is the right way to go, but I think it is a reasonable step to get this patch done. We don't need to support it to support the Schema DDL, so it would be fine with me to throw an exception and reject its use for now.

Updated the patch to throw an exception if PARTITIONED BY is used.

> What are the chances of integrating Iceberg into Hive itself and solving some of these limitations?

If we want to change the Hive syntax, we would need Hive to depend on Iceberg.

Iceberg currently has two dependencies on Hive:

  • the HMS API used to store the snapshot
  • the SerDe implementation

That would mean a cyclic dependency, which could be problematic.
(Edit) Maybe moving the SerDe implementation to the Hive repo could narrow Iceberg's dependency on Hive enough to avoid serious headaches.

@pvary (Contributor, Author) commented Nov 12, 2020

@rdblue: The tests added by @lcspinter in #1740 highlighted issues with the timestamp handling.
I ended up with the following solution:

  • Hive 2: there is only TIMESTAMP in Hive, so both Iceberg timestamp types (with and without timezone) are converted to Hive TIMESTAMP
  • Hive 3: there are TIMESTAMP and TIMESTAMP WITH LOCAL TIMEZONE, so an Iceberg timestamp without timezone is converted to Hive TIMESTAMP, and an Iceberg timestamp with timezone is converted to TIMESTAMP WITH LOCAL TIMEZONE (see the sketch after this list)
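
A minimal sketch of that mapping (the helper method and the returned Hive type names are illustrative, not the PR's actual code):

import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class TimestampMappingSketch {
  // Maps an Iceberg timestamp type to a Hive type name, following the rules above
  static String hiveTypeName(Type icebergType, boolean hive3) {
    if (icebergType instanceof Types.TimestampType) {
      boolean withZone = ((Types.TimestampType) icebergType).shouldAdjustToUTC();
      if (withZone && hive3) {
        return "timestamp with local time zone";  // only available in Hive 3
      }
      return "timestamp";  // Hive 2 target for both variants; Hive 3 for no-TZ
    }
    throw new IllegalArgumentException("Not a timestamp type: " + icebergType);
  }

  public static void main(String[] args) {
    System.out.println(hiveTypeName(Types.TimestampType.withZone(), false));   // timestamp
    System.out.println(hiveTypeName(Types.TimestampType.withZone(), true));    // timestamp with local time zone
    System.out.println(hiveTypeName(Types.TimestampType.withoutZone(), true)); // timestamp
  }
}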

I was thinking about separating the schema creation and the timestamp fix (they are only related through the additional check I added for table creation), but my first attempt ended up in a mess of half-fixed tests.
Could you review the patch as it is, or should I spend more cycles separating the two?

Thanks,
Peter

@marton-bod (Collaborator) commented:

+1

@shardulm94 (Contributor) left a comment:

Thanks @pvary for your patience through this. We should be very close to getting this in!

  tableSchema = Catalogs.loadTable(configuration, serDeProperties).schema();
} catch (NoSuchTableException nte) {
  throw new SerDeException("Please provide an existing table or a valid schema", nte);
}
if (Catalogs.hiveCatalog(configuration)) {
A reviewer (Contributor) commented:

What is the reason behind handling HiveCatalogs separately?

@pvary (Contributor, Author) replied:

HiveTableOperations converts the table schema to Hive columns / StorageDescriptor whenever a change is committed to the table, which means the Iceberg schema and the Hive schema are always synchronized.
Given that synchronization, I think it is better to use the "cached" schema instead of loading the table again and again. This might change when we clean up timestamps / UUIDs, since the mapping is not one-to-one there, but I would leave something for that new PR too 😄
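
A rough sketch of the trade-off being described, assuming the commit path serializes the schema into the SerDe properties; the InputFormatConfig.TABLE_SCHEMA property key is my assumption, not necessarily what this PR uses:

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.mr.Catalogs;
import org.apache.iceberg.mr.InputFormatConfig;

public class CachedSchemaSketch {
  // Prefer the schema already pushed into the SerDe properties ("cached"),
  // and only fall back to loading the table, which may hit the catalog on
  // every SerDe initialization
  static Schema resolveSchema(Configuration configuration, Properties serDeProperties) {
    String schemaJson = serDeProperties.getProperty(InputFormatConfig.TABLE_SCHEMA);
    if (schemaJson != null) {
      return SchemaParser.fromJson(schemaJson);  // no catalog round-trip
    }
    return Catalogs.loadTable(configuration, serDeProperties).schema();
  }
}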

@pvary (Contributor, Author) replied:

I thought better of it.
Let's keep it simple and general, and only branch if there is a clear benefit in the final solution.

@shardulm94 (Contributor) left a comment:

Thanks @pvary for working on this. Looks good to me! I understand that some places where we determine schema may be suboptimal given the lifecycle of Hive SerDe objects, but I think this code is much simpler to reason about. We can probably do future updates to make the code more efficient, but we can also take this opportunity to see if Hive's lifecycle for SerDe objects can be improved.

case VARCHAR:
  throw new IllegalArgumentException("Unsupported Hive type (" +
      ((PrimitiveTypeInfo) typeInfo).getPrimitiveCategory() +
      ") for Iceberg tables. Consider using STRING type instead.");
A reviewer (Contributor) commented:

I would be fine mapping these to string.

@pvary (Contributor, Author) replied:

Let's discuss this on the dev list, where we are talking about the schema mapping.
If we decide that the Iceberg schema is the master and we always convert from it to the Hive schema, then we can relax the one-to-one mapping restriction and convert multiple Iceberg types to a single Hive type.

@rdblue merged commit e69e521 into apache:master on Nov 26, 2020
@rdblue (Contributor) commented Nov 26, 2020

Thanks for all your work on this, @pvary! I'll merge it.

Thanks for reviewing, @shardulm94!

@pvary deleted the hiveschema branch on November 26, 2020 at 07:50
@pvary (Contributor, Author) commented Nov 26, 2020

Big thanks @rdblue and @shardulm94 for following this through!

anuragmantri added a commit to anuragmantri/iceberg that referenced this pull request Jul 25, 2025