Add HiveCatalog implementation #240
Conversation
```diff
       null,
-      ICEBERG_TABLE_TYPE_VALUE);
+      TableType.EXTERNAL_TABLE.toString());
+  tbl.getParameters().put("EXTERNAL", "TRUE"); // using the external table type also requires this
```
@aokolnychyi, this is a new fix to make tables external so that Iceberg manages the underlying location instead of Hive. In the previous version of this PR, using a managed table caused the table location to be moved during rename, which would break Iceberg metadata. You probably want to make sure you're using this change.
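For readers following along, here is a minimal sketch of what the external-table setup amounts to when building the HMS `Table` object; the class and helper names are illustrative, not the PR's exact code:

```java
import java.util.HashMap;

import org.apache.hadoop.hive.metastore.TableType;
import org.apache.hadoop.hive.metastore.api.Table;

class ExternalTableSketch {
  // Illustrative helper: mark an HMS table as external so the metastore does
  // not own its location (no move on rename, no delete on drop).
  static Table asExternal(Table tbl) {
    tbl.setTableType(TableType.EXTERNAL_TABLE.toString());
    if (tbl.getParameters() == null) {
      tbl.setParameters(new HashMap<>());
    }
    // Using the external table type also requires this parameter.
    tbl.getParameters().put("EXTERNAL", "TRUE");
    return tbl;
  }
}
```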
@rdblue yeah, we had this change internally already to be compatible with the Spark external catalog in 2.4. Thanks!
This means the metastore won't delete the location directory when the table is dropped. In general, this seems like the right behavior, since Iceberg data can be anywhere. But for users that are only going through this interface, how is data cleaned up? Because tables are created in the default Hive location based on the name, what happens if we do a drop followed by a create (since the old data still exists)?
New table data and old table data can be mixed together without corrupting the table state. The problem is that this would leave data lying around. We could add a purge option to drop the data when the table is dropped.
Who is responsible for cleaning up data in the Iceberg model? Is that out of scope for the project (assumed that a complementary system handles it)?
I think it should be up to the caller whether Iceberg cleans it up or leaves it for a complementary system. We have a janitor process that cleans it up, but we don't want to assume that users will. I'll add this in a follow-up issue.
Just to clarify: will the purge flag indicate whether to delete the location recorded in HMS, or will it look at where the data and metadata actually are (e.g. write.folder-storage.path and write.metadata.path, which can change over time)?
Purge would basically do the same thing as a snapshot expiration and would remove the entire tree of metadata and data files. I don't think it should delete directories because Iceberg can't necessarily assume that it owns the entire directory structure; multiple Iceberg tables can actually share the same storage locations.
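A conceptual sketch of that approach, using hypothetical interfaces rather than Iceberg's actual API: a purge walks the table's metadata tree and deletes individual files, never directories.

```java
import java.util.function.Consumer;

// Hypothetical shapes, purely to illustrate the file-by-file approach;
// this is not Iceberg's API.
class PurgeSketch {
  interface MetadataTree {
    Iterable<String> allDataFiles();      // data files referenced by any snapshot
    Iterable<String> allManifestFiles();  // manifests and manifest lists
    Iterable<String> allMetadataFiles();  // table metadata JSON files
  }

  // Delete every file reachable from table metadata, one by one, and never
  // remove directories, since a location may be shared with other tables.
  static void purge(MetadataTree tree, Consumer<String> deleteFile) {
    tree.allDataFiles().forEach(deleteFile);
    tree.allManifestFiles().forEach(deleteFile);
    tree.allMetadataFiles().forEach(deleteFile);
  }
}
```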
```diff
   HiveClientPool(int poolSize, Configuration conf) {
-    super(poolSize, TException.class);
+    super(poolSize, TTransportException.class);
```
@aokolnychyi, this was incorrect and caused the clients to reconnect on every TException, including the exceptions thrown when tables already exist or do not exist (in the existence check). You will probably want to use this change to avoid over-reconnecting.
In addition, I think it is correct to close the client before reconnecting: while this bug was present, I was still seeing connection exhaustion in the test metastore. Closing before reconnecting fixed the problem, and using TTransportException also fixed it independently.
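Roughly, the reconnect policy being described looks like this; the class, the `Action` shape, and the `reconnect` helper are a simplified illustration, not the PR's exact code:

```java
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.thrift.TException;
import org.apache.thrift.transport.TTransportException;

class ReconnectSketch {
  @FunctionalInterface
  interface Action<R> {
    R run(HiveMetaStoreClient client) throws TException;
  }

  private HiveMetaStoreClient client;  // assume the pool hands this out

  <R> R run(Action<R> action) throws TException {
    // Non-transport TExceptions (NoSuchObjectException, AlreadyExistsException, ...)
    // propagate to the caller without tearing down the connection.
    try {
      return action.run(client);
    } catch (TTransportException e) {
      // Connection-level failure: close the broken client before reconnecting,
      // so pooled connections are not leaked, then retry once.
      client.close();
      client = reconnect();
      return action.run(client);
    }
  }

  private HiveMetaStoreClient reconnect() throws TException {
    // Hypothetical: open a fresh metastore connection here.
    throw new UnsupportedOperationException("illustrative only");
  }
}
```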
+1, thanks! We will pull this in.
Shall we use …?

@aokolnychyi, we should eventually. This and #239 are mostly independent changes, so let's make that change in the PR that gets merged last.
LGTM

Minor comment. Other than that, +1.
```java
    try {
      clients.run(client -> {
        client.dropTable(database, identifier.name());
```
This delegates to the longer variant with deleteData=true. We might want to call the other version explicitly, if that is not the intended behavior. Also, we should consider how this affects tables created earlier that had a table type of ICEBERG rather than EXTERNAL_TABLE.
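If the intent is for the metastore never to delete data, one option is a sketch like the following, assuming Hive's four-argument `dropTable(db, table, deleteData, ignoreUnknownTab)` overload; the wrapper class and method name are illustrative:

```java
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.thrift.TException;

class DropSketch {
  // Illustrative: drop only the metastore entry and leave the data alone,
  // regardless of whether the table is registered as external or managed.
  static void dropMetadataOnly(HiveMetaStoreClient client, String database, String table)
      throws TException {
    client.dropTable(database, table,
        false /* deleteData */,
        false /* ignoreUnknownTab: fail if the table does not exist */);
  }
}
```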
This doesn't actually drop the data because it is an external table. I've added a test to validate this behavior. We can add a purge option in a follow-up.
My point was that this is confusing to readers, since we are asking the metastore to delete the data, but we know it won’t be deleted since it’s an external table.
And because the delete flag is set, existing Iceberg tables from before this change may have their data deleted, since they are not external tables. (I haven't checked how the metastore treats custom table types; it may treat them the same as external tables, in which case this is OK.)
I'll clean this up in a follow-up that addresses deleting data directly. I agree with you that we should consider tables that aren't external here. If there is a flag we can set for the expected behavior, then we should set it.
@rdblue What about cases when we drop tables from the catalog but keep the metadata? Would it make sense to have a way to register them back by giving a location? It won't be straightforward, though: the location can be shared by multiple Iceberg tables, and we also add a UUID to metadata file names in …

I agree with the idea to add a UUID to table metadata. I've been meaning to do that. Are you suggesting that we purge data but not metadata?
This is based on Parth's PR #187 and implements the review comments.
This includes an implementation of the Catalog API for Hive. It also has some minor updates to the Catalog API and fixes some problems with Hive connections found in testing. Now all Hive tests successfully run using a single shared pool of 2 connections.
Closes #187.