
Conversation

@aokolnychyi
Contributor

This PR allows us to register existing tables in Iceberg HiveCatalog and resolves #251.

@aokolnychyi
Contributor Author

@rdblue this is what I meant in #240.

anotherTable.newAppend().appendFile(anotherFile).commit();

// verify that both tables continue to function independently
Assert.assertNotEquals(table.currentSnapshot().manifests(), anotherTable.currentSnapshot().manifests());
Contributor Author


Note that the registered table won't have previous_metadata_location in its table properties. This doesn't seem like an issue to me.

/**
 * Register a table from an existing metadata file.
 *
 * @param identifier a table identifier
 * @param metadataFileLocation the location of a metadata file
 * @return a Table instance
 */
Table registerTable(TableIdentifier identifier, String metadataFileLocation);
Contributor Author


Maybe import would be a better name.
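For context, here is a caller-side sketch of the contract the proposed method implies: registering records an existing metadata file under an identifier without rewriting it, and rejects duplicates. ToyCatalog and everything in it are invented for this sketch, not the actual Iceberg HiveCatalog implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Purely illustrative in-memory stand-in for a catalog; ToyCatalog and its
// methods are assumptions for this sketch, not Iceberg API.
class ToyCatalog {
  private final Map<String, String> metadataLocations = new HashMap<>();

  // models registerTable: record an existing metadata file, don't rewrite it
  String registerTable(String identifier, String metadataFileLocation) {
    if (metadataLocations.containsKey(identifier)) {
      throw new IllegalStateException("Table already exists: " + identifier);
    }
    metadataLocations.put(identifier, metadataFileLocation);
    return metadataFileLocation;
  }

  String metadataLocation(String identifier) {
    return metadataLocations.get(identifier);
  }
}

public class RegisterTableSketch {
  public static void main(String[] args) {
    ToyCatalog catalog = new ToyCatalog();
    catalog.registerTable("db.events", "/warehouse/db/events/metadata/v3.metadata.json");
    System.out.println(catalog.metadataLocation("db.events"));
  }
}
```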

@aokolnychyi aokolnychyi changed the title [WIP] Register existing tables in Iceberg HiveCatalog Register existing tables in Iceberg HiveCatalog Jul 8, 2019
@rdblue
Contributor

rdblue commented Jul 12, 2019

@aokolnychyi, this looks ready. But before we commit it, I want to suggest an alternative for you to consider. What about, instead of adding the registerTable method, adding a way to set an existing table's metadata location directly? That would avoid extra public methods in the Catalog interface. It would also provide a way to roll back the entire table state, not just the current snapshot.

This would primarily be an administrator change, so we could expose it through a method in TableOperations, like setMetadataLocation. What do you think? This may be difficult with the new table UUID...
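To make the tradeoff concrete, a minimal model of the two commit paths being discussed. ToyTableOperations, commit, and setMetadataLocation are invented names that model only the semantics: a normal commit is a compare-and-swap against the current metadata pointer, while the proposed setMetadataLocation would replace the pointer unconditionally, which is what enables rolling back the whole table state.

```java
import java.util.Objects;

// Illustrative model of the semantics under discussion; not Iceberg code.
class ToyTableOperations {
  private String currentMetadataLocation;

  // normal commit path: compare-and-swap against the metadata the caller saw
  void commit(String base, String newLocation) {
    if (!Objects.equals(base, currentMetadataLocation)) {
      throw new IllegalStateException("Stale base metadata; refresh and retry");
    }
    currentMetadataLocation = newLocation;
  }

  // the proposed admin path: point the table at any metadata file directly
  void setMetadataLocation(String location) {
    currentMetadataLocation = location;
  }

  String current() {
    return currentMetadataLocation;
  }
}

public class SetMetadataLocationSketch {
  public static void main(String[] args) {
    ToyTableOperations ops = new ToyTableOperations();
    ops.commit(null, "v1.metadata.json");        // create: base is null
    ops.commit("v1.metadata.json", "v2.metadata.json");
    ops.setMetadataLocation("v1.metadata.json"); // roll back the entire state
    System.out.println(ops.current());           // prints v1.metadata.json
  }
}
```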

@aokolnychyi
Contributor Author

I also thought about registering tables by pointing to a metadata location rather than a specific metadata file. However, my assumption was that a location can be shared by multiple Iceberg tables (which is possible because metadata version file names contain random UUIDs). So, if the location is shared, we need to somehow determine the latest metadata file for a particular table (which can be done using the table UUID we added recently, but someone still has to find that UUID).

As always, I am open to considering any alternatives. To avoid changes to the Catalog API, we can require users interested in this to interact with HiveTableOperations directly, meaning they have to create a separate pool of HMS clients and pass a Hadoop config. I am not sure that's a good idea, though.

@rdblue
Contributor

rdblue commented Jul 15, 2019

I also thought about registering tables by giving a pointer to the metadata location and not a specific metadata file.

That isn't quite what I'm suggesting. I'm suggesting that to do the same thing registerTable does, you create a table and then set its metadata location directly. I've been doing this to troubleshoot: create a new table, then copy another table's metadata file over its current metadata location. Having a way to simply replace the metadata location of a table would support the use case you have here, and would also allow rolling back to a previous metadata file for the same table.

I think the UUID would be fine. You'd just have to reload the table from the Catalog instead of calling refresh because refresh would detect the UUID change and throw an exception.
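A toy illustration of the refresh-versus-reload point above. The UUID check here is a simplified stand-in for the validation being described; ToyTable and its methods are invented for the sketch.

```java
// Simplified stand-in for the UUID validation described above; names invented.
class ToyTable {
  private final String uuid;

  ToyTable(String uuid) {
    this.uuid = uuid;
  }

  // refresh validates that the metadata on disk still belongs to this table
  void refresh(String uuidInLatestMetadata) {
    if (!uuid.equals(uuidInLatestMetadata)) {
      throw new IllegalStateException("Table UUID does not match");
    }
  }
}

public class RefreshVsReloadSketch {
  public static void main(String[] args) {
    ToyTable table = new ToyTable("uuid-A");
    // after swapping in another table's metadata, the stored UUID changes
    try {
      table.refresh("uuid-B");                  // refresh: detects the change
    } catch (IllegalStateException e) {
      System.out.println("refresh failed: " + e.getMessage());
    }
    ToyTable reloaded = new ToyTable("uuid-B"); // reload from the catalog: fine
    reloaded.refresh("uuid-B");
  }
}
```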

@aokolnychyi
Contributor Author

I think I am getting it now but there are a couple of questions I want to clarify before updating the PR.

Of course, having a way to roll back the entire table state and minimizing changes to the Catalog API are reasonable benefits. The drawback is that we will have to create a separate instance of HiveTableOperations with its own HMS client pool.

I am not sure we need to create a table and then replace its metadata simply to register a table from a metadata file. If we do so, someone might actually query the table before we swap the metadata location.

This snippet creates a table and sets the location correctly from the beginning:

TableOperations ops = newTableOps(identifier);
HadoopInputFile metadataFile = HadoopInputFile.fromLocation(metadataFileLocation, conf);
TableMetadata metadata = TableMetadataParser.read(ops, metadataFile);
ops.commit(null, metadata);

In theory, this can be executed directly without the catalog and will be sufficient to create a table from an existing metadata file. As suggested, we can extend TableOperations with setMetadataLocation, which will be used for rolling back the entire table state for tables already present in the catalog.

@rdblue
Contributor

rdblue commented Jul 24, 2019

Of course, having a way to roll back the entire table state and minimizing changes to the Catalog API are reasonable benefits. The drawback is that we will have to create a separate instance of HiveTableOperations with its own HMS client pool.

Why is this? Wouldn't the table use the client pool from the Catalog instance? I think TableOperations is exposed from BaseTable, right? So as long as you are using the HiveCatalog, you'd get a table with an accessible operations object.

To avoid the issue of a table being available before the metadata is replaced, we should add the create/replace transaction methods to Catalog. We'll need those for atomic operations anyway.
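A minimal model of why a replace transaction closes that visibility window: staged metadata is invisible to readers until commitTransaction performs a single swap. ToyTransactionalCatalog, Transaction, and newReplaceTableTransaction are invented names for this sketch, not the Catalog API being proposed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the atomicity argument; names invented for the sketch.
class ToyTransactionalCatalog {
  private final Map<String, String> visible = new HashMap<>();

  class Transaction {
    private final String identifier;
    private String stagedMetadataLocation;

    Transaction(String identifier) {
      this.identifier = identifier;
    }

    void setMetadataLocation(String location) {
      stagedMetadataLocation = location; // staged only; readers can't see it
    }

    void commitTransaction() {
      visible.put(identifier, stagedMetadataLocation); // single atomic swap
    }
  }

  Transaction newReplaceTableTransaction(String identifier) {
    return new Transaction(identifier);
  }

  String metadataLocation(String identifier) {
    return visible.get(identifier);
  }
}

public class TransactionSketch {
  public static void main(String[] args) {
    ToyTransactionalCatalog catalog = new ToyTransactionalCatalog();
    ToyTransactionalCatalog.Transaction txn = catalog.newReplaceTableTransaction("db.t");
    txn.setMetadataLocation("v5.metadata.json");
    System.out.println(catalog.metadataLocation("db.t")); // null: not visible yet
    txn.commitTransaction();
    System.out.println(catalog.metadataLocation("db.t")); // v5.metadata.json
  }
}
```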

@aokolnychyi
Contributor Author

@rdblue You are right, we can fetch TableOperations via BaseTable. Create/replace table transactions will solve the second issue as well.

Then table state can be rolled back using these lines:

TableOperations ops = ((BaseTable) table).operations();
ops.commit(ops.current(), TableMetadataParser.read(ops, metadataFile));

Having said that, I think we can close this PR. What do you think?

@rdblue
Contributor

rdblue commented Aug 4, 2019

@aokolnychyi, sounds good to me if that works for you! I didn't think about using TableMetadataParser to do this. That looks better than what I've been doing.

@aokolnychyi aokolnychyi closed this Aug 4, 2019