
Conversation

@aokolnychyi
Contributor

This PR allows us to register existing tables in Iceberg HiveCatalog and resolves #251.

@aokolnychyi
Contributor Author

@rdblue this is what I meant in #240.

anotherTable.newAppend().appendFile(anotherFile).commit();

// verify that both tables continue to function independently
Assert.assertNotEquals(table.currentSnapshot().manifests(), anotherTable.currentSnapshot().manifests());
Contributor Author


Note that the registered table won't have previous_metadata_location in its table properties. This doesn't seem like an issue to me.

/**
 * Register a table from an existing metadata file.
 *
 * @param identifier a table identifier
 * @param metadataFileLocation the location of a metadata file
 * @return a Table instance
 */
Table registerTable(TableIdentifier identifier, String metadataFileLocation);
Contributor Author


Maybe import would be a better name.
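For context, here is a caller-side sketch of the contract the proposed method implies: registering records an existing metadata file under an identifier without rewriting it, and rejects duplicates. ToyCatalog and everything in it are invented for this sketch, not the actual Iceberg HiveCatalog implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Purely illustrative in-memory stand-in for a catalog; ToyCatalog and its
// methods are assumptions for this sketch, not Iceberg API.
class ToyCatalog {
  private final Map<String, String> metadataLocations = new HashMap<>();

  // models registerTable: record an existing metadata file, don't rewrite it
  String registerTable(String identifier, String metadataFileLocation) {
    if (metadataLocations.containsKey(identifier)) {
      throw new IllegalStateException("Table already exists: " + identifier);
    }
    metadataLocations.put(identifier, metadataFileLocation);
    return metadataFileLocation;
  }

  String metadataLocation(String identifier) {
    return metadataLocations.get(identifier);
  }
}

public class RegisterTableSketch {
  public static void main(String[] args) {
    ToyCatalog catalog = new ToyCatalog();
    catalog.registerTable("db.events", "/warehouse/db/events/metadata/v3.metadata.json");
    System.out.println(catalog.metadataLocation("db.events"));
  }
}
```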

@aokolnychyi aokolnychyi changed the title [WIP] Register existing tables in Iceberg HiveCatalog Register existing tables in Iceberg HiveCatalog Jul 8, 2019
@rdblue
Contributor

rdblue commented Jul 12, 2019

@aokolnychyi, this looks ready. But before we commit it, I want to suggest an alternative for you to consider. What about, instead of adding the registerTable method, adding a way to set an existing table's metadata location directly? That would avoid extra public methods in the Catalog interface. It would also provide a way to roll back the entire table state, not just the current snapshot.

This would primarily be an administrator change, so we could expose it through a method in TableOperations, like setMetadataLocation. What do you think? This may be difficult with the new table UUID...
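To make the tradeoff concrete, a minimal model of the two commit paths being discussed. ToyTableOperations, commit, and setMetadataLocation are invented names that model only the semantics: a normal commit is a compare-and-swap against the current metadata pointer, while the proposed setMetadataLocation would replace the pointer unconditionally, which is what enables rolling back the whole table state.

```java
import java.util.Objects;

// Illustrative model of the semantics under discussion; not Iceberg code.
class ToyTableOperations {
  private String currentMetadataLocation;

  // normal commit path: compare-and-swap against the metadata the caller saw
  void commit(String base, String newLocation) {
    if (!Objects.equals(base, currentMetadataLocation)) {
      throw new IllegalStateException("Stale base metadata; refresh and retry");
    }
    currentMetadataLocation = newLocation;
  }

  // the proposed admin path: point the table at any metadata file directly
  void setMetadataLocation(String location) {
    currentMetadataLocation = location;
  }

  String current() {
    return currentMetadataLocation;
  }
}

public class SetMetadataLocationSketch {
  public static void main(String[] args) {
    ToyTableOperations ops = new ToyTableOperations();
    ops.commit(null, "v1.metadata.json");        // create: base is null
    ops.commit("v1.metadata.json", "v2.metadata.json");
    ops.setMetadataLocation("v1.metadata.json"); // roll back the entire state
    System.out.println(ops.current());           // prints v1.metadata.json
  }
}
```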

@aokolnychyi
Contributor Author

I also thought about registering tables by pointing to a metadata location rather than a specific metadata file. However, my assumption was that a location can be shared by multiple Iceberg tables (which is possible because metadata version file names contain random UUIDs). So, if the location is shared, we need to somehow determine the latest metadata file for a particular table (which can be done using the table UUID we added recently, but someone still has to find that UUID).

As always, I am open to considering any alternatives. To avoid changes to the Catalog API, we can require users interested in this to interact with HiveTableOperations directly, meaning they have to create a separate pool of HMS clients and pass a Hadoop config. I am not sure that's a good idea, though.

@rdblue
Contributor

rdblue commented Jul 15, 2019

I also thought about registering tables by giving a pointer to the metadata location and not a specific metadata file.

That isn't quite what I'm suggesting. I'm suggesting that to do the same thing registerTable does, you create a table and then set its metadata location directly. I've been doing this to troubleshoot: create a new table, then copy another table's metadata file over its current metadata location. Having a way to simply replace the metadata location of a table would support the use case you have here, and would also allow rolling back to a previous metadata file for the same table.

I think the UUID would be fine. You'd just have to reload the table from the Catalog instead of calling refresh because refresh would detect the UUID change and throw an exception.
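A toy illustration of the refresh-versus-reload point above. The UUID check here is a simplified stand-in for the validation being described; ToyTable and its methods are invented for the sketch.

```java
// Simplified stand-in for the UUID validation described above; names invented.
class ToyTable {
  private final String uuid;

  ToyTable(String uuid) {
    this.uuid = uuid;
  }

  // refresh validates that the metadata on disk still belongs to this table
  void refresh(String uuidInLatestMetadata) {
    if (!uuid.equals(uuidInLatestMetadata)) {
      throw new IllegalStateException("Table UUID does not match");
    }
  }
}

public class RefreshVsReloadSketch {
  public static void main(String[] args) {
    ToyTable table = new ToyTable("uuid-A");
    // after swapping in another table's metadata, the stored UUID changes
    try {
      table.refresh("uuid-B");                  // refresh: detects the change
    } catch (IllegalStateException e) {
      System.out.println("refresh failed: " + e.getMessage());
    }
    ToyTable reloaded = new ToyTable("uuid-B"); // reload from the catalog: fine
    reloaded.refresh("uuid-B");
  }
}
```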

@aokolnychyi
Contributor Author

I think I am getting it now but there are a couple of questions I want to clarify before updating the PR.

Of course, having a way to roll back the entire table state and minimizing changes to the Catalog API are reasonable benefits. The drawback is that we will have to create a separate instance of HiveTableOperations with its own HMS client pool.

I am not sure we need to create a table and then replace its metadata simply to register a table from a metadata file. If we do so, someone might actually query the table before we swap the metadata location.

This snippet creates a table and sets the location correctly from the beginning:

TableOperations ops = newTableOps(identifier);
HadoopInputFile metadataFile = HadoopInputFile.fromLocation(metadataFileLocation, conf);
TableMetadata metadata = TableMetadataParser.read(ops, metadataFile);
ops.commit(null, metadata);

In theory, this can be executed directly without the catalog and will be sufficient to create a table from an existing metadata file. As suggested, we can extend TableOperations with setMetadataLocation, which will be used for rolling back the entire table state for tables already present in the catalog.

@rdblue
Contributor

rdblue commented Jul 24, 2019

Of course, having a way to roll back the entire table state and minimizing changes to the Catalog API are reasonable benefits. The drawback is that we will have to create a separate instance of HiveTableOperations with its own HMS client pool.

Why is this? Wouldn't the table use the client pool from the Catalog instance? I think TableOperations is exposed from BaseTable, right? So as long as you are using the HiveCatalog, you'd get a table with an accessible operations object.

To avoid the issue of a table being available before the metadata is replaced, we should add the create/replace transaction methods to Catalog. We'll need those for atomic operations anyway.
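A minimal model of why a replace transaction closes that visibility window: staged metadata is invisible to readers until commitTransaction performs a single swap. ToyTransactionalCatalog, Transaction, and newReplaceTableTransaction are invented names for this sketch, not the Catalog API being proposed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the atomicity argument; names invented for the sketch.
class ToyTransactionalCatalog {
  private final Map<String, String> visible = new HashMap<>();

  class Transaction {
    private final String identifier;
    private String stagedMetadataLocation;

    Transaction(String identifier) {
      this.identifier = identifier;
    }

    void setMetadataLocation(String location) {
      stagedMetadataLocation = location; // staged only; readers can't see it
    }

    void commitTransaction() {
      visible.put(identifier, stagedMetadataLocation); // single atomic swap
    }
  }

  Transaction newReplaceTableTransaction(String identifier) {
    return new Transaction(identifier);
  }

  String metadataLocation(String identifier) {
    return visible.get(identifier);
  }
}

public class TransactionSketch {
  public static void main(String[] args) {
    ToyTransactionalCatalog catalog = new ToyTransactionalCatalog();
    ToyTransactionalCatalog.Transaction txn = catalog.newReplaceTableTransaction("db.t");
    txn.setMetadataLocation("v5.metadata.json");
    System.out.println(catalog.metadataLocation("db.t")); // null: not visible yet
    txn.commitTransaction();
    System.out.println(catalog.metadataLocation("db.t")); // v5.metadata.json
  }
}
```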

@aokolnychyi
Contributor Author

@rdblue You are right, we can fetch TableOperations via BaseTable. Create/replace table transactions will solve the second issue as well.

Then table state can be rolled back using these lines:

TableOperations ops = ((BaseTable) table).operations();
ops.commit(ops.current(), TableMetadataParser.read(ops, metadataFile));

Having said that, I think we can close this PR. What do you think?

@rdblue
Contributor

rdblue commented Aug 4, 2019

@aokolnychyi, sounds good to me if that works for you! I didn't think about using TableMetadataParser to do this. That looks better than what I've been doing.

@aokolnychyi aokolnychyi closed this Aug 4, 2019