Catalog Implementation for hive and hadoop. #187
Conversation
I have deliberately left the Tables interface and the implementation for BaseMetastore tables, as we have not decided what we are going to do about the transaction APIs. We can either take this PR as an opportunity to address that, or move those APIs under BaseMetastoreCatalog and remove the Tables interface and all of its implementations.
default boolean tableExists(TableIdentifier tableIdentifier) {
  try {
    getTable(tableIdentifier);
    return false;
return true; ?
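The reviewer's point can be illustrated with a minimal sketch of the corrected default method. The names below only mirror the snippet under review; `Catalog`, `getTable`, and `NoSuchTableException` here are simplified stand-ins, not the real Iceberg types.

```java
public class TableExistsSketch {
    // Simplified stand-in for org.apache.iceberg.exceptions.NoSuchTableException.
    static class NoSuchTableException extends RuntimeException {}

    interface Catalog {
        Object getTable(String identifier); // throws NoSuchTableException when the table is absent

        // Corrected default implementation: a successful load means the table
        // exists, and only the not-found exception should map to false.
        default boolean tableExists(String identifier) {
            try {
                getTable(identifier);
                return true;
            } catch (NoSuchTableException e) {
                return false;
            }
        }
    }

    // Toy catalog that knows exactly one table, for demonstration.
    static final Catalog CATALOG = id -> {
        if ("db.known".equals(id)) {
            return new Object();
        }
        throw new NoSuchTableException();
    };

    public static void main(String[] args) {
        System.out.println(CATALOG.tableExists("db.known"));   // true
        System.out.println(CATALOG.tableExists("db.missing")); // false
    }
}
```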
@Override
public String toString() {
  // assumes a level it self won't have a period in it, otherwise it will be ambiguous
nit: typo - it self -> itself
default boolean tableExists(TableIdentifier tableIdentifier) {
  try {
    getTable(tableIdentifier);
    return false;
Looks like it should return true here.
 */
boolean tableExists(TableIdentifier tableIdentifier);
default boolean tableExists(TableIdentifier tableIdentifier) {
  try {
The implementation here simplifies the code but also introduces exception handling into ordinary control flow. An alternative is to leave the method abstract and give implementors discretion over how to do it.
I think it is reasonable to do this in a default implementation. You can't expect the default implementation to do everything the best way.
import org.apache.iceberg.exceptions.NoSuchTableException;

/**
 * Top level Catalog APIs that supports table DDLs and namespace listing.
Can you give a brief explanation on the rationale for leaving out the namespace/table listing methods?
@Override
public void dropTable(TableIdentifier tableIdentifier) {
  validateTableIdentifier(tableIdentifier);
  HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
Looks like clients.get() is a more legitimate method to use here? Alternatively, clients.run() can be used here.
That's correct. newClient is protected so that subclasses can provide an implementation, not so that other classes in this package can call it. The correct way to do this is to use run. That handles reconnections and uses the client pool. get is also private.
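The borrow-run-return pattern the reviewer describes can be sketched as follows. Note that `Client`, `ClientPool`, and its `run` method here are simplified assumptions modeled on the discussion, not the actual Iceberg client-pool API.

```java
import java.util.function.Function;

public class ClientPoolSketch {
    // Hypothetical pooled metastore client standing in for HiveMetaStoreClient.
    static class Client {
        boolean dropped = false;

        void dropTable(String db, String table) {
            dropped = true;
        }
    }

    static class ClientPool {
        private final Client pooled = new Client();

        // run() borrows a client from the pool, executes the action, and hands
        // the client back afterwards, so callers cannot leak clients the way a
        // raw newClient() call can. A real pool would also handle reconnects.
        <R> R run(Function<Client, R> action) {
            return action.apply(pooled);
        }
    }

    public static void main(String[] args) {
        ClientPool clients = new ClientPool();
        Boolean ok = clients.run(client -> {
            client.dropTable("hivedb", "tbl");
            return client.dropped;
        });
        System.out.println(ok); // true
    }
}
```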
 * @param location a path URI (e.g. hdfs:///warehouse/my_table)
 * @return newly created table implementation
 * @param tableIdentifier an identifier to identify this table in a namespace.
 * @param schema the schema for this table, can not be null.
Would be good to add a Preconditions.checkNotNull() check to assert this.
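For illustration, a hand-rolled equivalent of Guava's `Preconditions.checkNotNull` (shown without the Guava dependency so the sketch runs standalone; with Guava the call would be `Preconditions.checkNotNull(schema, "schema cannot be null")`):

```java
public class CheckNotNullSketch {
    // Minimal stand-in for com.google.common.base.Preconditions.checkNotNull:
    // throws NullPointerException with the given message when the reference is null.
    static <T> T checkNotNull(T reference, String message) {
        if (reference == null) {
            throw new NullPointerException(message);
        }
        return reference;
    }

    public static void main(String[] args) {
        // "id:long" is a placeholder for the schema argument being validated.
        System.out.println(checkNotNull("id:long", "schema cannot be null"));
        try {
            checkNotNull(null, "schema cannot be null");
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```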
@@ -1,18 +1,3 @@
/*
 * Copyright 2017 Netflix, Inc.
Looks like you missed adding the Apache license header.
public void testCreate() throws TException {
  // Table should be created in hive metastore
  final org.apache.hadoop.hive.metastore.api.Table table = metastoreClient.getTable(DB_NAME, TABLE_NAME);
  varifyTable(TABLE_IDENTIFIER);
nit: typo varifyTable -> verifyTable
 * @param tableIdentifier an identifier to identify this table in a namespace.
 * @return instance of {@link Table} implementation referred by {@code tableIdentifier}
 */
Table getTable(TableIdentifier tableIdentifier);
Should be loadTable because get usually implies that the operation is a quick fetch, like a getter would be on a Java object.
@Override
public String toString() {
  // assumes a level it self won't have a period in it, otherwise it will be ambiguous
  return Arrays.stream(levels).collect(joining("."));
Is it necessary to use the streams API here? What about Joiner.on(".").join(levels)? Seems simpler.
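The alternatives are equivalent; a quick comparison, using hypothetical namespace levels (the JDK's own `String.join` is simpler still and needs no Guava dependency):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class JoinLevelsSketch {
    // Hypothetical namespace levels for demonstration.
    static final String[] LEVELS = {"prod", "db", "events"};

    public static void main(String[] args) {
        // The streams version from the diff:
        String viaStream = Arrays.stream(LEVELS).collect(Collectors.joining("."));
        // Guava would be Joiner.on(".").join(LEVELS); the JDK equivalent:
        String viaJoin = String.join(".", LEVELS);
        System.out.println(viaStream); // prod.db.events
        System.out.println(viaJoin);   // prod.db.events
    }
}
```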
  tableIdentifier.name());
}

protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
I think this should be in Hive, not in the base implementation. The base could be used for metastores that support more nesting than 1 namespace level.
Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
    "schema name as namespace");
String schemaName = tableIdentifier.namespace().levels()[0];
Preconditions.checkArgument(schemaName != null && !schemaName.isEmpty(), "schema name can't be null or " +
Should TableIdentifier and Namespace check that none of the parts are null instead? I think that is going to be easier than validating in lots of other places.
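Constructor-time validation along those lines could look like the following sketch. `NamespaceSketch` is a simplified illustration, not the real Namespace class: rejecting null or empty parts once at construction means callers such as validateTableIdentifier never need to re-check them.

```java
public class NamespaceSketch {
    private final String[] levels;

    // Validate each part once, up front, so every downstream consumer can
    // assume the levels are non-null and non-empty.
    NamespaceSketch(String... levels) {
        for (String level : levels) {
            if (level == null || level.isEmpty()) {
                throw new IllegalArgumentException("namespace levels cannot be null or empty");
            }
        }
        this.levels = levels;
    }

    int length() {
        return levels.length;
    }

    public static void main(String[] args) {
        System.out.println(new NamespaceSketch("hivedb").length()); // 1
        try {
            new NamespaceSketch("hivedb", "");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```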
protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
  Preconditions.checkArgument(tableIdentifier.hasNamespace(), "metastore tables should have schema as namespace");
  Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
At least for Hive, I think that the correct term is "database" and not "schema" for the namespace.
import org.apache.iceberg.exceptions.NoSuchTableException;
import org.apache.iceberg.exceptions.RuntimeIOException;

public class HadoopCatalog implements Catalog, Configurable {
Why does this implement Configurable? Seems strange to me.
 * Top level Catalog APIs that supports table DDLs and namespace listing.
 */
public interface Catalog {
public interface Catalog extends Closeable {
Why make all catalogs Closeable?
static final String DB_NAME = "hivedb";
static final String TABLE_NAME = "tbl";
static final TableIdentifier TABLE_IDENTIFIER =
    new TableIdentifier(Namespace.namespace(new String[] {DB_NAME}), TABLE_NAME);
Looks like namespace should take String....
import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.iceberg.exceptions.NoSuchTableException;

public abstract class BaseMetastoreCatalog implements Catalog {
Should this replace BaseMetastoreTables?
validateTableIdentifier(to);

HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
String location = ((BaseTable) getTable(from)).operations().current().file().location();
I don't think renaming the table should change the location. Data can be stored in the old location and referenced by a new name.
try {
  Table table = hiveMetaStoreClient.getTable(oldDBName, oldTableName);

  // hive metastore renames the table's directory as part of renaming the table.
It does?
tables.close();
this.tables = null;
try {
  metastoreClient.getTable(DB_NAME, TABLE_NAME);
Why does this get the table first?
    .protocolFactory(new TBinaryProtocol.Factory())
    .minWorkerThreads(3)
    .maxWorkerThreads(5);
    .maxWorkerThreads(32);
What was the reason for changing this? Keeping it low helps us catch client leaks, like the one introduced by the newClient call in the Hive catalog.
hiveCatalog.dropTable(TABLE_IDENTIFIER);

metastoreClient.getTable(DB_NAME, TABLE_NAME);
It is usually better to use assertThrows because you can check the error message. Then you wouldn't need to load the table first.
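The assertThrows pattern the reviewer suggests can be sketched as below. The exception type and message are illustrative stand-ins for the Thrift metastore's not-found error, not the real API; with JUnit 5 the check would use `org.junit.jupiter.api.Assertions.assertThrows`.

```java
public class AssertThrowsSketch {
    // Illustrative stand-in for the metastore's not-found exception.
    static class NoSuchObjectException extends RuntimeException {
        NoSuchObjectException(String message) { super(message); }
    }

    // Stand-in for metastoreClient.getTable(...) after the table was dropped.
    static Object getTable(String db, String table) {
        throw new NoSuchObjectException(db + "." + table + " table not found");
    }

    public static void main(String[] args) {
        // With JUnit 5 this would read:
        //   NoSuchObjectException e = assertThrows(NoSuchObjectException.class,
        //       () -> metastoreClient.getTable(DB_NAME, TABLE_NAME));
        //   assertTrue(e.getMessage().contains("table not found"));
        // Hand-rolled equivalent so the sketch runs standalone:
        try {
            getTable("hivedb", "tbl");
            throw new AssertionError("expected NoSuchObjectException");
        } catch (NoSuchObjectException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Checking the message (not just the type) is what removes the need to load the table first: the assertion itself proves the drop took effect.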
private Configuration conf;

public HadoopTables() {
public HadoopCatalog() {
I don't think it makes sense to move the Hadoop implementation to be a Catalog. Identifiers don't handle paths well and there are quite a few changes for not much benefit.
What about just moving Hive to use the Catalog interface? Do we need to move HadoopTables or can we leave it as-is?
Do we see the new
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.catalog;
Should we require an empty line after the header? I think it is true for most of the files. However, we don't enforce it in checkstyle.xml. Shall we add PACKAGE_DEF to EmptyLineSeparator?
I'm fine either way. If we choose to enforce it, we'll have to add it as a separate PR.
import org.apache.iceberg.exceptions.NoSuchTableException;

/**
 * Top level Catalog APIs that supports table DDLs and namespace listing.
Should the comment be "Top-level Catalog API that supports ..." or "Top-level Catalog APIs that support ..."?
I think it should be singular API.
public interface Catalog {
public interface Catalog extends Closeable {
/**
 * creates the table or throws {@link AlreadyExistsException}.
nit: what about using the same formatting for all Javadoc in this file? I see that some comments start with a capital letter and some do not.
public abstract TableOperations newTableOps(Configuration newConf, TableIdentifier tableIdentifier);

protected String defaultWarehouseLocation(Configuration hadoopConf, TableIdentifier tableIdentifier) {
  String warehouseLocation = hadoopConf.get("hive.metastore.warehouse.dir");
Should we rely on a Hive Metastore property to get a table base location? Maybe the Hive specific implementation can set the Hive warehouse dir, though I'm not sure whether we need to provide an implementation here or leave it abstract.
Yeah, good point. I think this should be left abstract and implemented in the Hive module.
try {
  metastoreClient.getTable(DB_NAME, TABLE_NAME);
  metastoreClient.dropTable(DB_NAME, TABLE_NAME);
  this.catalog.close();
Why do we close the Catalog after drop?
I think this was to avoid exhausting handlers in the test metastore. The problem is that this is instantiating new clients instead of using the pool.
Yes, for metastore use cases.
No description provided.