
Conversation

@Parth-Brahmbhatt
Contributor

No description provided.

@Parth-Brahmbhatt
Contributor Author

I have deliberately left the Tables interface and the BaseMetastore tables implementation in place, as we have not decided what we are going to do about the transaction APIs. We can either take this PR as an opportunity to address that, or move those APIs under BaseMetastoreCatalog and remove the Tables interface and all of its implementations.

default boolean tableExists(TableIdentifier tableIdentifier) {
try {
getTable(tableIdentifier);
return false;
Collaborator

return true; ?


@Override
public String toString() {
// assumes a level it self won't have a period in it, otherwise it will be ambiguous
Collaborator

nit: typo - it self -> itself

default boolean tableExists(TableIdentifier tableIdentifier) {
try {
getTable(tableIdentifier);
return false;
Contributor

Looks like it should return a true here.
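The fix both reviewers are pointing at would look roughly like this. A minimal, self-contained sketch: the real interface uses Iceberg's TableIdentifier and Table types, which are replaced by String and Object stand-ins here so the example compiles on its own.

```java
// Sketch of the corrected tableExists default method: return true when the
// lookup succeeds, false when it throws NoSuchTableException.
public class TableExistsSketch {
    static class NoSuchTableException extends RuntimeException {}

    interface Catalog {
        // stand-in for Table getTable(TableIdentifier)
        Object getTable(String tableIdentifier);

        default boolean tableExists(String tableIdentifier) {
            try {
                getTable(tableIdentifier);
                return true;  // the PR returned false here
            } catch (NoSuchTableException e) {
                return false;
            }
        }
    }

    // demo catalog in which only "db.known" exists
    static Catalog demoCatalog() {
        return id -> {
            if (id.equals("db.known")) {
                return new Object();
            }
            throw new NoSuchTableException();
        };
    }

    public static void main(String[] args) {
        System.out.println(demoCatalog().tableExists("db.known"));   // true
        System.out.println(demoCatalog().tableExists("db.missing")); // false
    }
}
```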

*/
boolean tableExists(TableIdentifier tableIdentifier);
default boolean tableExists(TableIdentifier tableIdentifier) {
try {
Contributor

The implementation here simplifies the code but also introduces exception handling into ordinary control flow. An alternative is to leave the method unimplemented and give implementors discretion about how to do it.

Contributor

I think it is reasonable to do this in a default implementation. You can't expect the default implementation to do everything the best way.

import org.apache.iceberg.exceptions.NoSuchTableException;

/**
* Top level Catalog APIs that supports table DDLs and namespace listing.
Contributor

Can you give a brief explanation on the rationale for leaving out the namespace/table listing methods?

@Override
public void dropTable(TableIdentifier tableIdentifier) {
validateTableIdentifier(tableIdentifier);
HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
Contributor

Looks like clients.get() is a more legitimate method to use here? Alternatively, clients.run() can be used here.

Contributor

That's correct. newClient is protected so that subclasses can provide an implementation, not so that other classes in this package can call it. The correct way to do this is to use run. That handles reconnections and uses the client pool. get is also private.
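The pool pattern being described can be sketched as follows. This is a hypothetical stand-in, not Iceberg's exact ClientPool API: the point is that `run` checks the client out, executes the action, and always returns the client, so callers cannot leak connections the way a direct `newClient()` call can.

```java
// Illustrative client-pool sketch: all metastore calls go through run(...),
// which borrows and unconditionally returns the pooled client.
public class ClientPoolSketch {
    interface Action<C, R> {
        R run(C client);
    }

    static class Pool<C> {
        private final C client; // a real pool manages several clients and reconnects
        int borrowed = 0;       // tracks check-outs so leaks would be visible

        Pool(C client) {
            this.client = client;
        }

        <R> R run(Action<C, R> action) {
            borrowed++;
            try {
                return action.run(client);
            } finally {
                borrowed--; // the client is returned even if the action throws
            }
        }
    }

    public static void main(String[] args) {
        Pool<StringBuilder> clients = new Pool<>(new StringBuilder());
        clients.run(client -> client.append("dropTable(hivedb, tbl)"));
        System.out.println(clients.client);   // dropTable(hivedb, tbl)
        System.out.println(clients.borrowed); // 0: nothing leaked
    }
}
```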

* @param location a path URI (e.g. hdfs:///warehouse/my_table)
* @return newly created table implementation
* @param tableIdentifier an identifier to identify this table in a namespace.
* @param schema the schema for this table, can not be null.
Collaborator

Would be good to add a Preconditions.checkNotNull() check to assert this.
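The suggested check might look like the sketch below. The review names Guava's `Preconditions.checkNotNull`; the JDK's `Objects.requireNonNull` behaves the same way and is used here to keep the example dependency-free. The `createTable` signature is simplified.

```java
import java.util.Objects;

// Sketch of asserting the "schema can not be null" contract up front.
public class NullCheckSketch {
    static String createTable(String tableIdentifier, Object schema) {
        Objects.requireNonNull(schema, "schema can not be null");
        // ... create the table ...
        return "created " + tableIdentifier;
    }

    public static void main(String[] args) {
        System.out.println(createTable("hivedb.tbl", new Object())); // created hivedb.tbl
        try {
            createTable("hivedb.tbl", null);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage()); // schema can not be null
        }
    }
}
```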

@@ -1,18 +1,3 @@
/*
* Copyright 2017 Netflix, Inc.
Collaborator

Looks like you missed adding the apache header.

public void testCreate() throws TException {
// Table should be created in hive metastore
final org.apache.hadoop.hive.metastore.api.Table table = metastoreClient.getTable(DB_NAME, TABLE_NAME);
varifyTable(TABLE_IDENTIFIER);
Collaborator

nit: typo varifyTable -> verifyTable

* @param tableIdentifier an identifier to identify this table in a namespace.
* @return instance of {@link Table} implementation referred by {@code tableIdentifier}
*/
Table getTable(TableIdentifier tableIdentifier);
Contributor

Should be loadTable because get usually implies that the operation is a quick fetch, like a getter would be on a Java object.

@Override
public String toString() {
// assumes a level it self won't have a period in it, otherwise it will be ambiguous
return Arrays.stream(levels).collect(joining("."));
Contributor

Is it necessary to use the streams API here? What about Joiners.on(".").join(levels)? Seems simpler.
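The two forms produce the same dotted string. The reviewer's suggestion is Guava's `Joiner.on(".").join(levels)`; the JDK's `String.join` shown below is the dependency-free equivalent.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Streams version from the PR next to the simpler plain-join equivalent.
public class JoinSketch {
    static String viaStream(String[] levels) {
        return Arrays.stream(levels).collect(Collectors.joining("."));
    }

    static String viaJoin(String[] levels) {
        return String.join(".", levels); // same result, no stream machinery
    }

    public static void main(String[] args) {
        String[] levels = {"hivedb", "tbl"};
        System.out.println(viaStream(levels)); // hivedb.tbl
        System.out.println(viaJoin(levels));   // hivedb.tbl
    }
}
```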

tableIdentifier.name());
}

protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
Contributor

I think this should be in Hive, not in the base implementation. The base could be used for metastores that support more nesting than 1 namespace level.

Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
"schema name as namespace");
String schemaName = tableIdentifier.namespace().levels()[0];
Preconditions.checkArgument(schemaName != null && !schemaName.isEmpty(), "schema name can't be null or " +
Contributor

Should TableIdentifier and Namespace check that none of the parts are null instead? I think that is going to be easier than validating in lots of other places.
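Validating once in the constructor, as suggested, would look roughly like this. `Namespace` here is a simplified stand-in for the Iceberg class; the point is that call sites no longer repeat the null/empty checks.

```java
// Sketch of constructor-time validation of namespace levels.
public class NamespaceValidationSketch {
    static class Namespace {
        private final String[] levels;

        Namespace(String... levels) {
            for (String level : levels) {
                if (level == null || level.isEmpty()) {
                    throw new IllegalArgumentException("namespace levels can't be null or empty");
                }
            }
            this.levels = levels;
        }

        int length() {
            return levels.length;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Namespace("hivedb").length()); // 1
        try {
            new Namespace("hivedb", "");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // namespace levels can't be null or empty
        }
    }
}
```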


protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
Preconditions.checkArgument(tableIdentifier.hasNamespace(), "metastore tables should have schema as namespace");
Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
Contributor

At least for Hive, I think that the correct term is "database" and not "schema" for the namespace.

import org.apache.iceberg.exceptions.NoSuchTableException;
import org.apache.iceberg.exceptions.RuntimeIOException;

public class HadoopCatalog implements Catalog, Configurable {
Contributor

Why does this implement Configurable? Seems strange to me.

* Top level Catalog APIs that supports table DDLs and namespace listing.
*/
public interface Catalog {
public interface Catalog extends Closeable {
Contributor

Why make all catalogs Closeable?

static final String DB_NAME = "hivedb";
static final String TABLE_NAME = "tbl";
static final TableIdentifier TABLE_IDENTIFIER =
new TableIdentifier(Namespace.namespace(new String[] {DB_NAME}), TABLE_NAME);
Contributor

Looks like namespace should take String....

import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.iceberg.exceptions.NoSuchTableException;

public abstract class BaseMetastoreCatalog implements Catalog {
Contributor

Should this replace BaseMetastoreTables?

validateTableIdentifier(to);

HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
String location = ((BaseTable) getTable(from)).operations().current().file().location();
Contributor

I don't think renaming the table should change the location. Data can be stored in the old location and referenced by a new name.

try {
Table table = hiveMetaStoreClient.getTable(oldDBName, oldTableName);

// hive metastore renames the table's directory as part of renaming the table.
Contributor

It does?

tables.close();
this.tables = null;
try {
metastoreClient.getTable(DB_NAME, TABLE_NAME);
Contributor

Why does this get the table first?

.protocolFactory(new TBinaryProtocol.Factory())
.minWorkerThreads(3)
.maxWorkerThreads(5);
.maxWorkerThreads(32);
Contributor

What was the reason for changing this? Keeping it low helps us catch client leaks, like the one introduced by the newClient call in the Hive catalog.


hiveCatalog.dropTable(TABLE_IDENTIFIER);

metastoreClient.getTable(DB_NAME, TABLE_NAME);
Contributor

It is usually better to use assertThrows because you can check the error message. Then you wouldn't need to load the table first.
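The pattern the reviewer suggests is JUnit's `Assertions.assertThrows`. A minimal stand-in is included below so the sketch runs without a test framework; the exception type and message are illustrative.

```java
// Sketch of asserting on the thrown exception (and its message) directly,
// instead of loading the table first to prove it existed.
public class AssertThrowsSketch {
    static class NoSuchObjectException extends RuntimeException {
        NoSuchObjectException(String message) {
            super(message);
        }
    }

    static <T extends Throwable> T assertThrows(Class<T> expected, Runnable action) {
        try {
            action.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t); // caller can now check the message
            }
            throw new AssertionError("unexpected exception: " + t);
        }
        throw new AssertionError("expected " + expected.getSimpleName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        // after dropTable, loading the table should fail with a checkable message
        NoSuchObjectException e = assertThrows(NoSuchObjectException.class, () -> {
            throw new NoSuchObjectException("hivedb.tbl table not found");
        });
        System.out.println(e.getMessage()); // hivedb.tbl table not found
    }
}
```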

private Configuration conf;

public HadoopTables() {
public HadoopCatalog() {
Contributor

I don't think it makes sense to move the Hadoop implementation to be a Catalog. Identifiers don't handle paths well and there are quite a few changes for not much benefit.

What about just moving Hive to use the Catalog interface? Do we need to move HadoopTables or can we leave it as-is?

@aokolnychyi
Contributor

Do we see the new Catalog API as a better version of the Tables API that would be used in data sources?

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.catalog;
Contributor

Should we require an empty line after the header? I think it is true for most of the files. However, we don't enforce it in checkstyle.xml. Shall we add PACKAGE_DEF to EmptyLineSeparator?

Contributor

I'm fine either way. If we choose to enforce it, we'll have to add it as a separate PR.

import org.apache.iceberg.exceptions.NoSuchTableException;

/**
* Top level Catalog APIs that supports table DDLs and namespace listing.
Contributor

Should the comment be "Top-level Catalog API that supports ..." or "Top-level Catalog APIs that support ..."?

Contributor

I think it should be singular API.

public interface Catalog {
public interface Catalog extends Closeable {
/**
* creates the table or throws {@link AlreadyExistsException}.
Contributor

nit: what about having the same formatting for Javadoc in this file? I see that some comments start with a capital letter some not.

public abstract TableOperations newTableOps(Configuration newConf, TableIdentifier tableIdentifier);

protected String defaultWarehouseLocation(Configuration hadoopConf, TableIdentifier tableIdentifier) {
String warehouseLocation = hadoopConf.get("hive.metastore.warehouse.dir");
Contributor

Should we rely on a Hive Metastore property to get a table base location? Maybe the Hive specific implementation can set the Hive warehouse dir, though I'm not sure whether we need to provide an implementation here or leave it abstract.

Contributor

Yeah, good point. I think this should be left abstract and implemented in the Hive module.

try {
metastoreClient.getTable(DB_NAME, TABLE_NAME);
metastoreClient.dropTable(DB_NAME, TABLE_NAME);
this.catalog.close();
Contributor

Why do we close the Catalog after drop?

Contributor

I think this was to avoid exhausting handlers in the test metastore. The problem is that this is instantiating new clients instead of using the pool.

@rdblue
Contributor

rdblue commented May 27, 2019

Do we see the new Catalog API as a better version of the Tables API that would be used in data sources?

Yes, for metastore use cases. Tables is still useful for path-based cases, like Hadoop tables and in Pig.

@rdblue rdblue closed this in #240 Jun 29, 2019
InvisibleProgrammer pushed a commit to InvisibleProgrammer/iceberg that referenced this pull request Mar 10, 2023
6 participants