Catalog Implementation for hive and hadoop. #187
Conversation
I have deliberately left the Tables interface and the implementation for BaseMetastore tables, as we have not decided what we are going to do about the transaction APIs. We can either take this PR as an opportunity to address that, or move those APIs under BaseMetastoreCatalog and remove the Tables interface and all of its implementations.
default boolean tableExists(TableIdentifier tableIdentifier) {
  try {
    getTable(tableIdentifier);
    return false;
return true; ?
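The reviewer's point can be illustrated with a minimal sketch of the corrected default method. The names below only mirror the snippet under review; `Catalog`, `getTable`, and `NoSuchTableException` here are simplified stand-ins, not the real Iceberg types.

```java
public class TableExistsSketch {
    // Simplified stand-in for org.apache.iceberg.exceptions.NoSuchTableException.
    static class NoSuchTableException extends RuntimeException {}

    interface Catalog {
        Object getTable(String identifier); // throws NoSuchTableException when the table is absent

        // Corrected default implementation: a successful load means the table
        // exists, and only the not-found exception should map to false.
        default boolean tableExists(String identifier) {
            try {
                getTable(identifier);
                return true;
            } catch (NoSuchTableException e) {
                return false;
            }
        }
    }

    // Toy catalog that knows exactly one table, for demonstration.
    static final Catalog CATALOG = id -> {
        if ("db.known".equals(id)) {
            return new Object();
        }
        throw new NoSuchTableException();
    };

    public static void main(String[] args) {
        System.out.println(CATALOG.tableExists("db.known"));   // true
        System.out.println(CATALOG.tableExists("db.missing")); // false
    }
}
```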
@Override
public String toString() {
  // assumes a level it self won't have a period in it, otherwise it will be ambiguous
nit: typo - it self -> itself
default boolean tableExists(TableIdentifier tableIdentifier) {
  try {
    getTable(tableIdentifier);
    return false;
Looks like it should return true here.
 */
boolean tableExists(TableIdentifier tableIdentifier);
default boolean tableExists(TableIdentifier tableIdentifier) {
  try {
The implementation here simplifies the code but also introduces exception handling into ordinary control flow. An alternative is to leave the method abstract and give implementors discretion over how to do it.
I think it is reasonable to do this in a default implementation. You can't expect the default implementation to do everything the best way.
import org.apache.iceberg.exceptions.NoSuchTableException;

/**
 * Top level Catalog APIs that supports table DDLs and namespace listing.
Can you give a brief explanation on the rationale for leaving out the namespace/table listing methods?
@Override
public void dropTable(TableIdentifier tableIdentifier) {
  validateTableIdentifier(tableIdentifier);
  HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
Looks like clients.get() is a more legitimate method to use here? Alternatively, clients.run() can be used here.
That's correct. newClient is protected so that subclasses can provide an implementation, not so that other classes in this package can call it. The correct way to do this is to use run. That handles reconnections and uses the client pool. get is also private.
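The borrow-run-return pattern the reviewer describes can be sketched as follows. Note that `Client`, `ClientPool`, and its `run` method here are simplified assumptions modeled on the discussion, not the actual Iceberg client-pool API.

```java
import java.util.function.Function;

public class ClientPoolSketch {
    // Hypothetical pooled metastore client standing in for HiveMetaStoreClient.
    static class Client {
        boolean dropped = false;

        void dropTable(String db, String table) {
            dropped = true;
        }
    }

    static class ClientPool {
        private final Client pooled = new Client();

        // run() borrows a client from the pool, executes the action, and hands
        // the client back afterwards, so callers cannot leak clients the way a
        // raw newClient() call can. A real pool would also handle reconnects.
        <R> R run(Function<Client, R> action) {
            return action.apply(pooled);
        }
    }

    public static void main(String[] args) {
        ClientPool clients = new ClientPool();
        Boolean ok = clients.run(client -> {
            client.dropTable("hivedb", "tbl");
            return client.dropped;
        });
        System.out.println(ok); // true
    }
}
```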
 * @param location a path URI (e.g. hdfs:///warehouse/my_table)
 * @return newly created table implementation
 * @param tableIdentifier an identifier to identify this table in a namespace.
 * @param schema the schema for this table, can not be null.
Would be good to add a Preconditions.checkNotNull() check to assert this.
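For illustration, a hand-rolled equivalent of Guava's `Preconditions.checkNotNull` (shown without the Guava dependency so the sketch runs standalone; with Guava the call would be `Preconditions.checkNotNull(schema, "schema cannot be null")`):

```java
public class CheckNotNullSketch {
    // Minimal stand-in for com.google.common.base.Preconditions.checkNotNull:
    // throws NullPointerException with the given message when the reference is null.
    static <T> T checkNotNull(T reference, String message) {
        if (reference == null) {
            throw new NullPointerException(message);
        }
        return reference;
    }

    public static void main(String[] args) {
        // "id:long" is a placeholder for the schema argument being validated.
        System.out.println(checkNotNull("id:long", "schema cannot be null"));
        try {
            checkNotNull(null, "schema cannot be null");
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```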
@@ -1,18 +1,3 @@
/*
 * Copyright 2017 Netflix, Inc.
Looks like you missed adding the Apache license header.
public void testCreate() throws TException {
  // Table should be created in hive metastore
  final org.apache.hadoop.hive.metastore.api.Table table = metastoreClient.getTable(DB_NAME, TABLE_NAME);
  varifyTable(TABLE_IDENTIFIER);
nit: typo varifyTable -> verifyTable
 * @param tableIdentifier an identifier to identify this table in a namespace.
 * @return instance of {@link Table} implementation referred by {@code tableIdentifier}
 */
Table getTable(TableIdentifier tableIdentifier);
Should be loadTable because get usually implies that the operation is a quick fetch, like a getter would be on a Java object.
@Override
public String toString() {
  // assumes a level it self won't have a period in it, otherwise it will be ambiguous
  return Arrays.stream(levels).collect(joining("."));
Is it necessary to use the streams API here? What about Joiner.on(".").join(levels)? Seems simpler.
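The alternatives are equivalent; a quick comparison, using hypothetical namespace levels (the JDK's own `String.join` is simpler still and needs no Guava dependency):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class JoinLevelsSketch {
    // Hypothetical namespace levels for demonstration.
    static final String[] LEVELS = {"prod", "db", "events"};

    public static void main(String[] args) {
        // The streams version from the diff:
        String viaStream = Arrays.stream(LEVELS).collect(Collectors.joining("."));
        // Guava would be Joiner.on(".").join(LEVELS); the JDK equivalent:
        String viaJoin = String.join(".", LEVELS);
        System.out.println(viaStream); // prod.db.events
        System.out.println(viaJoin);   // prod.db.events
    }
}
```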
  tableIdentifier.name());
}

protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
I think this should be in Hive, not in the base implementation. The base could be used for metastores that support more nesting than 1 namespace level.
Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
    "schema name as namespace");
String schemaName = tableIdentifier.namespace().levels()[0];
Preconditions.checkArgument(schemaName != null && !schemaName.isEmpty(), "schema name can't be null or " +
Should TableIdentifier and Namespace check that none of the parts are null instead? I think that is going to be easier than validating in lots of other places.
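Constructor-time validation along those lines could look like the following sketch. `NamespaceSketch` is a simplified illustration, not the real Namespace class: rejecting null or empty parts once at construction means callers such as validateTableIdentifier never need to re-check them.

```java
public class NamespaceSketch {
    private final String[] levels;

    // Validate each part once, up front, so every downstream consumer can
    // assume the levels are non-null and non-empty.
    NamespaceSketch(String... levels) {
        for (String level : levels) {
            if (level == null || level.isEmpty()) {
                throw new IllegalArgumentException("namespace levels cannot be null or empty");
            }
        }
        this.levels = levels;
    }

    int length() {
        return levels.length;
    }

    public static void main(String[] args) {
        System.out.println(new NamespaceSketch("hivedb").length()); // 1
        try {
            new NamespaceSketch("hivedb", "");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```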
protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
  Preconditions.checkArgument(tableIdentifier.hasNamespace(), "metastore tables should have schema as namespace");
  Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
At least for Hive, I think that the correct term is "database" and not "schema" for the namespace.
import org.apache.iceberg.exceptions.NoSuchTableException;
import org.apache.iceberg.exceptions.RuntimeIOException;

public class HadoopCatalog implements Catalog, Configurable {
Why does this implement Configurable? Seems strange to me.
 * Top level Catalog APIs that supports table DDLs and namespace listing.
 */
public interface Catalog {
public interface Catalog extends Closeable {
Why make all catalogs Closeable?
static final String DB_NAME = "hivedb";
static final String TABLE_NAME = "tbl";
static final TableIdentifier TABLE_IDENTIFIER =
    new TableIdentifier(Namespace.namespace(new String[] {DB_NAME}), TABLE_NAME);
Looks like namespace should take String....
import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.iceberg.exceptions.NoSuchTableException;

public abstract class BaseMetastoreCatalog implements Catalog {
Should this replace BaseMetastoreTables?
validateTableIdentifier(to);

HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
String location = ((BaseTable) getTable(from)).operations().current().file().location();
I don't think renaming the table should change the location. Data can be stored in the old location and referenced by a new name.
try {
  Table table = hiveMetaStoreClient.getTable(oldDBName, oldTableName);

  // hive metastore renames the table's directory as part of renaming the table.
It does?
tables.close();
this.tables = null;
try {
  metastoreClient.getTable(DB_NAME, TABLE_NAME);
Why does this get the table first?
    .protocolFactory(new TBinaryProtocol.Factory())
    .minWorkerThreads(3)
    .maxWorkerThreads(5);
    .maxWorkerThreads(32);
What was the reason for changing this? Keeping it low helps us catch client leaks, like the one introduced by the newClient call in the Hive catalog.
hiveCatalog.dropTable(TABLE_IDENTIFIER);

metastoreClient.getTable(DB_NAME, TABLE_NAME);
It is usually better to use assertThrows because you can check the error message. Then you wouldn't need to load the table first.
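The assertThrows pattern the reviewer suggests can be sketched as below. The exception type and message are illustrative stand-ins for the Thrift metastore's not-found error, not the real API; with JUnit 5 the check would use `org.junit.jupiter.api.Assertions.assertThrows`.

```java
public class AssertThrowsSketch {
    // Illustrative stand-in for the metastore's not-found exception.
    static class NoSuchObjectException extends RuntimeException {
        NoSuchObjectException(String message) { super(message); }
    }

    // Stand-in for metastoreClient.getTable(...) after the table was dropped.
    static Object getTable(String db, String table) {
        throw new NoSuchObjectException(db + "." + table + " table not found");
    }

    public static void main(String[] args) {
        // With JUnit 5 this would read:
        //   NoSuchObjectException e = assertThrows(NoSuchObjectException.class,
        //       () -> metastoreClient.getTable(DB_NAME, TABLE_NAME));
        //   assertTrue(e.getMessage().contains("table not found"));
        // Hand-rolled equivalent so the sketch runs standalone:
        try {
            getTable("hivedb", "tbl");
            throw new AssertionError("expected NoSuchObjectException");
        } catch (NoSuchObjectException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Checking the message (not just the type) is what removes the need to load the table first: the assertion itself proves the drop took effect.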
private Configuration conf;

public HadoopTables() {
public HadoopCatalog() {
I don't think it makes sense to move the Hadoop implementation to be a Catalog. Identifiers don't handle paths well and there are quite a few changes for not much benefit.
What about just moving Hive to use the Catalog interface? Do we need to move HadoopTables or can we leave it as-is?
Do we see the new
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.catalog;
Should we require an empty line after the header? I think it is true for most of the files. However, we don't enforce it in checkstyle.xml. Shall we add PACKAGE_DEF to EmptyLineSeparator?
I'm fine either way. If we choose to enforce it, we'll have to add it as a separate PR.
import org.apache.iceberg.exceptions.NoSuchTableException;

/**
 * Top level Catalog APIs that supports table DDLs and namespace listing.
Should the comment be "Top-level Catalog API that supports ..." or "Top-level Catalog APIs that support ..."?
I think it should be singular API.
public interface Catalog {
public interface Catalog extends Closeable {
/**
 * creates the table or throws {@link AlreadyExistsException}.
nit: what about using the same formatting for all Javadoc in this file? I see that some comments start with a capital letter and some do not.
public abstract TableOperations newTableOps(Configuration newConf, TableIdentifier tableIdentifier);

protected String defaultWarehouseLocation(Configuration hadoopConf, TableIdentifier tableIdentifier) {
  String warehouseLocation = hadoopConf.get("hive.metastore.warehouse.dir");
Should we rely on a Hive Metastore property to get a table base location? Maybe the Hive specific implementation can set the Hive warehouse dir, though I'm not sure whether we need to provide an implementation here or leave it abstract.
Yeah, good point. I think this should be left abstract and implemented in the Hive module.
try {
  metastoreClient.getTable(DB_NAME, TABLE_NAME);
  metastoreClient.dropTable(DB_NAME, TABLE_NAME);
  this.catalog.close();
Why do we close the Catalog after drop?
I think this was to avoid exhausting handlers in the test metastore. The problem is that this is instantiating new clients instead of using the pool.
Yes, for metastore use cases.
No description provided.