
Conversation

@pvary (Contributor) commented Sep 21, 2020

If tools want to implement drop and create table operations using the Catalogs, they need a common interface for accessing the functionality.

For example, these methods will be used by HiveMetaHook to allow Iceberg table creation through Hive SQL commands.

The patch includes 2 commits:

  1. HadoopTables did not have the dropTable functionality. Refactored the code out of BaseCatalog so that it is accessible to HadoopTables, implemented the method there, and added some tests around the new function.
  2. Created the corresponding Catalogs methods and added tests around them (a usage sketch follows this list).
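For concreteness, a minimal usage sketch (not from the PR) of how a tool could drive the new entry points. The `Catalogs.createTable(Configuration, Properties)` signature and the `InputFormatConfig.TABLE_SCHEMA` key appear in the snippets reviewed below; the `dropTable(Configuration, Properties)` counterpart and the `"name"` identifier property key are assumptions here.

```java
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.Table;
import org.apache.iceberg.mr.Catalogs;
import org.apache.iceberg.mr.InputFormatConfig;
import org.apache.iceberg.types.Types;

public class CatalogsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()));

    Properties props = new Properties();
    props.setProperty("name", "default.my_table");  // identifier property key assumed
    props.setProperty(InputFormatConfig.TABLE_SCHEMA, SchemaParser.toJson(schema));

    // create through whichever catalog the Configuration selects
    Table table = Catalogs.createTable(conf, props);

    // assumed counterpart added by this patch: drop through the same catalog
    boolean dropped = Catalogs.dropTable(conf, props);
  }
}
```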

@pvary (Contributor, Author) commented Sep 21, 2020

CC: @rdblue, @massdosage - another piece of Hive integration

* @param location a path URI (e.g. hdfs:///warehouse/my_table)
* @return true if the table was dropped
*/
public boolean dropTable(String location) {
Contributor:

I think this wasn't implemented before because it is not part of the Tables API, but now that this is the only implementation, maybe we should consider just deprecating the Tables API and making HadoopTables a stand-alone class.

Contributor (Author):

Maybe this would merit another discussion, and another PR.

Contributor:

Agreed.
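In the meantime, here is a minimal sketch (not from the PR) of exercising the new method directly on HadoopTables, which is roughly what stand-alone usage would look like; the location is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class HadoopTablesDropSketch {
  public static void main(String[] args) {
    HadoopTables tables = new HadoopTables(new Configuration());

    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()));

    String location = "hdfs:///warehouse/my_table";  // a path URI, as in the javadoc above
    Table table = tables.create(schema, location);   // create an unpartitioned table

    boolean dropped = tables.dropTable(location);    // true if the table existed and was dropped
  }
}
```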

public static Table createTable(Configuration conf, Properties props) {
  String schemaString = props.getProperty(InputFormatConfig.TABLE_SCHEMA);
  Preconditions.checkNotNull(schemaString, "Table schema not set");
  Schema schema = SchemaParser.fromJson(schemaString);
Contributor:

It looks like this is the reason why the examples specify a table property. Can we instead use Hive schema DDL and convert it to Iceberg? Similarly, can we get the identity partition fields that way to create a spec?

Contributor (Author):

I think we should keep the serialized schema for the Catalogs interface; other systems like Impala, Presto, etc. might want to use it as well.
I would like to tackle the Hive schema DDL in another PR. The data is available in HiveIcebergSerDe.initialize in a somewhat convoluted way; I would like to extract it there, convert it to the Iceberg schema string, and from there push only the Iceberg-related parts further down.
What do you think?

Contributor:

I think it's fine to do this in a separate PR. I just really don't want to require setting properties with JSON schema or spec representations as the way to use Iceberg. It's okay as a way to customize when there is no dedicated syntax, but normal cases should just use DDL.
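As a rough illustration of that direction, a hypothetical sketch (not part of this PR) that derives an Iceberg Schema from Hive's standard SerDe column properties. The "columns" and "columns.types" keys are standard Hive; the conversion helper handles a few primitive types only and is purely illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

class HiveSchemaSketch {
  static Schema fromSerDeProperties(Properties serDeProps) {
    List<String> names = Arrays.asList(serDeProps.getProperty("columns").split(","));
    List<String> types = Arrays.asList(serDeProps.getProperty("columns.types").split(":"));

    Types.NestedField[] fields = new Types.NestedField[names.size()];
    for (int i = 0; i < names.size(); i++) {
      // assign sequential field ids; a real implementation must also handle
      // complex and parameterized Hive types
      fields[i] = Types.NestedField.optional(i + 1, names.get(i), toIcebergType(types.get(i)));
    }
    return new Schema(fields);
  }

  private static Type toIcebergType(String hiveType) {
    switch (hiveType) {
      case "int":    return Types.IntegerType.get();
      case "bigint": return Types.LongType.get();
      case "string": return Types.StringType.get();
      case "double": return Types.DoubleType.get();
      default: throw new UnsupportedOperationException("Unhandled Hive type: " + hiveType);
    }
  }
}
```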

}
}

Optional<Catalog> catalog = loadCatalog(conf);
Contributor:

Somewhat out of scope: We might want to build a Catalog for this logic so that this class can avoid loading and checking the catalog in every method. The catalog would get created with the configuration and handle this delegation internally.

Contributor (Author):

HiveCatalog has a cache, but this might be useful for other Catalogs as well.
This seems like a good idea to pursue, but I cannot promise anything here, as I have too much on my plate currently.
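A hypothetical sketch (not in this PR) of the delegating idea: resolve the catalog once per instance instead of calling loadCatalog(conf) in every method. The class and the fallback wiring are illustrative; loadCatalog below stands in for the private helper in Catalogs.

```java
import java.util.Optional;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;

class CachingCatalogFacade {
  private final Configuration conf;
  private final Optional<Catalog> catalog;  // empty means "fall back to HadoopTables"

  CachingCatalogFacade(Configuration conf) {
    this.conf = conf;
    this.catalog = loadCatalog(conf);       // loaded once, reused by every call
  }

  Table createTable(TableIdentifier identifier, Schema schema, PartitionSpec spec, String location) {
    if (catalog.isPresent()) {
      return catalog.get().createTable(identifier, schema, spec);
    }
    return new HadoopTables(conf).create(schema, spec, location);
  }

  boolean dropTable(TableIdentifier identifier, String location) {
    if (catalog.isPresent()) {
      return catalog.get().dropTable(identifier);
    }
    return new HadoopTables(conf).dropTable(location);
  }

  private static Optional<Catalog> loadCatalog(Configuration conf) {
    // stand-in for the private loadCatalog(conf) above; the real logic would
    // inspect the configuration and instantiate the configured catalog
    return Optional.empty();
  }
}
```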

@rdblue (Contributor) commented Sep 24, 2020

Thanks, @pvary! Mostly looks good, but I'd like to fix the protected method that was removed. People might rely on it since it was in a release.
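For reference, one common shape for that fix, as a hypothetical fragment: keep the released protected method as a deprecated delegate to wherever the logic moved. The exact signature and the helper name below are illustrative, not taken from this PR.

```java
// Hypothetical compatibility shim (illustrative signature and helper name):
// the protected method that shipped in a release stays behind, deprecated,
// and delegates to the refactored location of the drop logic.
@Deprecated
protected static boolean dropTableData(FileIO io, TableMetadata metadata) {
  return SharedDropUtil.dropTableData(io, metadata);
}
```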

lastMetadata = ops.current();
}
} else {
throw new NoSuchTableException("Table does not exist at location: %s, so it can not be dropped", location);
Contributor:

I should have caught this yesterday, but shouldn't this return false instead of throwing the exception? That's what all the other drop methods do. If the table doesn't exist, it isn't an exceptional case. It just returns false to signal that nothing needed to be done.

Contributor (Author):

> I should have caught this yesterday, but shouldn't this return false instead of throwing the exception? That's what all the other drop methods do. If the table doesn't exist, it isn't an exceptional case. It just returns false to signal that nothing needed to be done.

Good point! Finding it now is better than finding it after pushing the PR 😄
Fixed.
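The corrected shape, as an illustrative fragment (not the exact merged code):

```java
public boolean dropTable(String location) {
  TableOperations ops = newTableOps(location);
  TableMetadata lastMetadata = ops.current();
  if (lastMetadata == null) {
    // no current metadata: no table exists at this location, which is a
    // normal outcome rather than an exceptional one
    return false;
  }
  // ... delete the table's data and metadata as before ...
  return true;
}
```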

@rdblue merged commit 66a37c2 into apache:master on Sep 25, 2020
@rdblue (Contributor) commented Sep 25, 2020

Thanks for the fix! I merged this.
