Add dropTable purge option to Catalog API #350

rdblue · 2019-08-05T20:20:46Z

This fixes concerns raised on the commit that added HiveCatalog. Specifically:

- The call to Hive's drop table should specify not to delete data, in case older tables are not external
- There should be a way to clean up table data as part of the drop operation

For the second change, this adds Catalog.dropTable(table, purge) that can leave data or delete all files referenced in the metadata tree. Catalog.dropTable(table) now calls Catalog.dropTable(table, purge=true).
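
As a rough illustration of the API shape described above, here is a minimal sketch. `SimpleCatalog` and `InMemoryCatalog` are hypothetical stand-ins, not the actual Iceberg classes: identifiers are plain strings and "storage" is just a set of file paths. It only shows the one-argument `dropTable` delegating to the two-argument form with purge enabled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PurgeSketch {
  /** Hypothetical, simplified stand-in for the Catalog interface. */
  interface SimpleCatalog {
    /** Drops the table and purges its files by delegating with purge=true. */
    default boolean dropTable(String identifier) {
      return dropTable(identifier, true);
    }

    /** Drops the table; deletes the files it references only when purge is true. */
    boolean dropTable(String identifier, boolean purge);
  }

  /** Toy catalog: maps a table name to the set of file paths it references. */
  static class InMemoryCatalog implements SimpleCatalog {
    final Map<String, Set<String>> tables = new HashMap<>();
    final Set<String> storage = new HashSet<>();

    void createTable(String identifier, Set<String> files) {
      tables.put(identifier, new HashSet<>(files));
      storage.addAll(files);
    }

    @Override
    public boolean dropTable(String identifier, boolean purge) {
      Set<String> files = tables.remove(identifier);
      if (files == null) {
        return false; // the table did not exist
      }
      if (purge) {
        storage.removeAll(files); // delete every file the table referenced
      }
      return true;
    }
  }
}
```

With this sketch, `dropTable("db.a")` removes the table and its files, while `dropTable("db.b", false)` removes only the table entry and leaves its files in `storage`.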

rdblue · 2019-08-05T20:21:51Z

@electrum and @aokolnychyi, this PR fixes delete behavior concerns that you both raised on #240. Please have a look.

rdblue · 2019-08-05T20:23:08Z

core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java

+      try (ManifestReader reader = ManifestReader.read(io.newInputFile(manifest.path()))) {
+        for (ManifestEntry entry : reader.entries()) {
+          // intern the file path because the weak key map uses identity (==) instead of equals
+          String path = entry.file().path().toString().intern();


@danielcweeks and I looked into this offline and concluded that it is now safe to use intern in this case. As of Java 7, the intern table is kept on the heap and strings in the table are eligible for garbage collection.
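
To make the reason for interning concrete, here is a small sketch that uses `java.util.IdentityHashMap` as a stand-in for an identity-keyed map. That substitution is an assumption for illustration; the PR code uses a weak-key map, which this does not reproduce. Two equal path strings that are distinct objects count as two keys unless they are interned first. The path value is made up.

```java
import java.util.IdentityHashMap;
import java.util.Map;

class InternDemo {
  /** Returns how many distinct keys an identity-based map sees for two equal strings. */
  static int countDistinctKeys(boolean intern) {
    // IdentityHashMap compares keys with == rather than equals()
    Map<String, Boolean> seen = new IdentityHashMap<>();

    // Two equal strings that are deliberately different objects
    String[] paths = {
        new String("s3://bucket/data/a.parquet"),
        new String("s3://bucket/data/a.parquet")
    };

    for (String path : paths) {
      // intern() returns the canonical instance for equal strings
      String key = intern ? path.intern() : path;
      seen.put(key, Boolean.TRUE);
    }
    return seen.size();
  }
}
```

Without interning the map ends up with two entries for the same path, so a "have I already deleted this file?" check would miss the duplicate; with interning it ends up with one.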

electrum · 2019-08-06T00:16:52Z

This looks like it can fail to clean up if the process crashes after the drop. Should we have a special deleted state for tables that prevents them from being visible or queried, but allows a background process to guarantee completion of the cleanup?

rdblue · 2019-08-06T16:25:41Z

> Should we have a special deleted state for tables that prevents them from being visible or queried ... ?

I'd be fine with that, but it is an additional feature and should probably be a separate follow-up PR.

edgarRd · 2019-08-06T18:33:34Z

api/src/main/java/org/apache/iceberg/catalog/Catalog.java

+   * @param identifier a table identifier
+   * @return true if the table was dropped, false if the table did not exist
+   */
+  boolean dropTableAndData(TableIdentifier identifier);


Is another API method necessary (it has the same parameter), or could the behavior be represented by a boolean parameter in dropTable? I think dropTableAndData implies that dropTable does not delete the data, but that is not specified in the dropTable docs either, which makes it confusing.

I think avoiding another method could help keep the API simple.

I think it is better to add a second call than to add a boolean parameter, since Java doesn't support named arguments. It isn't clear how dropTable(name, true) differs from dropTable(name, false).

> It isn't clear how dropTable(name, true) differs from dropTable(name, false)

This is not specific to boolean parameters; it applies to any Java argument. Usually this is addressed with clear Javadoc. I agree that a boolean would be a poor choice if the method signature had multiple boolean parameters, but that is not the case here.

Still, having multiple drop methods seems confusing if there is no clear idea of what each of them does and how they differ. That is still a problem with dropTable, since its Javadoc does not specify how it differs from dropTableAndData.

I agree, either way the docs should be updated.

Okay, I've updated this. The new behavior is to drop all table data in dropTable, which is documented in javadoc. I've also added a variant of dropTable with a purge flag that specifies whether to delete data files or not.

Hey, @rdblue taking a look, I'm not sure I see the updates that you mentioned. I assume what you meant to say was that dropTable drops all table metadata, but not data files. I don't see that in the javadoc.

@danielcweeks, the new purge option will delete all data and metadata files, not just metadata files. The default dropTable calls dropTable with purge enabled. There is also a variant where you can set purge to false to avoid deleting data.

edgarRd

Thanks for the changes!

rdsr · 2019-08-07T21:01:10Z

core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java

+        .onFailure((item, exc) -> LOG.warn("Failed to get deleted files: this may cause orphaned data files", exc))
+        .run(manifest -> {
+          try (ManifestReader reader = ManifestReader.read(io.newInputFile(manifest.path()))) {
+            for (ManifestEntry entry : reader.entries()) {


reader.entries() can give back deleted entries, right? What happens in that case? Would it needlessly cause us to print warning exceptions?

Maybe we should use (ManifestEntry entry : reader) to give back live entries?

The reader is Iterable<DataFile> and we do want some deleted entries.

What I forgot to add was a filter for the deleted entries that have already been deleted. That set is all of the deleted files that have a snapshot in the list of snapshots that were valid just before the table was dropped. I'll update this.

rdsr · 2019-08-07T21:04:47Z

core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java

+    deleteFiles(io, manifestsToDelete);
+
+    Tasks.foreach(Iterables.transform(manifestsToDelete, ManifestFile::path))
+        .noRetry().suppressFailureWhenFinished()


Why didn't we parallelize snapshot and manifest file deletions? These can grow large in number as well, right?

If we do parallelize, maybe there's scope for reusing the deleteFiles method with a little more parameterization?

We don't expect there to be nearly as many of those.
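
For readers unfamiliar with the pattern under discussion, here is a rough stdlib-only sketch of parallel, best-effort deletion: every path is attempted once, and a failure is counted instead of aborting the remaining work. It is not Iceberg's `Tasks` utility and does not claim to match its semantics; `BestEffortDelete`, `deleteAll`, and the `deleteFn` callback are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class BestEffortDelete {
  /**
   * Attempts to delete every path once, in parallel, without retries.
   * Failures are counted rather than rethrown. Returns the number of failures.
   */
  static int deleteAll(List<String> paths, Consumer<String> deleteFn, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger failures = new AtomicInteger();

    List<Callable<Void>> tasks = new ArrayList<>();
    for (String path : paths) {
      tasks.add(() -> {
        try {
          deleteFn.accept(path);
        } catch (RuntimeException e) {
          // Swallow the failure so the other deletes still run
          failures.incrementAndGet();
        }
        return null;
      });
    }

    try {
      pool.invokeAll(tasks); // blocks until every task has finished
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
    } finally {
      pool.shutdown();
    }
    return failures.get();
  }
}
```

A caller would pass something like `path -> io.deleteFile(path)` as `deleteFn` and log the returned failure count, since each failure may leave an orphaned file behind.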

Previously you were unable to access metadata tables from Spark SQL in Spark 2.4 because the parser/resolution rules forbid three-part identifiers. As a workaround, when a table is read as an Iceberg table, we now attempt to split its name if it contains a # or . character. This means that a table like "default.table" can have its metadata tables accessed with "default.`table.snapshots`" or "default.`table#snapshots`". (cherry picked from commit 1769fb3ba1d5c7fde603d9b8697a339e165338c3)

Add dropTableAndData to Catalog API.

65599da

rdblue commented Aug 5, 2019

rdblue added this to the 0.1.0 Release milestone Aug 5, 2019

Fix indentation.

eb627b1

rdblue requested a review from danielcweeks August 5, 2019 23:18

rdblue self-assigned this Aug 5, 2019

edgarRd reviewed Aug 6, 2019

rdblue removed their assignment Aug 7, 2019

Add a purge flag to dropTable instead of dropTableAndData.

95224dd

rdblue changed the title ~~Add dropTableAndData to Catalog API~~ Add dropTable purge option to Catalog API Aug 7, 2019

edgarRd approved these changes Aug 7, 2019

rdsr reviewed Aug 7, 2019

aokolnychyi approved these changes Aug 25, 2019

rdblue merged commit 6cac41b into apache:master Aug 26, 2019

zhangdove mentioned this pull request Nov 26, 2020

When using hiveCatalog.dropTable(identifier, true), the table directory is not completely removed #1764

Closed
