-
Notifications
You must be signed in to change notification settings - Fork 3k
Add dropTable purge option to Catalog API #350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@electrum and @aokolnychyi, this PR fixes delete behavior concerns that you both raised on #240. Please have a look. |
| try (ManifestReader reader = ManifestReader.read(io.newInputFile(manifest.path()))) { | ||
| for (ManifestEntry entry : reader.entries()) { | ||
| // intern the file path because the weak key map uses identity (==) instead of equals | ||
| String path = entry.file().path().toString().intern(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielcweeks and I looked into this offline and concluded that it is now safe to use intern in this case. As of Java 7, the intern table is kept on the heap and strings in the table are eligible for garbage collection.
|
This looks like it can fail to cleanup if the process crashes after the drop. Should we have a special deleted state for tables that prevents them from being visible or queried, but allows a background process to guarantee completion of the cleanup? |
I'd be fine with that, but it is an additional feature and should probably be a separate follow-up PR. |
| * @param identifier a table identifier | ||
| * @return true if the table was dropped, false if the table did not exist | ||
| */ | ||
| boolean dropTableAndData(TableIdentifier identifier); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is another call API method necessary (it has the same parameter) or could the behavior be represented by a boolean parameter in dropTable? I think dropTableAndData adds the assumption that dropTable does not delete the data but that is not specified in the dropTable docs either so it makes it confusing.
I think avoiding another method call could help keeping the API simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to add a second call than to add a boolean parameter, since Java doesn't support calling methods using names. It isn't clear how dropTable(name, true) differs from dropTable(name, false).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't clear how dropTable(name, true) differs from dropTable(name, false)
This is not specific to boolean parameters, but any Java argument. However, usually this is clarified with clear javadocs and I agree that it would not be a good choice if multiple boolean parameters were in the method signature - however, this is not the case here.
Still, having multiple drop methods seems confusing if there's not a clear idea of what each of them do and how do they differ, which is still a problem with dropTable, since the JavaDoc does not specify how it differs from dropTableAndData.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, either way the docs should be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've updated this. The new behavior is to drop all table data in dropTable, which is documented in javadoc. I've also added a variant of dropTable with a purge flag that specifies whether to delete data files or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, @rdblue taking a look, I'm not sure I see the updates that you mentioned. I assume what you meant to say was that dropTable drops all table metadata, but not data files. I don't see that in the javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielcweeks, the new purge option will delete all data and metadata files, not just metadata files. The default dropTable calls dropTable with purge enabled. There is also a variant where you can set purge to false to avoid deleting data.
edgarRd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes!
| .onFailure((item, exc) -> LOG.warn("Failed to get deleted files: this may cause orphaned data files", exc)) | ||
| .run(manifest -> { | ||
| try (ManifestReader reader = ManifestReader.read(io.newInputFile(manifest.path()))) { | ||
| for (ManifestEntry entry : reader.entries()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reader.entries() can give back deleted entries right? What happens in that case, would it needlessly cause us to print warning exceptions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should use (ManifestEntry entry : reader) to give back live entries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reader is Iterable<DataFile> and we do want some deleted entries.
What I forgot to add was a filter for the deleted entries that have already been deleted. That set is all of the deleted files that have a snapshot in the list of snapshots that were valid just before the table was dropped. I'll update this.
| deleteFiles(io, manifestsToDelete); | ||
|
|
||
| Tasks.foreach(Iterables.transform(manifestsToDelete, ManifestFile::path)) | ||
| .noRetry().suppressFailureWhenFinished() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why didn't we parallelize snapshot and manifest file deletions, these can grow large in numbers as well right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do parallelize, maybe there's scope of reusing deleteFiles method with a little more parameterization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't expect there to be nearly as many of those.
Previously you were unable to access metadata tables from Spark Sql in Spark 2.4 because the parser/resolution rules forbid three part identifiers. As a workaround, we will now take table names and attempt to split them if they contain a # or . character if they are being read as an Iceberg table. This means that a table like "default.table" can have its metadata tables accessed with "default.`table.snapshots`" or "default.`table#snapshots`". (cherry picked from commit 1769fb3ba1d5c7fde603d9b8697a339e165338c3)
This fixes concerns raised on the commit that added
HiveCatalog. Specifically:For the second change, this adds
Catalog.dropTable(table, purge)that can leave data or delete all files referenced in the metadata tree.Catalog.dropTable(table)now callsCatalog.dropTable(table, purge=true).