-
Notifications
You must be signed in to change notification settings - Fork 3k
Api, Spark: Add orphanFilesCount to DeleteOrphanFiles.Result
#14886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Api, Spark: Add orphanFilesCount to DeleteOrphanFiles.Result
#14886
Conversation
4241fd6 to
2f3dea8
Compare
deletedOrphanFilesCount to DeleteOrphanFiles.Result
bf5f6ff to
ff7e365
Compare
ff7e365 to
564e2d6
Compare
| return ImmutableDeleteOrphanFiles.Result.builder().orphanFileLocations(orphanFileList).build(); | ||
| return ImmutableDeleteOrphanFiles.Result.builder() | ||
| .orphanFileLocations(orphanFileList) | ||
| .deletedOrphanFilesCount(filesCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filesCount is the total number of orphan file paths processed (i.e., attempted deletes) across all fileGroups, but it’s not guaranteed to mean “successfully deleted” (because deletes can fail).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I updated it to orphanFilesCount()
deletedOrphanFilesCount to DeleteOrphanFiles.ResultorphanFilesCount to DeleteOrphanFiles.Result
|
It looks like there is a flaky test for RemoteScanPlanning unrelated to these changes: |
a3959de to
ddffbb9
Compare
|
|
||
| /** Returns the total number of orphan files. */ | ||
| default long orphanFilesCount() { | ||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if we return -1 to say its invalid or not realible ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set it to 0 to be consistent with other action results. For example for rewriteDataFiles the defaults are also set to 0.
I'm also fine with changing it to -1, let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singhpk234 I updated to -1 in the end. Could you please re-review? 🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah just commented. I don't think having the -1 is a good idea.
If I understand right this is specifically how many orphan files were deleted. That should always be >= 0, at least for me that's the least surprising result .
Let's say we fail to compute orphans to begin with, there should be an exception that's thrown (so there's no result to begin with). Let's say we compute the orphans, and then fail to delete 1 or more (then filesCount >= 0). etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was comming from other case pov what if we deleted the orphan file count and we said 0 were deleted since we didn't override it ?
may be then throwing unsupported exception is the right thing to do ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, an exception is already thrown if we forget to set orphanFilesCount with the builder:
java.lang.IllegalStateException: Cannot build Result, some of required attributes are not set [orphanFilesCount]
at org.apache.iceberg.actions.ImmutableDeleteOrphanFiles$Result$Builder.build(ImmutableDeleteOrphanFiles.java:255)
at org.apache.iceberg.spark.actions.DeleteOrphanFilesSparkAction.deleteFiles(DeleteOrphanFilesSparkAction.java:303)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't optimize for the case where someone extends this result interface for a custom implementation but then also forgets to override and also forgets to set the number of file counts.
Throwing an exception as a default is probably also OK but also just seems needless in that it can trigger needless failures when someone tries to dereference the number of files, after all, if someone can do orphan file removal in a custom implementation (because they have their own custom result type), they'll also probably be able to count how many files they also deleted.
I'd stick with just returning 0 for consistency with some of the other actions. I'm unaware of any cases where this has been problematic for custom extensions.
|
|
||
| /** Returns the total number of orphan files. */ | ||
| default long orphanFilesCount() { | ||
| return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the number of orphan files that were actually deleted no?
I think -1 as a default is misleading. If orphan files were computed but all failed to be deleted, then this is valid to be 0. If we fail to even compute the orphans, we'd just throw an exception and there'd be no result to begin with.
So I'd always expect a Result to have this value be greater than or equal to 0.
c39bca3 to
4fc403c
Compare
When using
stream-results, the result collectionorphanFileLocationsonly contains a sample (Max 20,000 elements) of the deleted files.It is useful to have a metric on the total number of orphan files.