Skip to content

Conversation

@alessandro-nori
Copy link
Contributor

@alessandro-nori alessandro-nori commented Dec 18, 2025

When using stream-results, the result collection orphanFileLocations only contains a sample (Max 20,000 elements) of the deleted files.
It is useful to have a metric on the total number of orphan files.

@alessandro-nori alessandro-nori force-pushed the add-deleted-orphan-files-count branch from 4241fd6 to 2f3dea8 Compare December 18, 2025 19:24
@alessandro-nori alessandro-nori changed the title add deletedOrphanFilesCount Api, Spark: Add deletedOrphanFilesCount to DeleteOrphanFiles.Result Dec 18, 2025
@alessandro-nori alessandro-nori changed the title Api, Spark: Add deletedOrphanFilesCount to DeleteOrphanFiles.Result Api, Spark: Add deletedOrphanFilesCount to DeleteOrphanFiles.Result Dec 18, 2025
@alessandro-nori alessandro-nori force-pushed the add-deleted-orphan-files-count branch 2 times, most recently from bf5f6ff to ff7e365 Compare December 18, 2025 19:32
@alessandro-nori alessandro-nori marked this pull request as draft December 18, 2025 19:48
@alessandro-nori alessandro-nori force-pushed the add-deleted-orphan-files-count branch from ff7e365 to 564e2d6 Compare December 18, 2025 20:04
@github-actions github-actions bot added the core label Dec 18, 2025
@alessandro-nori alessandro-nori marked this pull request as ready for review December 18, 2025 21:18
return ImmutableDeleteOrphanFiles.Result.builder().orphanFileLocations(orphanFileList).build();
return ImmutableDeleteOrphanFiles.Result.builder()
.orphanFileLocations(orphanFileList)
.deletedOrphanFilesCount(filesCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filesCount is the total number of orphan file paths processed (i.e., attempted deletes) across all fileGroups, but it’s not guaranteed to mean “successfully deleted” (because deletes can fail).

Copy link
Contributor Author

@alessandro-nori alessandro-nori Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I updated it to orphanFilesCount()

@alessandro-nori alessandro-nori changed the title Api, Spark: Add deletedOrphanFilesCount to DeleteOrphanFiles.Result Api, Spark: Add orphanFilesCount to DeleteOrphanFiles.Result Dec 19, 2025
@alessandro-nori
Copy link
Contributor Author

alessandro-nori commented Dec 19, 2025

It looks like there is a flaky test for RemoteScanPlanning unrelated to these changes:

TestRemoteScanPlanning > testUseSnapshotIdForTagReferenceAsOf() > catalogName = testrest, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=rest, cache-enabled=false, uri=http://localhost:44373/, rest-scan-planning-enabled=true}, binaryTableName = testrest.default.binary_table FAILED
    org.opentest4j.AssertionFailedError: [Snapshot at specific tag reference name: row 1 contents should match] 
    expected: 2L
     but was: 1L

@alessandro-nori alessandro-nori force-pushed the add-deleted-orphan-files-count branch from a3959de to ddffbb9 Compare December 19, 2025 14:13

/** Returns the total number of orphan files. */
default long orphanFilesCount() {
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we return -1 to say its invalid or not realible ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it to 0 to be consistent with other action results. For example for rewriteDataFiles the defaults are also set to 0.
I'm also fine with changing it to -1, let me know what you think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singhpk234 I updated to -1 in the end. Could you please re-review? 🙇

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah just commented. I don't think having the -1 is a good idea.

If I understand right this is specifically how many orphan files were deleted. That should always be >= 0, at least for me that's the least surprising result .

Let's say we fail to compute orphans to begin with, there should be an exception that's thrown (so there's no result to begin with). Let's say we compute the orphans, and then fail to delete 1 or more (then filesCount >= 0). etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was comming from other case pov what if we deleted the orphan file count and we said 0 were deleted since we didn't override it ?
may be then throwing unsupported exception is the right thing to do ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, an exception is already thrown if we forget to set orphanFilesCount with the builder:

java.lang.IllegalStateException: Cannot build Result, some of required attributes are not set [orphanFilesCount]
        at org.apache.iceberg.actions.ImmutableDeleteOrphanFiles$Result$Builder.build(ImmutableDeleteOrphanFiles.java:255)
        at org.apache.iceberg.spark.actions.DeleteOrphanFilesSparkAction.deleteFiles(DeleteOrphanFilesSparkAction.java:303)

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't optimize for the case where someone extends this result interface for a custom implementation but then also forgets to override and also forgets to set the number of file counts.

Throwing an exception as a default is probably also OK but also just seems needless in that it can trigger needless failures when someone tries to dereference the number of files, after all, if someone can do orphan file removal in a custom implementation (because they have their own custom result type), they'll also probably be able to count how many files they also deleted.

I'd stick with just returning 0 for consistency with some of the other actions. I'm unaware of any cases where this has been problematic for custom extensions.


/** Returns the total number of orphan files. */
default long orphanFilesCount() {
return -1;
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the number of orphan files that were actually deleted no?
I think -1 as a default is misleading. If orphan files were computed but all failed to be deleted, then this is valid to be 0. If we fail to even compute the orphans, we'd just throw an exception and there'd be no result to begin with.

So I'd always expect a Result to have this value be greater than or equal to 0.

@alessandro-nori alessandro-nori force-pushed the add-deleted-orphan-files-count branch from c39bca3 to 4fc403c Compare January 6, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants