Core: Fix incremental compute of partition stats for various edge cases #13163

pvary merged 3 commits into apache:main
Conversation
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java
core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java
@ajantha-bhat Thanks for working on this! Left some comments.
Force-pushed from 37ac725 to 5e092ef
```diff
 try (CloseableIterable<PartitionStats> oldStats =
-    readPartitionStatsFile(schema(partitionType), Files.localInput(previousStatsFile.path()))) {
+    readPartitionStatsFile(
+        schema(partitionType), table.io().newInputFile(previousStatsFile.path()))) {
```
Just using the table's FileIO, as pointed out in the PR. Unrelated to this bug fix, but related to this feature.
```java
id ->
    table.snapshot(id).allManifests(table.io()).stream()
        .filter(file -> file.snapshotId().equals(id)))
    .collect(Collectors.toList());
```
I also checked that if snapshots are expired, the caller cannot find previous stats for the table, so it will fall back to a full compute.
Also note that, because of the snapshot id filter, each snapshot's added manifest files are considered only once for the compute, so reused manifests won't be counted again. If manifests are rewritten, their entries are marked as EXISTING and won't be considered for incremental compute by the existing logic in collectStatsForManifest.
So IMO it works for all the scenarios now, and we have test cases covering all of them.
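The snapshot-id filter described above can be sketched with plain collections. This is a hypothetical simplified model, not the real Iceberg code: the `Manifest` record and `addedManifests` helper stand in for Iceberg's `ManifestFile` (whose `snapshotId()` returns the snapshot that added it) and the stream pipeline shown in the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the snapshot-id filter: for each snapshot in the
// incremental range, keep only the manifests that snapshot itself added,
// so a manifest reused by later snapshots is counted exactly once.
class IncrementalManifestSelection {

  // Stand-in for Iceberg's ManifestFile: records the snapshot that added it.
  record Manifest(String path, long addedSnapshotId) {}

  static List<Manifest> addedManifests(
      List<Long> snapshotIdsRange, Map<Long, List<Manifest>> allManifestsBySnapshot) {
    List<Manifest> result = new ArrayList<>();
    for (long id : snapshotIdsRange) {
      for (Manifest m : allManifestsBySnapshot.get(id)) {
        if (m.addedSnapshotId() == id) {
          result.add(m); // manifest added by this snapshot: include once
        }
      }
    }
    return result;
  }
}
```

With this filter, a manifest carried over from snapshot 1 into snapshot 2's manifest list is only picked up while processing snapshot 1.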
Force-pushed from 5e092ef to 6a6e2f1
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java
@lirui-apache: Could you please take another look at the fix, verify, and approve the PR if it looks good to you?
lirui-apache left a comment:
Thanks for the fix @ajantha-bhat. I only have a minor comment. But I have another question related to the feature that may be worth another PR. In PartitionStatsHandler::latestStatsFile, we throw an exception if there is a previous stats file that is not reachable via the parent snapshot pointers. I think this means we don't allow gaps in the snapshot history, which can be common when using tags. Wouldn't it be more user friendly to just fall back to a full compute in that case?
```java
// So, for incremental computation, gather the manifests added by each snapshot
// instead of relying solely on those from the latest snapshot.
List<ManifestFile> manifests =
    snapshotIdsRange.stream()
```
nit: How about just calling SnapshotUtil::ancestorsBetween and iterating through the ancestors?
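The semantics of walking ancestors between two snapshots can be sketched with a plain parent-pointer map. This is a hypothetical stand-in for `SnapshotUtil::ancestorsBetween` (which operates on Iceberg `Table` and `Snapshot` objects), showing only the traversal idea: follow parent pointers from the latest snapshot back to, but excluding, the snapshot the previous stats were computed for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ancestorsBetween-style traversal over snapshot ids.
class AncestorWalk {

  // parentById maps each snapshot id to its parent id; the root has no entry.
  static List<Long> ancestorsBetween(Long latestId, Long oldestId, Map<Long, Long> parentById) {
    List<Long> ancestors = new ArrayList<>();
    Long current = latestId;
    // Walk parent pointers until we reach the excluded lower bound (or the root).
    while (current != null && !current.equals(oldestId)) {
      ancestors.add(current);
      current = parentById.get(current);
    }
    return ancestors; // newest first, oldestId excluded
  }
}
```

Iterating this list newest-to-oldest (or reversing it) gives exactly the snapshot range whose added manifests need incremental stats compute.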
Great. I didn't think of this case before; hence I had added that exception. I made it fall back now and added a test.
@pvary: Thanks for the previous review and approval. Please take another look after @lirui-apache's approval.
lirui-apache left a comment:
Thanks for updating. LGTM
Merged to main.
Fixes an edge case where deleted-entries information was not carried over to the next snapshot, so the incremental stats compute was wrong for this copy-on-write case. The fix is to apply the incremental stats compute snapshot by snapshot over each snapshot's added manifests.
Also falls back to a full compute when a stats file exists for the table but not on the current snapshot chain, instead of throwing an exception.
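The fallback decision can be sketched as a small lookup. This is a hypothetical simplification (the real logic lives in `PartitionStatsHandler::latestStatsFile`): scan the current branch's ancestors, newest first, for a snapshot that has a stats file; if none is reachable, signal a full recompute rather than throwing.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the stats-file fallback: names are illustrative.
class StatsFallback {

  // ancestorIds: snapshot ids on the current branch, newest first.
  // snapshotsWithStats: snapshot ids that have a partition stats file.
  static Optional<Long> latestUsableStatsSnapshot(
      List<Long> ancestorIds, Set<Long> snapshotsWithStats) {
    for (long id : ancestorIds) {
      if (snapshotsWithStats.contains(id)) {
        return Optional.of(id); // incremental compute starts after this snapshot
      }
    }
    return Optional.empty(); // no reachable stats file: fall back to full compute
  }
}
```

An empty result here covers both the expired-snapshots case and the history-gap case raised in review (e.g. tags), without raising an exception.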
Fixes: #13155