Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot) #12647

jeesou · 2025-03-26T06:59:43Z

No description provided.

jeesou · 2025-03-26T08:03:00Z

This is a backport of the changes from #12482.

Kindly help review and merge.

wypoon · 2025-03-26T17:45:28Z

This is a clean backport of my fix.
@jeesou can you please use the same title as the original fix, prefixed by "Spark 3.4" --
"Spark 3.4: Use correct statistics file in SparkScan::estimateStatistics(Snapshot)".
Thanks.
(Btw, it is customary to acknowledge the author of the original fix if you're doing a backport.)

jeesou · 2025-03-27T04:30:21Z

Yes, thanks @wypoon for this fix and for reviewing it.

pvary · 2025-03-27T05:55:38Z

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java

-      if (!files.isEmpty()) {
-        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+      Optional<StatisticsFile> file =
+          files.stream().filter(f -> f.snapshotId() == snapshot.snapshotId()).findFirst();


Could we have multiple files with the same snapshotId? Or is findAny enough?

As per the Java implementation, there is always a single stats file per snapshot ID.

iceberg/core/src/main/java/org/apache/iceberg/TableMetadata.java

Line 1369 in 695374d

statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));

I followed a similar design while working on partition stats too.
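To illustrate the point about the quoted `put(...)` call, here is a minimal sketch of a map keyed by snapshot ID in which each registration replaces the previous entry. `StatsFileRegistry`, `StatsFile`, `setStatistics`, and `filesFor` are hypothetical names made up for this example, and `List.of` stands in for `ImmutableList.of`; this is not the actual `TableMetadata` code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsFileRegistry {
  // Hypothetical stand-in for a statistics file; only the fields needed here.
  public record StatsFile(long snapshotId, String path) {}

  private final Map<Long, List<StatsFile>> statisticsFiles = new HashMap<>();

  // Same shape as the quoted line: put(...) with a single-element list
  // overwrites whatever was registered for that snapshot ID before.
  public void setStatistics(StatsFile file) {
    statisticsFiles.put(file.snapshotId(), List.of(file));
  }

  public List<StatsFile> filesFor(long snapshotId) {
    return statisticsFiles.getOrDefault(snapshotId, List.of());
  }

  public static void main(String[] args) {
    StatsFileRegistry registry = new StatsFileRegistry();
    registry.setStatistics(new StatsFile(1L, "stats-a.puffin"));
    registry.setStatistics(new StatsFile(1L, "stats-b.puffin"));
    // Only the second file remains registered for snapshot 1.
    System.out.println(registry.filesFor(1L));
  }
}
```

Under these assumptions, registering a second file for the same snapshot leaves exactly one entry, which is why a lookup by snapshot ID sees at most one match.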

So we could have a theoretical performance boost from findAny.

The spec doesn't actually say that there should only be one statistics file per snapshot. This happens to be how it is implemented in Java. The spec simply allows for multiple statistics files.
I was thinking about the problem of tracking orphaned statistics files when they are recomputed. One idea I had was to keep replaced statistics files (for a snapshot) still in the list (as long as the files are tracked in metadata we can clean up unused ones), but to keep the newest one before others. Hence findFirst. It was just an idea (and honestly not one I'm seriously considering).
In any case, I do not think that findAny is faster than findFirst here.
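As a rough sketch of the lookup discussed in this thread, the following filters a list of statistics files by snapshot ID and takes the first match. `StatsFileLookup`, `StatsFile`, and `forSnapshot` are hypothetical names for illustration only. The list with two files for the same snapshot models the idea floated above (newest first), which is an assumption and not how the Java implementation currently behaves.

```java
import java.util.List;
import java.util.Optional;

public class StatsFileLookup {
  // Hypothetical stand-in for a statistics file; only the fields needed here.
  public record StatsFile(long snapshotId, String path) {}

  // Same shape as the filter in the diff: match on snapshot ID, take the first hit.
  // On an ordered, sequential stream, findFirst() returns the earliest match in list order.
  public static Optional<StatsFile> forSnapshot(List<StatsFile> files, long snapshotId) {
    return files.stream().filter(f -> f.snapshotId() == snapshotId).findFirst();
  }

  public static void main(String[] args) {
    List<StatsFile> files =
        List.of(
            new StatsFile(1L, "stats-1.puffin"),
            new StatsFile(2L, "stats-2-new.puffin"),
            new StatsFile(2L, "stats-2-old.puffin"));

    // files.get(0) would return the file for snapshot 1 regardless of the snapshot asked for.
    System.out.println(forSnapshot(files, 2L).map(StatsFile::path).orElse("none"));
    System.out.println(forSnapshot(files, 3L).map(StatsFile::path).orElse("none"));
  }
}
```

With the list ordered newest first, the lookup for snapshot 2 picks `stats-2-new.puffin`, and a snapshot with no statistics file yields an empty `Optional`. `findAny()` makes no ordering guarantee, so it would not be a safe substitute if the list order ever carried meaning.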

pvary · 2025-03-27T10:22:24Z

Why is this PR against Spark 3.4?
In Flink we usually create PRs against the current version

ajantha-bhat · 2025-03-27T10:23:51Z

@pvary: This PR looks like just a backport of #12482.

He has not updated the PR description properly (looks like a new contributor).

pvary · 2025-03-27T10:38:33Z

Thanks @ajantha-bhat for catching that this is a backport!

@jeesou: When backporting, please say so in the PR description. Also highlight any changes needed compared to the original PR, so that reviewers have an easier time.

wypoon · 2025-03-27T17:35:10Z

@pvary, @ajantha-bhat I already reviewed this a day before you and if you read my comment, I mentioned that this is a clean backport of my fix.

@jeesou as Peter stated, you should state in the description that this is a backport of #12482 (and that I am the author of the fix). I implied it but did not state it explicitly.

BAckporting Correct Stats file fetch fix to Spark 3.4

2050bd6

github-actions bot added the spark label Mar 26, 2025

jeesou changed the title ~~Backporting Correct Stats file fetch fix to Spark 3.4~~ Spark : Backporting Correct Stats file fetch fix to Spark 3.4 Mar 26, 2025

jeesou changed the title ~~Spark : Backporting Correct Stats file fetch fix to Spark 3.4~~ Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot) Mar 27, 2025

pvary reviewed Mar 27, 2025

pvary approved these changes Mar 27, 2025

pvary merged commit d54d81e into apache:main Mar 27, 2025
27 checks passed
