Spark: Use correct statistics file in SparkScan::estimateStatistics(Snapshot) #12482

wypoon · 2025-03-08T20:23:58Z

This fixes a bug in SparkScan::estimateStatistics(Snapshot).
Table::statisticsFiles() returns a List<StatisticsFile>. We need to get the StatisticsFile with the snapshotId of the Snapshot for use in estimating the statistics.
I modified an existing test so that it fails without the fix and passes with it.
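
To illustrate the idea behind the fix, here is a minimal, self-contained sketch of selecting a statistics file by snapshot ID instead of taking the first element of the list. `StatsFile`, `StatsFileLookup`, and `forSnapshot` are hypothetical stand-ins invented for this example; they are not Iceberg classes and this is not the code in the patch.

```java
import java.util.List;
import java.util.Optional;

public class StatsFileLookup {
  // Hypothetical stand-in for a statistics file entry; only the fields needed here.
  static final class StatsFile {
    final long snapshotId;
    final String path;

    StatsFile(long snapshotId, String path) {
      this.snapshotId = snapshotId;
      this.path = path;
    }
  }

  // Return the entry recorded for the given snapshot, if any,
  // rather than assuming the first list element is the right one.
  static Optional<StatsFile> forSnapshot(List<StatsFile> files, long snapshotId) {
    return files.stream().filter(f -> f.snapshotId == snapshotId).findFirst();
  }

  public static void main(String[] args) {
    List<StatsFile> files =
        List.of(new StatsFile(111L, "stats-111.puffin"), new StatsFile(222L, "stats-222.puffin"));

    // A matching entry is found regardless of its position in the list.
    System.out.println(forSnapshot(files, 222L).map(f -> f.path).orElse("none"));
    // No entry for this snapshot: the caller has to handle the empty case.
    System.out.println(forSnapshot(files, 333L).map(f -> f.path).orElse("none"));
  }
}
```

Returning an `Optional` makes the "no statistics for this snapshot" case explicit, which matters because a snapshot may have no statistics file at all.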

wypoon · 2025-03-08T20:36:37Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java

+
+    Map<String, Long> expectedNDV = Maps.newHashMap();
+    expectedNDV.put("id", 6L);
+    withSQLConf(reportColStatsEnabled, () -> checkColStatisticsReported(scan, 6L, expectedNDV));


The test is parameterized with three parameters. When run on its own without the fix, between one and three cases fail. The correct StatisticsFile sometimes appears first in the List because of how TableMetadata is built:

iceberg/core/src/main/java/org/apache/iceberg/TableMetadata.java

Line 1602 in 9a8466c

statisticsFiles.values().stream().flatMap(List::stream).collect(Collectors.toList()),

The List is built from a Map, so the order of its entries depends on the hashing of the snapshotId (which is random).
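
The ordering effect can be reproduced in isolation. The sketch below assumes the map is a `java.util.HashMap` keyed by snapshot ID (an assumption made for illustration; the class and method names are hypothetical) and flattens its values with the same stream expression as the line quoted above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapOrderDemo {
  // Flatten the map's values into one list, as in the quoted TableMetadata line.
  static List<String> flatten(Map<Long, List<String>> bySnapshotId) {
    return bySnapshotId.values().stream().flatMap(List::stream).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<Long, List<String>> bySnapshotId = new HashMap<>();
    // Insert the entry for snapshot 2 first, then the one for snapshot 1.
    bySnapshotId.put(2L, List.of("stats-for-snapshot-2"));
    bySnapshotId.put(1L, List.of("stats-for-snapshot-1"));

    // HashMap iterates by hash bucket, not by insertion order, so the
    // flattened list need not start with the entry that was added first.
    System.out.println(flatten(bySnapshotId));
  }
}
```

With small keys like these the iteration order happens to be stable for a given JDK, but real snapshot IDs are random longs, so which file lands at index 0 is effectively arbitrary. That is consistent with only some of the parameterized cases failing without the fix.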

wypoon · 2025-03-10T16:57:18Z

@huaxingao can you please review this?

wypoon · 2025-03-20T05:10:49Z

@findepi can you please review this simple fix?

wypoon · 2025-03-20T16:38:53Z

Thanks @findepi!

Use correct statistics file in SparkScan::estimateStatistics(Snapshot)

90c7aac

Table::statisticsFiles() returns a List. We need to get the StatisticsFile with the snapshotId of the Snapshot.

github-actions bot added the spark label Mar 8, 2025

wypoon mentioned this pull request Mar 8, 2025

Support Spark Column Stats #10659

Merged

findepi approved these changes Mar 20, 2025

findepi merged commit ff5004e into apache:main Mar 20, 2025
27 checks passed

This was referenced Mar 26, 2025

Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot) #12647

Merged

Use Snapshot's statistics file in SparkScan #11040

Closed

pvary pushed a commit that referenced this pull request Mar 27, 2025

Spark 3.4: Use correct statistics file in SparkScan::estimateStatisti…

d54d81e

…cs(Snapshot) (#12647) This backports #12482 to Spark 3.4
