Spark: Update RewriteDataFilesSparkAction and RewritePositionDeleteFilesSparkAction to use the new APIs by pvary · Pull Request #12692 · apache/iceberg

pvary · 2025-03-30T22:11:54Z

No description provided.

pvary · 2025-03-31T06:53:56Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java

    assertThatThrownBy(() -> actions().rewriteDataFiles(table).binPack().sort())
        .isInstanceOf(IllegalArgumentException.class)
-        .hasMessage("Must use only one rewriter type (bin-pack, sort, zorder)");
+        .hasMessage("Must use only one runner type (bin-pack, sort, zorder)");


I'm not sure about this change.
The internal naming has changed, but I'm not sure we have to expose this to users.

I wouldn't go with the "runner" term. That's not really a concept in Spark, so I think this just exposes the internals in a confusing way.

We use "strategy" in the user doc.

I decided to keep this error message intact.

pvary · 2025-04-07T15:31:24Z

@danielcweeks, @manuzhang, @RussellSpitzer: I think this PR is ready for review now. Sorry for the force push. Instead of creating new files, I renamed the old ones so it is easier to review. This required me to force-push my changes. I will not do it again during the review 😄

RussellSpitzer · 2025-04-11T16:29:40Z

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java

    Preconditions.checkArgument(
-        rewriter == null, "Must use only one rewriter type (bin-pack, sort, zorder)");
-    this.rewriter = new SparkBinPackDataRewriter(spark(), table);
+        runner == null, "Must use only one rewriter type (bin-pack, sort, zorder)");


Follow-up: we need to clean up these messages. They should probably say "Rewriter type already set to %s" or something like that.

This is the second request to fix the error message, and it is not too complicated, so I changed it.

The solution is a bit lame:

    Preconditions.checkArgument(
        runner == null,
        "Rewriter type already set to %s",
        runner == null ? null : runner.description());

We need a second null check for the error message, or a null check around it 😢
Decided to hide this ugliness in a method.
If you have better ideas, feel free to comment
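
One way to hide the second null check is a small helper that only builds the message on the failure path. The following is a minimal sketch, not the code from this PR: `RewriteModeGuard`, `Runner`, and `ensureRunnerNotSet` are hypothetical names, and a plain `IllegalArgumentException` stands in for `Preconditions.checkArgument`.

```java
// Hypothetical sketch: the names below are illustrative and not taken from the PR.
class RewriteModeGuard {
  /** Stand-in for the runner; only description() matters for the message. */
  interface Runner {
    String description();
  }

  private Runner runner;

  RewriteModeGuard binPack() {
    ensureRunnerNotSet();
    this.runner = () -> "BIN-PACK";
    return this;
  }

  RewriteModeGuard sort() {
    ensureRunnerNotSet();
    this.runner = () -> "SORT";
    return this;
  }

  // description() is only reached when runner is non-null, so the call sites
  // need neither a ternary nor a second null check.
  private void ensureRunnerNotSet() {
    if (runner != null) {
      throw new IllegalArgumentException(
          String.format("Rewriter type already set to %s", runner.description()));
    }
  }
}
```

With this shape, `new RewriteModeGuard().binPack().sort()` fails in `sort()` with the message "Rewriter type already set to BIN-PACK".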

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java

RussellSpitzer · 2025-04-11T23:02:09Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkRewriteRunner.java

+import org.apache.spark.sql.SparkSession;
+
+/**
+ * Common parent for data and positional delete rewrite runners.


The description here is unclear. I think this is meant to essentially replace the meat of the Action itself? Like it's a container for planner + rewriter?

The Action uses the two interfaces defined by the new API: a Planner for grouping, and a Runner for executing the actual compaction.

This class is the base for the Spark based Runner implementations. Updated the javadoc
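
The planner/runner split described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Planner` and `Runner` interfaces below are hypothetical stand-ins, not the signatures of the Iceberg API, and files are modelled as plain sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these interfaces are NOT the Iceberg API signatures.
class PlannerRunnerSketch {
  /** Decides which files are rewritten together. */
  interface Planner {
    List<List<Long>> plan(List<Long> fileSizes);
  }

  /** Executes the rewrite of one planned group. */
  interface Runner {
    String description();

    long rewrite(List<Long> group); // returns the size of the single output file
  }

  /** Greedy grouping: close the current group once the next file would exceed targetSize. */
  static Planner greedyPlanner(long targetSize) {
    return fileSizes -> {
      List<List<Long>> groups = new ArrayList<>();
      List<Long> current = new ArrayList<>();
      long currentSize = 0;
      for (long size : fileSizes) {
        if (!current.isEmpty() && currentSize + size > targetSize) {
          groups.add(current);
          current = new ArrayList<>();
          currentSize = 0;
        }
        current.add(size);
        currentSize += size;
      }
      if (!current.isEmpty()) {
        groups.add(current);
      }
      return groups;
    };
  }

  /** Toy runner that "compacts" a group into one file holding the summed size. */
  static Runner concatRunner() {
    return new Runner() {
      @Override
      public String description() {
        return "CONCAT";
      }

      @Override
      public long rewrite(List<Long> group) {
        long total = 0;
        for (long size : group) {
          total += size;
        }
        return total;
      }
    };
  }

  /** The action only wires the two together: plan first, then run each group. */
  static List<Long> execute(Planner planner, Runner runner, List<Long> fileSizes) {
    List<Long> outputs = new ArrayList<>();
    for (List<Long> group : planner.plan(fileSizes)) {
      outputs.add(runner.rewrite(group));
    }
    return outputs;
  }
}
```

Keeping the grouping decision separate from the execution means an engine-specific base class only has to cover the execution side, which is the role the comment above gives to the Spark runner base class.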

RussellSpitzer · 2025-04-11T23:03:34Z

...5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewritePlanner.java

+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * Extends the {@link BinPackRewriteFilePlanner} with the possibility to set the expected


Needs more description here. Extends the BinPack rewriter for Rewriters that induce a distributed shuffle to reorganize data. (or something like this)

Again, this is for the Planner. There was a specific configuration for the Shuffling rewriters which is modifying the plans. So I had to extend the core planners with the new configuration.

Alternatively, we can move this functionality to the generic planner, in which case we don't need a specific class here. The config would then be available for every planner (which might be useful if there are compaction changes on any rewrite), but we would have to reintroduce the flag shufflingPlanner (maybe with another name) to decide which Runner implementation to use.

I was just talking about the Javadoc. I have no problem with the code organization, but the key detail about this planner is that it produces plans for shuffling rewrites, not that it has an additional option.

I thought this class adds an additional option, COMPRESSION_FACTOR, used when calculating expectedOutputFiles (line 60). I guess shuffle/sort improves the compression ratio and reduces file size.

Thanks for all the suggestions.
Updated the javadoc based on your input.
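
To make the role of the compression factor concrete, here is a minimal sketch of how such a factor could scale the expected number of output files. It assumes the estimate is simply the input size multiplied by the factor and divided by the target file size, rounded up. The class and method names are hypothetical, and the actual planner may apply additional rounding rules that are not reproduced here.

```java
// Hypothetical sketch: a simplified estimate, not the planner's actual calculation.
class CompressionFactorSketch {
  /**
   * Estimates the number of output files for a rewrite group.
   *
   * @param inputSizeBytes total on-disk size of the input files
   * @param targetFileSizeBytes desired size of each output file
   * @param compressionFactor greater than 1.0 plans for more files, less than 1.0 for fewer
   */
  static int expectedOutputFiles(
      long inputSizeBytes, long targetFileSizeBytes, double compressionFactor) {
    if (targetFileSizeBytes <= 0 || compressionFactor <= 0) {
      throw new IllegalArgumentException("Target size and compression factor must be positive");
    }

    // Scale the on-disk size to the size we expect after the shuffle and sort.
    long estimatedSize = (long) (inputSizeBytes * compressionFactor);

    // Ceiling division, with at least one output file.
    long files = (estimatedSize + targetFileSizeBytes - 1) / targetFileSizeBytes;
    return (int) Math.max(1, files);
  }
}
```

For example, with 1000 MiB of input and a 512 MiB target, a factor of 1.0 yields 2 files, a factor of 0.5 yields 1 file, and a factor of 2.0 yields 4 files.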

manuzhang · 2025-04-28T03:18:10Z

I added this to the 1.10.0 milestone, since we are targeting the removal of deprecated APIs in 1.10.0.

...v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackFileRewriteRunner.java

RussellSpitzer · 2025-04-28T16:43:08Z

...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java

                    catalogName, tableIdent))
        .isInstanceOf(IllegalArgumentException.class)
-        .hasMessage("Must use only one rewriter type (bin-pack, sort, zorder)");
+        .hasMessageStartingWith("Rewriter type already set to ");


I know this was not correct before, but can we switch the error message here to "Cannot set rewriter, it has already been set to "?

Changed to "Cannot set rewrite mode, it has already been set to ". I think this is a bit better, but I'm OK with reverting to your version if you don't agree.

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java

stevenzwu

LGTM overall. Added a couple of minor comments.

pvary · 2025-04-30T14:37:44Z

Merged to main.
Thanks @manuzhang, @danielcweeks, @RussellSpitzer, @stevenzwu for the reviews!

github-actions bot added the spark label Mar 30, 2025

pvary force-pushed the spark_rewrite_planner branch from f3dc0d9 to 84b780d Compare March 31, 2025 06:51

This was referenced Apr 3, 2025

Spark: when doing rewrite_data_files, check for partitioning schema compatibility #12651

Closed

Spark: support rewrite on specified target branch #12257

Closed

pvary force-pushed the spark_rewrite_planner branch 2 times, most recently from fcf87ed to 5b5fc19 Compare April 7, 2025 12:20

Spark: Update RewriteDataFilesSparkAction and RewritePositionDeleteFilesSparkAction to use the new APIs

0f40605

pvary force-pushed the spark_rewrite_planner branch from 5b5fc19 to 0f40605 Compare April 7, 2025 12:33

RussellSpitzer reviewed Apr 11, 2025

Russell's comments

ecf46f7

RussellSpitzer reviewed Apr 11, 2025

Comment clarifications

de0d483

manuzhang added this to the Iceberg 1.10.0 milestone Apr 28, 2025

RussellSpitzer reviewed Apr 28, 2025

RussellSpitzer approved these changes Apr 28, 2025

manuzhang mentioned this pull request Apr 29, 2025

AWS, Core, Flink, Parquet: Remove deprecations for 1.10.0 #12909

Merged

pvary mentioned this pull request Apr 29, 2025

[Core] Add max files rewrite option for RewriteAction #12824

Merged

stevenzwu reviewed Apr 29, 2025

stevenzwu approved these changes Apr 30, 2025

Russell's and Steven's comments

a01360d

pvary merged commit bcda1a3 into apache:main Apr 30, 2025
27 checks passed

gaborkaszab mentioned this pull request May 6, 2025

Spark 3.4: Update RewriteDataFilesSparkAction and RewritePositionDeleteFilesSparkAction to use the new APIs #12980

Merged
