Skip to content

Spark: Update RewriteDataFilesSparkAction and RewritePositionDeleteFilesSparkAction to use the new APIs#12692

Merged
pvary merged 4 commits intoapache:mainfrom
pvary:spark_rewrite_planner
Apr 30, 2025
Merged

Spark: Update RewriteDataFilesSparkAction and RewritePositionDeleteFilesSparkAction to use the new APIs#12692
pvary merged 4 commits intoapache:mainfrom
pvary:spark_rewrite_planner

Conversation

@pvary
Copy link
Contributor

@pvary pvary commented Mar 30, 2025

No description provided.

@github-actions github-actions bot added the spark label Mar 30, 2025
@pvary pvary force-pushed the spark_rewrite_planner branch from f3dc0d9 to 84b780d Compare March 31, 2025 06:51
assertThatThrownBy(() -> actions().rewriteDataFiles(table).binPack().sort())
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Must use only one rewriter type (bin-pack, sort, zorder)");
.hasMessage("Must use only one runner type (bin-pack, sort, zorder)");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this change.
The internal naming has changed, but not sure we have to expose this to the users

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't go with the "runner" term. That's not really a concept in Spark, so I think this just exposes the internals in a confusing way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use "strategy" in the user doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to keep this error message intact.

@pvary pvary force-pushed the spark_rewrite_planner branch from 5b5fc19 to 0f40605 Compare April 7, 2025 12:33
@pvary
Copy link
Contributor Author

pvary commented Apr 7, 2025

@danielcweeks, @manuzhang, @RussellSpitzer: I think this PR is ready for review now. Sorry for the force push. Instead of creating new files, I renamed the old ones so it is easier to review. This required me to force-push my changes. I will not do it again during the review 😄

Preconditions.checkArgument(
rewriter == null, "Must use only one rewriter type (bin-pack, sort, zorder)");
this.rewriter = new SparkBinPackDataRewriter(spark(), table);
runner == null, "Must use only one rewriter type (bin-pack, sort, zorder)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up, We need to clean up these messages. They probably should say, "Rewriter type already set to %s" or something like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a second request to fix the error message, and it is not too complicated. So changed.

The solution is a bit lame:

    Preconditions.checkArgument(
        runner == null,
        "Rewriter type already set to %s",
        runner == null ? null : runner.description());

We need a second null check for the error message, or a null check around it 😢
Decided to hide this ugliness in a method.
If you have better ideas, feel free to comment

import org.apache.spark.sql.SparkSession;

/**
* Common parent for data and positional delete rewrite runners.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear description here, I think this is mean to essentially replace the meet of the Action itself? Like it's a container for planner + rewriter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Action is using the 2 interfaces defined by the new API. Planner for grouping, Runner for executing the actual compaction.

This class is the base for the Spark based Runner implementations. Updated the javadoc

import org.apache.iceberg.util.PropertyUtil;

/**
* Extends the {@link BinPackRewriteFilePlanner} with the possibility to set the expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs more description here. Extends the BinPack rewriter for Rewriters that induce a distributed shuffle to reorganize data. (or something like this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is for the Planner. There was a specific configuration for the Shuffling rewriters which is modifying the plans. So I had to extend the core planners with the new configuration.

Alternatively we can move this functionality to the generic planner. In this case we don't need a specific class here. In this case the we have this config available for every planner (might be useful if there are compaction changes on any rewrite), but we have to reintroduce the flag shufflingPlanner (maybe with another name) to decide which Runner implementation to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just talking about the Java Doc. I have no problem with the code organization but the key detail about this planner is that it produces plans for Shuffling Rewrites, not that it has an additional option.

Copy link
Contributor

@stevenzwu stevenzwu Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this class adds an additional option of COMPRESSION_FACTOR used when calculating expectedOutputFiles (line 60). I guess shuffle/sort improves compression ratio and reduce file size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the suggestions.
Updated the javadoc based on your input.

@manuzhang manuzhang added this to the Iceberg 1.10.0 milestone Apr 28, 2025
@manuzhang
Copy link
Member

I add this to 1.10.0, since we are targeting removing deprecated APIs in 1.10.0

catalogName, tableIdent))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Must use only one rewriter type (bin-pack, sort, zorder)");
.hasMessageStartingWith("Rewriter type already set to ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was not correct before but can we switch the error message here to "Cannot set rewriter, it has already been set to "

Copy link
Contributor Author

@pvary pvary Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Cannot set rewrite mode, it has already been set to . I think this is a bit better, but ok with reverting to your version if you don't agree.

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. added a couple minor comments

@pvary pvary merged commit bcda1a3 into apache:main Apr 30, 2025
27 checks passed
@pvary
Copy link
Contributor Author

pvary commented Apr 30, 2025

Merged to main.
Thanks @manuzhang, @danielcweeks, @RussellSpitzer, @stevenzwu for the reviews!

anuragmantri added a commit to anuragmantri/iceberg that referenced this pull request Jul 25, 2025
…lesSparkAction to use the new APIs (apache#12692) (apache#1578)

Co-authored-by: pvary <peter.vary.apache@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants