Core, Spark 3.3: Add FileRewriter API #7175
Conversation
Force-pushed: 35d951f to 1333bc7
 * @param <T> the Java type of tasks to read content files
 * @param <F> the Java type of content files
 */
public interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>> {
I created a separate hierarchy because I did not follow the existing API exactly:
- I replaced `RewriteStrategy options(map)` with `void init(map)`. To support proper method chaining with inheritance, we would need to parameterize the strategy and rewriter with `ThisT`, like we do in `Scan`, for instance. That was not done, and method chaining was partially broken in the existing strategies. To simplify the API, I went with the approach we use in catalogs and `FileIO` and added `void init(map)` instead.
- I did not make the new interface serializable because our strategies are not used in a distributed fashion, and they cannot be serialized in their current form anyway (they have non-serializable fields). If we want to support that functionality in the future, we can add it later. We should either mark rewriters serializable and properly support that, or not do it at all.
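A minimal usage sketch of the init-based pattern (the `createRewriter()` and `planGroup()` calls are placeholders, not part of this PR):

```java
// Hypothetical usage: configure a rewriter once via init(Map), the way
// catalogs and FileIO are initialized, instead of chaining options(Map).
Map<String, String> options =
    ImmutableMap.of("target-file-size-bytes", "536870912"); // 512 MB target

FileRewriter<FileScanTask, DataFile> rewriter = createRewriter(); // placeholder factory
rewriter.init(options); // validates supported options and sets internal state

List<FileScanTask> group = planGroup(); // placeholder: one group of tasks to compact
Set<DataFile> newFiles = rewriter.rewrite(group);
```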
long defaultMax = (long) (target * MAX_FILE_SIZE_DEFAULT_RATIO);
long max = propertyAsLong(options, MAX_FILE_SIZE_BYTES, defaultMax);

checkArgument(target > 0, "'%s' is set to %s but must be > 0", TARGET_FILE_SIZE_BYTES, target);
I switched the negation in error messages. I believe it is easier to understand an error message that states what the value should look like, rather than what it can't be.
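For example (the "before" message is paraphrased for illustration, not quoted from the old code):

```java
// Before (paraphrased): states what the value cannot be.
checkArgument(target > 0, "'%s' cannot be <= 0, found %s", TARGET_FILE_SIZE_BYTES, target);

// After (from this PR): states what the value must be.
checkArgument(target > 0, "'%s' is set to %s but must be > 0", TARGET_FILE_SIZE_BYTES, target);
```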
}

@Override
public Set<DataFile> rewrite(List<FileScanTask> group) {
I added this class because the rewrite lifecycle is the same in all three compaction strategies, so I put all of the common logic here.
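A rough sketch of the kind of shared lifecycle such a base class can hold (the class name and all helper methods below are illustrative, not the actual API from this PR):

```java
import java.util.List;
import java.util.Set;
import java.util.UUID;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;

// Illustrative base class: the rewrite lifecycle is shared, while subclasses
// implement only the strategy-specific write step (bin-pack, sort, z-order).
public abstract class ExampleDataRewriter implements FileRewriter<FileScanTask, DataFile> {

  @Override
  public Set<DataFile> rewrite(List<FileScanTask> group) {
    String groupId = UUID.randomUUID().toString();
    try {
      stageTasks(groupId, group);    // illustrative: register the group's tasks
      doRewrite(groupId, group);     // strategy-specific write
      return fetchNewFiles(groupId); // illustrative: collect the written files
    } finally {
      cleanAll(groupId);             // illustrative: always release staged state
    }
  }

  protected abstract void doRewrite(String groupId, List<FileScanTask> group);

  protected abstract void stageTasks(String groupId, List<FileScanTask> group);

  protected abstract Set<DataFile> fetchNewFiles(String groupId);

  protected abstract void cleanAll(String groupId);
}
```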
 * @deprecated since 1.3.0, will be removed in 1.4.0; use {@link FileRewriter} instead.
 */
@Deprecated
public interface RewriteStrategy extends Serializable {
I decided not to migrate our existing Spark strategies because they were public. Also, we now have more information about all types of rewrites, so we can structure the code a bit differently.
Force-pushed: 1333bc7 to 809dd9b
This is ready for review; I am still working on tests.
Force-pushed: 809dd9b to 7c1d222
szehon-ho left a comment:
Looks mostly good to me, have some small comments.
 * Returns a set of supported options for this rewriter. This is an allowed-list and any options
 * not specified here will be rejected at runtime.
 *
 * @return returns a set of supported options
Nit: extra "returns"
Is the @return tag redundant? (comparing with other javadoc comments)
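For reference, a hedged sketch of how such an allowed-list might be enforced during init (the helper below is illustrative, not the exact implementation):

```java
// Illustrative: reject any option key that is not in validOptions().
private void validateOptions(Map<String, String> options) {
  Set<String> invalidKeys = Sets.newHashSet(options.keySet());
  invalidKeys.removeAll(validOptions());
  Preconditions.checkArgument(
      invalidKeys.isEmpty(),
      "Cannot use options %s, they are not supported by the rewriter",
      invalidKeys);
}
```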
return Iterables.filter(tasks, task -> hasSuboptimalSize(task) || hasTooManyDeletes(task));
}

private boolean hasTooManyDeletes(FileScanTask task) {
Remove 'has' to make it just 'tooManyDeletes'?
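For context, a hedged sketch of what this predicate checks (assuming `deleteFileThreshold` holds the resolved delete-file-threshold option):

```java
// Illustrative: a task qualifies for rewriting once it references at least
// deleteFileThreshold delete files.
private boolean hasTooManyDeletes(FileScanTask task) {
  return task.deletes() != null && task.deletes().size() >= deleteFileThreshold;
}
```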
public static final String TARGET_FILE_SIZE_BYTES = "target-file-size-bytes";

/**
 * Adjusts files which will be considered for rewriting. Files smaller than this value will be
The word 'adjusts' seems strange here (the file itself is not changed?).
Also, 'functions independently' seems unclear. Can we clarify, e.g.:
Any file with a size under this threshold will be rewritten, regardless of ...
Also, one thought: since we mention "regardless of MAX_FILE_SIZE_BYTES" here, does it make sense to just say "regardless of any other criteria"? There is also the question of whether we need to check tooManyDeletes as well.
public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75;

/**
 * Adjusts files which will be considered for rewriting. Files larger than this value will be
Same comment as above
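For context, a hedged sketch of how the min/max thresholds and their defaults fit together (the value of the max ratio constant is not shown in this diff):

```java
// Defaults are derived from the target size, per the constants in this diff.
long defaultMin = (long) (target * MIN_FILE_SIZE_DEFAULT_RATIO); // 0.75 * target
long defaultMax = (long) (target * MAX_FILE_SIZE_DEFAULT_RATIO);

// Illustrative: a file becomes a rewrite candidate when its size falls
// outside the [minFileSize, maxFileSize] band around the target.
boolean suboptimalSize = task.length() < minFileSize || task.length() > maxFileSize;
```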
 * groups. This option controls the largest amount of data that should be rewritten in a single
 * group. It helps with breaking down the rewriting of very large partitions which may not be
 * rewritable otherwise due to the resource constraints of the cluster. For example, a sort-based
 * rewrite may not scale to TB sized partitions, those partitions need to be worked on in small
Nit: missing and
TB-sized partitions, and those partitions
 * rewrite may not scale to TB sized partitions, those partitions need to be worked on in small
 * subsections to avoid exhaustion of resources.
 *
 * <p>When grouping files, the file rewriter will use this value to limit the files which will be
Same here, I feel this context is more useful at the class level.
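A hedged sketch of size-limited grouping using Iceberg's BinPacking utility (the exact wiring in this PR may differ):

```java
// Illustrative: pack candidate tasks into groups of at most maxGroupSize
// bytes, so one very large partition is rewritten as several smaller groups.
private Iterable<List<FileScanTask>> planGroups(List<FileScanTask> tasks, long maxGroupSize) {
  BinPacking.ListPacker<FileScanTask> packer =
      new BinPacking.ListPacker<>(maxGroupSize, 1, false);
  return packer.pack(tasks, FileScanTask::length);
}
```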
/**
 * Estimates a larger max target file size than the target size used in task creation to avoid
 * tasks which are predicted to have a certain size, but exceed that target size when serde is
 * complete creating tiny remainder files.
Hard to read, a comma may help:
"when serde is complete, creating tiny remainder files"
Also I realize this explanation is just repeated on the below paragraph. Can't this be simpler and just be:
"Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files."
 * that the actual data will end up being larger than our target size due to various factors of
 * compression, serialization and other factors outside our control. If this occurs, instead of
 * making a single file that is close in size to our target, we would end up producing one file of
 * the target size, and then a small extra file with the remaining data. For example, if our
Suggest putting "For example" in a new paragraph.
 * making a single file that is close in size to our target, we would end up producing one file of
 * the target size, and then a small extra file with the remaining data. For example, if our
 * target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data
 * we may find we actually have to write out 530 MB. If we use the target size while writing we
nit: comma before we
"while writing, we..."
I added tests and addressed some comments; I'll address the rest by the end of today.
return group.size() > 1 && group.size() >= minInputFiles;
}

protected boolean enoughData(List<T> group) {
Should the name be more generic (data => rows), so there is no confusion with data/delete files?
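A hedged sketch of how such predicates might gate a group rewrite (which conditions are combined, and how, is illustrative):

```java
// Illustrative: rewrite a group if it has enough files to compact, or
// enough total content to produce at least one full target-size file.
private boolean shouldRewrite(List<FileScanTask> group) {
  boolean enoughInputFiles = group.size() > 1 && group.size() >= minInputFiles;
  long totalBytes = group.stream().mapToLong(FileScanTask::length).sum();
  boolean enoughData = totalBytes > targetFileSize;
  return enoughInputFiles || enoughData;
}
```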
/**
 * A class for rewriting content files.
 *
 * <p>The entire rewrite operation is broken down into pieces based on partitioning, and size-based
@szehon-ho, moved some of those detailed comments here instead of per option.
public static final String TARGET_FILE_SIZE_BYTES = "target-file-size-bytes";

/**
 * Controls which files will be considered for rewriting. Files with sizes under this threshold
@szehon-ho, changed docs for many of these options too. Another look would be appreciated!
@szehon-ho, this one is ready for another round.
szehon-ho left a comment:
Looks great to me, thanks for the changes!
Thanks a lot for reviewing, @szehon-ho!
int value =
    PropertyUtil.propertyAsInt(options, DELETE_FILE_THRESHOLD, DELETE_FILE_THRESHOLD_DEFAULT);
Preconditions.checkArgument(
    value >= 0, "'%s' is set to %s but must be >= 0", DELETE_FILE_THRESHOLD, value);
@aokolnychyi Why do we allow the delete-file-threshold to be 0 here? Is there a meaningful use case?
This PR refactors our compaction code so that it can be used for position delete file rewrites.
In particular, the following interfaces/classes have been added:

- `FileRewriter` - a generic API for rewriting content files
- `SizeBasedFileRewriter` - a common rewriter for content files, primarily based on file size
- `SizeBasedDataRewriter` - a common data rewriter
- `SparkSizeBasedDataRewriter` - a Spark data rewriter that stages tasks and uses a commit coordinator
- `SparkBinPackDataRewriter` - a Spark data rewriter that uses bin-packing
- `SparkSortDataRewriter` - a Spark data rewriter that shuffles and sorts data
- `SparkZOrderDataRewriter` - a Spark data rewriter that shuffles and z-orders data

The new API has the same behavior as the existing rewrite strategies.
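A rough sketch of how these pieces relate (generic parameters are simplified and the method list is an approximation, not a specification):

```
FileRewriter<T, F>                      core API: validOptions, init, rewrite
  └─ SizeBasedFileRewriter<T, F>        size-based candidate selection and grouping
       └─ SizeBasedDataRewriter         data-file specifics (e.g. delete file threshold)
            └─ SparkSizeBasedDataRewriter   stages tasks, uses the commit coordinator
                 ├─ SparkBinPackDataRewriter
                 ├─ SparkSortDataRewriter
                 └─ SparkZOrderDataRewriter
```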
FileRewriter- a generic API for rewriting content filesSizeBasedFileRewriter- a common rewriter for content files primary based on file sizeSizeBasedDataRewriter- a common data rewriterSparkSizeBasedDataRewriter- a Spark data rewriter that stages tasks and uses a commit coordinatorSparkBinPackDataRewriter- a Spark data rewriter that uses bin-packingSparkSortDataRewriter- a Spark data rewriter that shuffles and sorts dataSparkZOrderDataRewriter- a Spark data rewriter that shuffles and zorders dataThe new API has the same behavior as the existing rewrite strategies.