Core: FileRewritePlanner implementation #12493
core/src/main/java/org/apache/iceberg/actions/RewriteFileGroupPlanner.java
    RewriteJobOrder.fromName(
        PropertyUtil.propertyAsString(
            options,
            RewriteDataFiles.REWRITE_JOB_ORDER,
Do these properties belong here now?
The plan currently returns the jobs in order in a CloseableIterable.
We could push the ordering to the user, but this seems like something we can reuse between Spark and Flink, so I decided to keep it here.
What are your thoughts?
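To illustrate what "keeping the ordering in the planner" could look like, here is a minimal standalone sketch. It does not use Iceberg classes: `JobOrderSketch`, `Order`, and `Group` are hypothetical stand-ins, and the enum constants are assumptions made only for this example. The point is that the sort runs once on the planned groups, so any engine consuming the plan sees the same order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for planner-side job ordering; not Iceberg code. */
public class JobOrderSketch {

  /** Simplified ordering options; the constants are assumptions for this sketch. */
  enum Order { NONE, BYTES_ASC, BYTES_DESC, FILES_ASC, FILES_DESC }

  /** Minimal file group carrying only the figures the ordering needs. */
  static final class Group {
    final String name;
    final long bytes;
    final int files;

    Group(String name, long bytes, int files) {
      this.name = name;
      this.bytes = bytes;
      this.files = files;
    }
  }

  /** Returns a sorted copy of the planned groups; NONE keeps the planning order. */
  static List<Group> sort(List<Group> groups, Order order) {
    List<Group> result = new ArrayList<>(groups);
    Comparator<Group> byBytes = Comparator.comparingLong(g -> g.bytes);
    Comparator<Group> byFiles = Comparator.comparingInt(g -> g.files);
    switch (order) {
      case BYTES_ASC:
        result.sort(byBytes);
        break;
      case BYTES_DESC:
        result.sort(byBytes.reversed());
        break;
      case FILES_ASC:
        result.sort(byFiles);
        break;
      case FILES_DESC:
        result.sort(byFiles.reversed());
        break;
      default:
        break; // NONE: leave the groups as planned
    }
    return result;
  }

  /** Helper that extracts the group names, in order. */
  static List<String> names(List<Group> groups) {
    List<String> names = new ArrayList<>();
    for (Group group : groups) {
      names.add(group.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Group> planned =
        List.of(new Group("a", 300, 2), new Group("b", 100, 5), new Group("c", 200, 1));
    System.out.println(names(sort(planned, Order.BYTES_DESC)));
    System.out.println(names(sort(planned, Order.FILES_ASC)));
  }
}
```

With the sample groups, ordering by bytes descending yields a, c, b, and ordering by file count ascending yields c, a, b. An engine-specific action would only iterate the result.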
Thanks @RussellSpitzer for the review!
     * These will be grouped by partitions based on their size using fix sized bins. Extends the {@link
     * SizeBasedFileRewritePlanner} with {@link RewritePositionDeleteFiles#REWRITE_JOB_ORDER} handling.
     */
    public class RewritePositionDeletesGroupPlanner
RewritePositionDeletesPlanner is probably a more accurate name.
Renamed to BinPackRewritePositionDeletePlanner
core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFileGroupPlanner.java
     *
     * <p>Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
     */
    public static final String DELETE_FILE_THRESHOLD = "delete-file-threshold";
I know this is refactored from existing code, but this config name doesn't seem the most intuitive. delete-file-threshold could give the impression of a delete file size threshold (in bytes). I think it meant deleted-rows-threshold. But it is probably too late to change it, for compatibility reasons.
I agree with you about the name, and I also agree that we don't want to touch this in this PR (and probably not later either).
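For readers following the naming discussion, here is a minimal sketch of how a count-based threshold with an Integer.MAX_VALUE default behaves. It assumes the threshold is compared against the number of delete files attached to a data file; that reading, the class `DeleteThresholdSketch`, and the method `tooManyDeletes` are assumptions for illustration, not code from this PR.

```java
import java.util.List;

/** Hypothetical illustration of a count-based delete threshold; not Iceberg code. */
public class DeleteThresholdSketch {

  // Default taken from the Javadoc quoted above: effectively disables the check.
  static final int DEFAULT_DELETE_FILE_THRESHOLD = Integer.MAX_VALUE;

  /**
   * Assumption: a data file qualifies for rewrite once the number of delete files
   * attached to it reaches the threshold. File sizes in bytes play no role here.
   */
  static boolean tooManyDeletes(List<String> deleteFiles, int threshold) {
    return deleteFiles != null && deleteFiles.size() >= threshold;
  }

  public static void main(String[] args) {
    List<String> deletes = List.of("delete-1.parquet", "delete-2.parquet");
    // With the default, no realistic number of delete files triggers a rewrite.
    System.out.println(tooManyDeletes(deletes, DEFAULT_DELETE_FILE_THRESHOLD));
    // With an explicit threshold of 2, the same file is selected.
    System.out.println(tooManyDeletes(deletes, 2));
  }
}
```

Under this assumption the value is a count rather than a byte size, which is the source of the confusion raised in the comment above.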
            PositionDeletesScanTask.class::cast);
      }

      private StructLikeMap<List<PositionDeletesScanTask>> groupByPartition(
Some of the methods in this class look similar to the code in BinPackRewriteFilePlanner. Can the code be pushed to the base class, e.g. plan(), most of planFileGroups, etc.?
In this specific method, I guess PartitionUtil.coercePartition is probably the difference.
I had the same feeling, and I tried it several times, but there are nuances that make them different and hard to generalize:
- The input tasks are different
- The partition handling is different (data file compaction handles partition evolution)
- The file filtering logic is different
- The group filtering logic is different
- Result groups are different
I see the partition handling as the main issue.
When compacting data files we have this additional logic:
// If a task uses an incompatible partition spec the data inside could contain values
// which belong to multiple partitions in the current spec. Treating all such files as
// un-partitioned and grouping them together helps to minimize new files made.
Compacting position delete files, by contrast, just uses coercePartition to update the task partition to the current one.
Based on these, I decided against the generalization.
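The partition handling difference can be sketched with a simplified standalone model. Everything here is hypothetical: `Task`, `groupDataFiles`, `groupPositionDeletes`, and `coerce` are stand-ins, partitions are modelled as plain maps, and `coerce` only assumes that coercion keeps the fields of the current partition type. It is not the behavior of PartitionUtil.coercePartition in detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the two grouping strategies; not Iceberg code. */
public class PartitionGroupingSketch {

  /** Minimal task: file name, the spec it was written with, and its partition values. */
  static final class Task {
    final String file;
    final int specId;
    final Map<String, String> partition;

    Task(String file, int specId, Map<String, String> partition) {
      this.file = file;
      this.specId = specId;
      this.partition = partition;
    }
  }

  /** Data files: tasks written with a different spec share one un-partitioned (empty) key. */
  static Map<Map<String, String>, List<String>> groupDataFiles(
      List<Task> tasks, int currentSpecId) {
    Map<Map<String, String>, List<String>> groups = new LinkedHashMap<>();
    for (Task task : tasks) {
      Map<String, String> key = task.specId == currentSpecId ? task.partition : Map.of();
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(task.file);
    }
    return groups;
  }

  /** Assumed stand-in for coercion: keep only the fields of the current partition type. */
  static Map<String, String> coerce(Map<String, String> partition, List<String> currentFields) {
    Map<String, String> coerced = new LinkedHashMap<>();
    for (String field : currentFields) {
      coerced.put(field, partition.get(field));
    }
    return coerced;
  }

  /** Position deletes: every task keeps a partition key, coerced to the current fields. */
  static Map<Map<String, String>, List<String>> groupPositionDeletes(
      List<Task> tasks, List<String> currentFields) {
    Map<Map<String, String>, List<String>> groups = new LinkedHashMap<>();
    for (Task task : tasks) {
      Map<String, String> key = coerce(task.partition, currentFields);
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(task.file);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Task> tasks =
        List.of(
            new Task("f1", 1, Map.of("region", "eu")),
            new Task("f2", 0, Map.of("region", "eu", "day", "01")),
            new Task("f3", 0, Map.of("region", "us", "day", "01")));
    System.out.println(groupDataFiles(tasks, 1));
    System.out.println(groupPositionDeletes(tasks, List.of("region")));
  }
}
```

In this model f2 and f3, written with the older spec, end up together under the empty key in the data file case, while in the position delete case f2 joins f1 under region=eu and f3 gets its own region=us group. The two strategies produce differently shaped groups from the same input, which is the nuance that makes a shared base implementation awkward.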
I haven't had time to check the rest of this, but I trust @stevenzwu's review so please go on without me.
Merged to main.
This PR creates the core implementations for the Planners defined by #12306.
The PR deprecates the old *Rewriter implementations and adds tests for the Planners.
This is part of the #11513 refactor.