Core: Interface changes for separating rewrite planner and runner#12306
Core: Interface changes for separating rewrite planner and runner#12306stevenzwu merged 4 commits intoapache:mainfrom
Conversation
ced2bee to
33827d1
Compare
| * @param <T> the Java type of the tasks to read the files which are rewritten | ||
| * @param <F> the Java type of the content files which are rewritten | ||
| */ | ||
| public abstract class FileRewriteGroup<I, T extends ContentScanTask<F>, F extends ContentFile<F>> { |
There was a problem hiding this comment.
it is confusing to introduce FileRewriteGroup name with another existing class named RewriteFileGroup.
We can probably call this BaseFileGroup or BaseRewriteFileGroup.
There was a problem hiding this comment.
Names staring with Base are reserved for the immutable interfaces for the actions themselves.
I would like to keep the FileRewriteGroup/FileRewritePlan/FileRewritePlanner/FileRewriteRunner similar and/or grouped together.
Maybe create a new package (org.apache.iceberg.actions.rewrite)? Maybe a name starting with the same prefix (RewriteGroup/RewritePlan/RewritePlanner/RewriteRunner)?
There was a problem hiding this comment.
Names staring with Base are reserved for the immutable interfaces for the actions themselves.
Is Base* reserved only for actions in that package? Personally, I don't see an issue of adding classes like BaseRewriteFileGroup, BaseRewriteFilePlan etc.
There was a problem hiding this comment.
Currently only action base classes are there with name starting with base.
RewriteFilePlan will be the final class. I don't plan to add different plans for different compaction methods. They will only differ in the RewriteFileGroups.
What about the planner/runner interfaces? They are definitely not base
core/src/main/java/org/apache/iceberg/actions/FileRewriteExecutor.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/actions/FileRewriteExecutor.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/actions/FileRewriteExecutor.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/actions/FileRewriteExecutor.java
Outdated
Show resolved
Hide resolved
| * @param <G> the Java type of the planned groups | ||
| */ | ||
| public interface FileRewriteExecutor< | ||
| FGI, |
There was a problem hiding this comment.
I didn't change this, because the info objects are coming from the immutable base infrastructure for the actions. I decided against touching those to limit the scope of the PR
| /** | ||
| * Container class representing a set of files to be rewritten by a {@link FileRewriteExecutor}. | ||
| * | ||
| * @param <I> the Java type of the plan info member like {@link RewriteDataFiles.FileGroupInfo} or |
There was a problem hiding this comment.
for consistency with the class above, should this generic type also be called FGI?
There was a problem hiding this comment.
This checkstyle module prevents using FGI for classes:
<module name="ClassTypeParameterName"> <!-- Java Style Guide: Type variable names -->
<property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
</module>
We don't have similar for interfaces, like:
<module name="InterfaceTypeParameterName"> <!-- Java Style Guide: Type variable names -->
<property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
</module>
Since it was not highlighted by checkstyle, I forgot to update the interfaces. I will move from FGI to I as well.
| } | ||
|
|
||
| /** Accumulated size for the input files. */ | ||
| public long sizeInBytes() { |
There was a problem hiding this comment.
This method exists in the current API as RewriteFileGroup.sizeInBytes() and RewritePositionDeletesGroup.rewrittenBytes() and the new method name will be inherited from the FileRewriteGroup
Minimally one of them needs to be deprecated, renamed. Do you think we should change both?
core/src/main/java/org/apache/iceberg/actions/FileRewriteGroup.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/actions/FileRewriteGroup.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/actions/FileRewritePlanner.java
Outdated
Show resolved
Hide resolved
43c75d0 to
a70ee75
Compare
|
@pvary regarding the inheritance, I am thinking it might still be valuable to rename/clarify them in the new base class. We can still keep and deprecate the existing methods and just forward the calls to the base class. some of the method names aren't as clear, like |
Ok. Will do. I agree but wanted to start from the minimal changeset. |
| /** | ||
| * Result of the file rewrite planning as generated by the {@link FileRewritePlanner#plan()}. | ||
| * | ||
| * <p>The plan contains an iterable for the planned groups and statistics about the number of the |
There was a problem hiding this comment.
The doc here is a bit confusing to me. I think it has a bit too many implementation details.
There was a problem hiding this comment.
Probably something like
a Container class holding a the output of a FileRewritePlanner.plan call. Contains a list of "RewriteFileGroups" see ... link which are used by engines to rewrite a particular set of files. ?
| import org.apache.iceberg.ContentScanTask; | ||
|
|
||
| /** | ||
| * A class for planning content file rewrites. |
There was a problem hiding this comment.
Planning needs more defining here imho.
The notes below are not clear to me. They seem to be for the entire Operation. I think if we were going to write that there it should be something like
A Rewrite Operation is performed by several classes working in conjunction.
- A FileRewritePlanner - (this class) which scans the files in the table and determines which files should be rewritten and how they should be grouped. This produces a FileRewritePlan.
- FileRewritePlan - This immutable output of the planner contains a set of Groups, each containing a set of files that are ment to be compacted together by the FileRewriteRunner.
- The FileRewrite Runner is the engine specific implementation that can take a FileRewritePlan and perform a rewrite on each of the groups within. This is the actual implementation of the rewrite which generates new files.
?
There was a problem hiding this comment.
Fixed, please take a look
core/src/main/java/org/apache/iceberg/actions/FileRewriteRunner.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| /** | ||
| * While we create tasks that should all be smaller than our target size, there is a chance that |
There was a problem hiding this comment.
This doc is specific to the planner not the Group
There was a problem hiding this comment.
Reworded to highlight the relevant part for the runners. Please take a look.
| } | ||
|
|
||
| /** | ||
| * Determines the reader split size as the input size divided by the desired number of output |
There was a problem hiding this comment.
Also specific to the planner. I'm also not sure if split size is a good name here.
There was a problem hiding this comment.
This should be something like
The amount of bytes of data the FileRewriteRunner should pull from a single group in a single read task (?)
There was a problem hiding this comment.
Reworded. Please take a look.
core/src/main/java/org/apache/iceberg/actions/RewriteGroupBase.java
Outdated
Show resolved
Hide resolved
|
Ok I'm on board, I think the interfaces are good and Java Doc is solid for now, if @stevenzwu is good with the changes I say we merge and start the rest of the refactor. |
|
yes, current state looks good to me. let me merge it. |
|
Thanks @stevenzwu and @RussellSpitzer! |
Compaction interface changes which separates the planning and the execution phase of the compaction.
This enables moving the common planning part of the compaction from Spark code to the core, and Flink (or another execution engines) could implement their own rewrite executor to do the actual compaction.
See discussion on: https://lists.apache.org/thread/6lj2jn3dbvqjscc96w0mc32bhxq0qfqv