Skip to content

Core: Interface changes for separating rewrite planner and runner#12306

Merged
stevenzwu merged 4 commits intoapache:mainfrom
pvary:rewrite_api
Feb 27, 2025
Merged

Core: Interface changes for separating rewrite planner and runner#12306
stevenzwu merged 4 commits intoapache:mainfrom
pvary:rewrite_api

Conversation

@pvary
Copy link
Contributor

@pvary pvary commented Feb 18, 2025

Compaction interface changes which separates the planning and the execution phase of the compaction.
This enables moving the common planning part of the compaction from Spark code to the core, and Flink (or another execution engines) could implement their own rewrite executor to do the actual compaction.
See discussion on: https://lists.apache.org/thread/6lj2jn3dbvqjscc96w0mc32bhxq0qfqv

@github-actions github-actions bot added the core label Feb 18, 2025
@pvary pvary force-pushed the rewrite_api branch 2 times, most recently from ced2bee to 33827d1 Compare February 18, 2025 10:41
* @param <T> the Java type of the tasks to read the files which are rewritten
* @param <F> the Java type of the content files which are rewritten
*/
public abstract class FileRewriteGroup<I, T extends ContentScanTask<F>, F extends ContentFile<F>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is confusing to introduce FileRewriteGroup name with another existing class named RewriteFileGroup.

We can probably call this BaseFileGroup or BaseRewriteFileGroup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names staring with Base are reserved for the immutable interfaces for the actions themselves.

I would like to keep the FileRewriteGroup/FileRewritePlan/FileRewritePlanner/FileRewriteRunner similar and/or grouped together.

Maybe create a new package (org.apache.iceberg.actions.rewrite)? Maybe a name starting with the same prefix (RewriteGroup/RewritePlan/RewritePlanner/RewriteRunner)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names staring with Base are reserved for the immutable interfaces for the actions themselves.

Is Base* reserved only for actions in that package? Personally, I don't see an issue of adding classes like BaseRewriteFileGroup, BaseRewriteFilePlan etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only action base classes are there with name starting with base.

RewriteFilePlan will be the final class. I don't plan to add different plans for different compaction methods. They will only differ in the RewriteFileGroups.

What about the planner/runner interfaces? They are definitely not base

* @param <G> the Java type of the planned groups
*/
public interface FileRewriteExecutor<
FGI,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, it seems that {@link RewriteDataFiles.FileGroupInfo} and {@link RewritePositionDeleteFiles.FileGroupInfo} interfaces are identical. wondering if we should start without generics for FGI? it can affect future flexibility though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change this, because the info objects are coming from the immutable base infrastructure for the actions. I decided against touching those to limit the scope of the PR

/**
* Container class representing a set of files to be rewritten by a {@link FileRewriteExecutor}.
*
* @param <I> the Java type of the plan info member like {@link RewriteDataFiles.FileGroupInfo} or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency with the class above, should this generic type also be called FGI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checkstyle module prevents using FGI for classes:

        <module name="ClassTypeParameterName"> <!-- Java Style Guide: Type variable names -->
            <property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
        </module>

We don't have similar for interfaces, like:

        <module name="InterfaceTypeParameterName"> <!-- Java Style Guide: Type variable names -->
            <property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
        </module>

Since it was not highlighted by checkstyle, I forgot to update the interfaces. I will move from FGI to I as well.

}

/** Accumulated size for the input files. */
public long sizeInBytes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputFilesSizeInBytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method exists in the current API as RewriteFileGroup.sizeInBytes() and RewritePositionDeletesGroup.rewrittenBytes() and the new method name will be inherited from the FileRewriteGroup

Minimally one of them needs to be deprecated, renamed. Do you think we should change both?

@pvary pvary force-pushed the rewrite_api branch 3 times, most recently from 43c75d0 to a70ee75 Compare February 19, 2025 10:50
@stevenzwu
Copy link
Contributor

@pvary regarding the inheritance, I am thinking it might still be valuable to rename/clarify them in the new base class. We can still keep and deprecate the existing methods and just forward the calls to the base class. some of the method names aren't as clear, like info, numFiles. We can use this refactoring opportunity to improve them.

@pvary
Copy link
Contributor Author

pvary commented Feb 19, 2025

@pvary regarding the inheritance, I am thinking it might still be valuable to rename/clarify them in the new base class. We can still keep and deprecate the existing methods and just forward the calls to the base class. some of the method names aren't as clear, like info, numFiles. We can use this refactoring opportunity to improve them.

Ok. Will do. I agree but wanted to start from the minimal changeset.

@pvary pvary changed the title Core: Interface changes for separating rewrite planner and executor Core: Interface changes for separating rewrite planner and runner Feb 19, 2025
Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/**
* Result of the file rewrite planning as generated by the {@link FileRewritePlanner#plan()}.
*
* <p>The plan contains an iterable for the planned groups and statistics about the number of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc here is a bit confusing to me. I think it has a bit too many implementation details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably something like

a Container class holding a the output of a FileRewritePlanner.plan call. Contains a list of "RewriteFileGroups" see ... link which are used by engines to rewrite a particular set of files. ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

import org.apache.iceberg.ContentScanTask;

/**
* A class for planning content file rewrites.
Copy link
Member

@RussellSpitzer RussellSpitzer Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Planning needs more defining here imho.

The notes below are not clear to me. They seem to be for the entire Operation. I think if we were going to write that there it should be something like

A Rewrite Operation is performed by several classes working in conjunction.

  1. A FileRewritePlanner - (this class) which scans the files in the table and determines which files should be rewritten and how they should be grouped. This produces a FileRewritePlan.
  2. FileRewritePlan - This immutable output of the planner contains a set of Groups, each containing a set of files that are ment to be compacted together by the FileRewriteRunner.
  3. The FileRewrite Runner is the engine specific implementation that can take a FileRewritePlan and perform a rewrite on each of the groups within. This is the actual implementation of the rewrite which generates new files.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, please take a look

}

/**
* While we create tasks that should all be smaller than our target size, there is a chance that
Copy link
Member

@RussellSpitzer RussellSpitzer Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc is specific to the planner not the Group

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to highlight the relevant part for the runners. Please take a look.

}

/**
* Determines the reader split size as the input size divided by the desired number of output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also specific to the planner. I'm also not sure if split size is a good name here.

Copy link
Member

@RussellSpitzer RussellSpitzer Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be something like

The amount of bytes of data the FileRewriteRunner should pull from a single group in a single read task (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded. Please take a look.

@pvary pvary mentioned this pull request Feb 26, 2025
6 tasks
@RussellSpitzer
Copy link
Member

Ok I'm on board, I think the interfaces are good and Java Doc is solid for now, if @stevenzwu is good with the changes I say we merge and start the rest of the refactor.

@stevenzwu
Copy link
Contributor

yes, current state looks good to me. let me merge it.

@stevenzwu stevenzwu merged commit a50ec92 into apache:main Feb 27, 2025
42 checks passed
@pvary
Copy link
Contributor Author

pvary commented Feb 27, 2025

Thanks @stevenzwu and @RussellSpitzer!
I will be OOO next week, so the next PR will come after that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants