Spark: Add option to introduce ordering of RewriteFileGroup #4377
RussellSpitzer
left a comment
Left some notes on enum usage; we can remove some of the if statements by
using built-in functions and switch statements.
Tests can probably be simplified a bit to just check the task ordering.
Other than that, I think we are generally good to go.
@RussellSpitzer PR is ready for another round of review.
Thanks for the patch, @rajarshisarkar. Let's do a quick follow-up to move the changes into 3.0 and 3.1.
) Allows users to specify the order in which Rewrite JobGroups should be executed. (cherry picked from commit d1476c6)
This PR introduces ordering of `RewriteFileGroup`. This should be helpful in cases where we want to force the compaction order according to the size (in bytes) or number of files.

Introduce a new Spark option `rewrite.job-order = {"bytes-asc", "bytes-desc", "files-asc", "files-desc", "none"}`.

- `rewrite.job-order=bytes-asc`: rewrite the smallest job groups first.
- `rewrite.job-order=bytes-desc`: rewrite the largest job groups first.
- `rewrite.job-order=files-asc`: rewrite the job groups with the fewest files first.
- `rewrite.job-order=files-desc`: rewrite the job groups with the most files first.
- `rewrite.job-order=none`: rewrite job groups in the order they were planned (no specific ordering).

cc: @RussellSpitzer
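To make the five option values concrete, here is a minimal, self-contained sketch of how such an ordering could be modeled. It is not the code from this PR: `RewriteJobOrderSketch`, `Group`, `JobOrder`, `fromOptionValue`, and `orderedNames` are hypothetical names, and `Group` is a stand-in that carries only the two sort keys (total bytes and file count). The sketch maps each option string to an enum constant and picks a comparator in a switch, in the spirit of the review note about enums and switch statements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not Iceberg's API.
final class RewriteJobOrderSketch {

  /** Hypothetical stand-in for a planned file group: just the two sort keys. */
  static final class Group {
    final String name;
    final long sizeInBytes;
    final int numFiles;

    Group(String name, long sizeInBytes, int numFiles) {
      this.name = name;
      this.sizeInBytes = sizeInBytes;
      this.numFiles = numFiles;
    }
  }

  /** One constant per value of the rewrite.job-order option described above. */
  enum JobOrder {
    BYTES_ASC("bytes-asc"),
    BYTES_DESC("bytes-desc"),
    FILES_ASC("files-asc"),
    FILES_DESC("files-desc"),
    NONE("none");

    private final String optionValue;

    JobOrder(String optionValue) {
      this.optionValue = optionValue;
    }

    /** Resolves an option string to a constant, rejecting unknown values. */
    static JobOrder fromOptionValue(String value) {
      for (JobOrder order : values()) {
        if (order.optionValue.equalsIgnoreCase(value)) {
          return order;
        }
      }
      throw new IllegalArgumentException("Invalid rewrite.job-order value: " + value);
    }

    /** Comparator that puts the group to rewrite first at the head of the list. */
    Comparator<Group> comparator() {
      switch (this) {
        case BYTES_ASC:
          return Comparator.comparingLong((Group g) -> g.sizeInBytes);
        case BYTES_DESC:
          return Comparator.comparingLong((Group g) -> g.sizeInBytes).reversed();
        case FILES_ASC:
          return Comparator.comparingInt((Group g) -> g.numFiles);
        case FILES_DESC:
          return Comparator.comparingInt((Group g) -> g.numFiles).reversed();
        default:
          // NONE: every pair compares equal, so a stable sort keeps planned order.
          return (a, b) -> 0;
      }
    }
  }

  /** Returns group names in execution order without mutating the planned list. */
  static List<String> orderedNames(List<Group> planned, String optionValue) {
    List<Group> sorted = new ArrayList<>(planned);
    sorted.sort(JobOrder.fromOptionValue(optionValue).comparator()); // List.sort is stable
    List<String> names = new ArrayList<>();
    for (Group group : sorted) {
      names.add(group.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Group> planned = Arrays.asList(
        new Group("a", 300L, 1),
        new Group("b", 100L, 5),
        new Group("c", 200L, 3));
    System.out.println(orderedNames(planned, "bytes-asc"));  // [b, c, a]
    System.out.println(orderedNames(planned, "files-desc")); // [b, c, a]
    System.out.println(orderedNames(planned, "none"));       // [a, b, c]
  }
}
```

Treating `none` as a comparator that considers all groups equal keeps a single code path for every option value; it relies on the sort being stable, which `List.sort` guarantees.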