Core: Fix split size calculations in file rewriters #9069
Merged
RussellSpitzer merged 1 commit into apache:main on Nov 16, 2023
Conversation
aokolnychyi commented Nov 14, 2023
Review comment on the changed lines:

public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT = 100L * 1024 * 1024 * 1024; // 100 GB
private static final long SPLIT_OVERHEAD = 5 * 1024;
aokolnychyi (Contributor, Author):
This split overhead did very little given that it is only 5 KB. Row group size discrepancies are usually a few megabytes. I will have a follow-up PR to make our split planning less sensitive.
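As a rough illustration of why a 5 KB overhead barely matters, here is a minimal sketch. It is not Iceberg's actual planner: the splitSize helper and the 4 MB row-group gap are hypothetical, and only the 5 KB constant mirrors the quoted code.

```java
// Minimal sketch, not Iceberg's planner: compares the quoted 5 KB overhead
// against the megabyte-scale row-group discrepancies mentioned above.
public class SplitOverheadScale {

  // Mirrors the quoted constant: 5 KB added on top of the computed split size.
  private static final long SPLIT_OVERHEAD = 5 * 1024;

  // Hypothetical split size: input divided evenly across the desired output files.
  static long splitSize(long inputSizeBytes, int numOutputFiles) {
    return inputSizeBytes / numOutputFiles + SPLIT_OVERHEAD;
  }

  public static void main(String[] args) {
    long inputSize = 580L * 1024 * 1024;  // 580 MB of input, as in the PR description
    long split = splitSize(inputSize, 2);
    long rowGroupGap = 4L * 1024 * 1024;  // row groups commonly differ by a few MB (assumed)

    System.out.printf("split size         = %,d bytes%n", split);
    System.out.printf("5 KB overhead      = %.4f%% of the split%n", 100.0 * SPLIT_OVERHEAD / split);
    System.out.printf("4 MB row-group gap = %.2f%% of the split%n", 100.0 * rowGroupGap / split);
  }
}
```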
Force-pushed from 1e69b07 to 557ba36.
anuragmantri approved these changes on Nov 14, 2023.
nk1506 reviewed on Nov 14, 2023.
Review comment on the changed test lines:

SizeBasedDataRewriter.MAX_FILE_SIZE_BYTES, String.valueOf(maxFileSize));
rewriter.init(options);

// the total task size is 580 bytes and the target file size is 512 bytes
nk1506 (Contributor):
I think it is 580 megabytes and 512 megabytes.
aokolnychyi (Contributor, Author):
Good catch, I updated the logic but forgot the comment. Fixed.
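For reference, a minimal sketch of the arithmetic behind the corrected comment. The 10% tolerance figure comes from the PR description; the numOutputFiles helper is illustrative, not Iceberg's actual code.

```java
// Illustrative only: with 580 MB of input and a 512 MB target, the remainder
// (68 MB) is ~13% of the target, which exceeds the assumed 10% tolerance,
// so a second output file is expected.
public class OutputFileCount {

  static long numOutputFiles(long inputSize, long targetFileSize) {
    long fullFiles = inputSize / targetFileSize;            // 1 for 580 MB / 512 MB
    long remainder = inputSize % targetFileSize;            // 68 MB
    double overflow = (double) remainder / targetFileSize;  // ~0.13
    // Assumed 10% tolerance: a small remainder is absorbed into the last file,
    // a larger one gets its own file.
    return overflow > 0.10 ? fullFiles + 1 : Math.max(fullFiles, 1);
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    System.out.println(numOutputFiles(580 * mb, 512 * mb)); // prints 2
  }
}
```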
Force-pushed from ba171da to 0a743e8.
Force-pushed from 0a743e8 to 08eb3d5.
aokolnychyi (Contributor, Author):
I reverted removal of the split overhead. Some tests are sensitive. I'll make that change in a follow-up PR.
nk1506 approved these changes on Nov 15, 2023.
RussellSpitzer approved these changes on Nov 16, 2023.
RussellSpitzer (Member):
Thanks @aokolnychyi for the fix and @nk1506 and @anuragmantri for the reviews.
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request on Apr 22, 2024.
PR description:

This PR adjusts the split size computation logic in the file rewriters. The previous logic performed poorly in some cases.

Suppose we have 4 files of 145 MB each, so 580 MB to compact. With a target file size of 512 MB, the rewriter decides to produce 2 output files, since the input is 13% larger than the target file size and only 10% overhead is allowed. The previous logic then used 580 MB / 2 = 290 MB as the split size, so the compaction produced 2 output files that were already poorly sized. Such files would be picked up again in the next round even if there was no new data, creating a never-ending loop of useless compaction.

This PR makes sure the split size is never less than the target output file size.
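A minimal sketch of the before-and-after behavior described above. This is not the actual patch; the helper methods and constants are illustrative.

```java
// Illustrative sketch of the behavior described above, not the actual patch:
// the old logic divided the group size evenly across the planned output files,
// while the fix keeps the split size from dropping below the target file size.
public class SplitSizeFix {

  static long oldSplitSize(long inputSize, long numOutputFiles) {
    return inputSize / numOutputFiles;                           // 580 MB / 2 = 290 MB splits
  }

  static long fixedSplitSize(long inputSize, long numOutputFiles, long targetFileSize) {
    return Math.max(inputSize / numOutputFiles, targetFileSize); // never below the target
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    long inputSize = 580 * mb;  // four 145 MB files
    long target = 512 * mb;

    System.out.printf("old split size   = %d MB%n", oldSplitSize(inputSize, 2) / mb);
    System.out.printf("fixed split size = %d MB%n", fixedSplitSize(inputSize, 2, target) / mb);
  }
}
```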