Skip to content

Core: Fix split size calculations in file rewriters#9069

Merged
RussellSpitzer merged 1 commit intoapache:mainfrom
aokolnychyi:fix-compaction-size-issue
Nov 16, 2023
Merged

Core: Fix split size calculations in file rewriters#9069
RussellSpitzer merged 1 commit intoapache:mainfrom
aokolnychyi:fix-compaction-size-issue

Conversation

@aokolnychyi
Copy link
Contributor

This PR adjusts the split computation logic in file rewriters. The previous logic performed poorly in some cases.

Suppose we have 4 files, 145 MB each. This means we have 580 MB to compact. If the target file size is 512 MB, the rewriter will decide to produce 2 output files as the input is 13% larger than the target file size (we allow 10% overhead). If this happens, the previous logic will use 580 MB / 2 = 290 MB as the split size. That's why the compaction will produce 2 output files that are already poorly sized. Such files will be picked up again in the next round even if there is no new data, creating a never ending loop of useless compaction.

This PR makes sure the split size is never less than the target output file.

@github-actions github-actions bot added the core label Nov 14, 2023

public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT = 100L * 1024 * 1024 * 1024; // 100 GB

private static final long SPLIT_OVERHEAD = 5 * 1024;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This split overhead did very little given that it is only 5 KB. Row group size discrepancies are usually a few megabytes. I will have a follow-up PR to make our split planning less sensitive.

@aokolnychyi aokolnychyi force-pushed the fix-compaction-size-issue branch from 1e69b07 to 557ba36 Compare November 14, 2023 18:47
SizeBasedDataRewriter.MAX_FILE_SIZE_BYTES, String.valueOf(maxFileSize));
rewriter.init(options);

// the total task size is 580 bytes and the target file size is 512 bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is 580 mega bytes 512 mega bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I updated the logic but forgot the comment. Fixed.

@aokolnychyi aokolnychyi force-pushed the fix-compaction-size-issue branch 2 times, most recently from ba171da to 0a743e8 Compare November 14, 2023 19:01
@aokolnychyi aokolnychyi force-pushed the fix-compaction-size-issue branch from 0a743e8 to 08eb3d5 Compare November 14, 2023 20:48
@aokolnychyi
Copy link
Contributor Author

I reverted removal of the split overhead. Some tests are sensitive. I'll make that change in a follow-up PR.

@RussellSpitzer RussellSpitzer merged commit 2e2ac8d into apache:main Nov 16, 2023
@RussellSpitzer
Copy link
Member

Thanks @aokolnychyi for the fix and @nk1506 and @anuragmantri for the reviews

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants