-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[opt](split) get file splits in batch mode #34032
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
clang-tidy review says "All clean, LGTM! 👍" |
9e436b6 to
380cea0
Compare
|
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
TeamCity be ut coverage result: |
Jibing-Li
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
| SplitSource splitSource = new SplitSource( | ||
| this::splitToScanRange, backend, locationProperties, splits, pathPartitionKeys); | ||
| splitSources.add(splitSource); | ||
| SplitSourceManager.registerSplitSource(splitSource); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use singleton
| import java.util.Map; | ||
| import java.util.concurrent.ConcurrentHashMap; | ||
|
|
||
| public class SplitSourceManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest:
- not using singletion
- extends
MasterDaemonclass
| return QeProcessorImpl.INSTANCE.reportExecStatus(params, getClientAddr()); | ||
| } | ||
|
|
||
| public TFetchSplitBatchResult fetchSplitBatch(TFetchSplitBatchRequest request) throws TException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing @OverRide
| LOG(WARNING) << "Failed to get batch of split source: {}, try to reopen" << e1.what(); | ||
| RETURN_IF_ERROR(coord.reopen()); | ||
| try { | ||
| coord->fetchSplitBatch(result, request); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should not retry when failure.
If first call fail, it is highly possible the second would fail too.
Simply fail this query to avoid avalanche
| @ConfField(mutable = true, masterOnly = false, description = { | ||
| "如果切片数量超过阈值,BE将通过batch方式获取scan ranges", | ||
| "If the number of splits exceeds the threshold, scan ranges will be got through batch mode."}) | ||
| public static int num_splits_in_batch_mode = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better be a session varible?
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
TPC-H: Total hot run time: 41933 ms |
TPC-DS: Total hot run time: 187732 ms |
|
TeamCity be ut coverage result: |
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
kaka11chen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When scanning a table with many files, It will take a lot of time to transfer splits to backends.(20s of the following 1209172 splits). Therefore, using batch mode to fetch the file splits, BE can do scanning while fetch the file splits.
When scanning a table with many files, It will take a lot of time to transfer splits to backends.(20s of the following 1209172 splits). Therefore, using batch mode to fetch the file splits, BE can do scanning while fetch the file splits.
When scanning a table with many files, It will take a lot of time to transfer splits to backends.(20s of the following 1209172 splits). Therefore, using batch mode to fetch the file splits, BE can do scanning while fetch the file splits.
When scanning a table with many files, It will take a lot of time to transfer splits to backends.(20s of the following 1209172 splits). Therefore, using batch mode to fetch the file splits, BE can do scanning while fetch the file splits.
PR #34032 introduce a new method to get splits batch by batch, but it removed a logic that BE will merge scan ranges to avoid too many scan ranges being scheduled. This PR mainly changes: 1. Add scan range merging logic back. 2. Change the default file split size from 8MB to 64MB, to avoid too many small split.
PR apache#34032 introduce a new method to get splits batch by batch, but it removed a logic that BE will merge scan ranges to avoid too many scan ranges being scheduled. This PR mainly changes: 1. Add scan range merging logic back. 2. Change the default file split size from 8MB to 64MB, to avoid too many small split.
PR #34032 introduce a new method to get splits batch by batch, but it removed a logic that BE will merge scan ranges to avoid too many scan ranges being scheduled. This PR mainly changes: 1. Add scan range merging logic back. 2. Change the default file split size from 8MB to 64MB, to avoid too many small split.
PR #34032 introduce a new method to get splits batch by batch, but it removed a logic that BE will merge scan ranges to avoid too many scan ranges being scheduled. This PR mainly changes: 1. Add scan range merging logic back. 2. Change the default file split size from 8MB to 64MB, to avoid too many small split.
Proposed changes
When scanning a table with many files, It will take a lot of time to transfer splits to backends.(20s of the following 1209172 splits).
Therefore, using batch mode to fetch the file splits, BE can do scanning while fetch the file splits.
Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...