[pipeline](datagen) Improve datagen operator parallelism #37195
Conversation
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.

run buildall

clang-tidy review says "All clean, LGTM! 👍"

TPC-H: Total hot run time: 39996 ms
TPC-DS: Total hot run time: 171719 ms
ClickBench: Total hot run time: 30.71 s

run buildall

clang-tidy review says "All clean, LGTM! 👍"

TPC-H: Total hot run time: 39833 ms
TPC-DS: Total hot run time: 174348 ms
ClickBench: Total hot run time: 30.54 s

clang-tidy review says "All clean, LGTM! 👍"

run buildall

clang-tidy review says "All clean, LGTM! 👍"

TPC-H: Total hot run time: 40169 ms
TPC-DS: Total hot run time: 172881 ms
ClickBench: Total hot run time: 30.4 s

run buildall

clang-tidy review says "All clean, LGTM! 👍"

PR approved by anyone and no changes requested.

PR approved by at least one committer and no changes requested.

TPC-H: Total hot run time: 41567 ms
TPC-DS: Total hot run time: 167780 ms
ClickBench: Total hot run time: 30.3 s

Proposed changes
Currently, loading/query tasks use the plan
`DataGenOperator (num_instance=1) -> ResultSinkOperator (num_instance=1)`.
This PR uses a local shuffle to improve parallelism, so the plan becomes
`DataGenOperator (num_instance=1) -> LocalExchangeSink (num_instance=1) -> LocalExchangeSource (num_instance=(cores / 2)) -> ResultSinkOperator (num_instance=(cores / 2))`.
The data generator and the exchange sink stay single-instance; everything downstream of the local exchange runs with `cores / 2` instances.
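
The fan-out described above can be illustrated with a small, self-contained Python sketch. It is not the Doris implementation. The names `sink_parallelism` and `run_plan` are hypothetical, and two details are assumptions: batches are distributed round-robin, and the parallelism is floored at 1 on a single-core machine. One producer thread stands in for `DataGenOperator` plus `LocalExchangeSink`, and `cores / 2` consumer threads stand in for `LocalExchangeSource` plus `ResultSinkOperator`.

```python
import os
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream on a channel


def sink_parallelism(cores=None):
    """Hypothetical helper: cores / 2 as in the plan, floored at 1 (assumption)."""
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, cores // 2)


def run_plan(num_rows, batch_size, cores=None):
    """Run one producer feeding N consumers; return rows consumed per consumer."""
    n = sink_parallelism(cores)
    channels = [queue.Queue() for _ in range(n)]
    rows_per_sink = [0] * n

    def datagen_and_exchange_sink():
        # Single instance: generate batches and hand them out round-robin
        # (the distribution policy is an assumption, not taken from the PR).
        for i, start in enumerate(range(0, num_rows, batch_size)):
            batch = list(range(start, min(start + batch_size, num_rows)))
            channels[i % n].put(batch)
        for ch in channels:
            ch.put(_DONE)  # tell every consumer that the stream is finished

    def exchange_source_and_result_sink(idx):
        # One of N instances: drain its own channel until the sentinel arrives.
        while True:
            batch = channels[idx].get()
            if batch is _DONE:
                break
            rows_per_sink[idx] += len(batch)

    threads = [threading.Thread(target=datagen_and_exchange_sink)]
    threads += [
        threading.Thread(target=exchange_source_and_result_sink, args=(i,))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rows_per_sink


if __name__ == "__main__":
    # With 8 cores the sketch uses 4 sink instances.
    print(run_plan(num_rows=1000, batch_size=100, cores=8))
```

Calling `run_plan(1000, 100, cores=8)` splits ten batches across four consumers, so every generated row is still consumed exactly once while the sink side runs in parallel.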