[opt](cloud) optimize cloud balance warm up rpc #59155
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 34835 ms |
TPC-DS: Total hot run time: 178294 ms |
ClickBench: Total hot run time: 27.69 s |
|
@deardeng Hi, do you have time to review this PR? |
|
I have a fix here that solves this more thoroughly; the BE side also needs to be fixed. @liutang123, could you help review it?
private Map<InfightTablet, InfightTask> tabletToInfightTask = new HashMap<>();
private Map<InfightTablet, InfightTask> tabletToInfightTask = new ConcurrentHashMap<>();

private ForkJoinPool warmUpSendRpcPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
Using ForkJoinPool can lead to strange bugs; see https://github.com/apache/doris/pull/57382,
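For illustration only, one way to avoid ForkJoinPool is a plain fixed-size `ExecutorService`. The sketch below is an assumption about what a replacement could look like, not the approach taken in the referenced PR; `WarmUpRpcSender`, `sendAll`, and the `sendRpc` callback are hypothetical names that do not come from the Doris code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch: sends warm-up batches on a fixed-size thread pool. */
final class WarmUpRpcSender {
    private final ExecutorService pool;

    WarmUpRpcSender(int threads) {
        // A fixed pool has no work stealing and a bounded number of threads.
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Submits one task per batch, waits for all of them, and returns results in input order. */
    <B, R> List<R> sendAll(List<B> batches, Function<B, R> sendRpc) {
        List<Future<R>> futures = new ArrayList<>();
        for (B batch : batches) {
            Callable<R> task = () -> sendRpc.apply(batch);
            futures.add(pool.submit(task));
        }
        List<R> results = new ArrayList<>();
        for (Future<R> future : futures) {
            try {
                results.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for warm-up RPC", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("warm-up RPC failed", e.getCause());
            }
        }
        return results;
    }

    /** Stops accepting new tasks and waits briefly for running ones to finish. */
    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A caller would pass the real RPC call as `sendRpc`; whether the pool size should follow `availableProcessors()` or a config value is left open here.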
What problem does this PR solve?
In a cluster with many tablets, the cache warm-up process during cluster scaling (both scale-up and scale-down) takes an extremely long time.

In an 8-node cluster (each node hosting approximately 400,000 tablets), taking 2 nodes offline takes roughly 1.5 hours.
I found that a large portion of the time is spent on the FE sending RPCs to the BEs. Although each RPC has low latency, serially executing hundreds of thousands of them still adds up to a considerable amount of time.
I implemented batching and delayed RPC sending, which reduced the overall time by roughly a factor of 5, bringing it down to about 15 minutes.
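The batching idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WarmUpBatcher`, `PendingWarmUp`, `batchByBackend`, and the `batchSize` parameter are hypothetical names, and it assumes each pending warm-up is identified by a tablet id and a destination BE id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: groups pending warm-ups by destination BE and splits them into batches. */
final class WarmUpBatcher {

    /** One pending warm-up: warm the cache of tabletId on destBackendId (illustrative type). */
    static final class PendingWarmUp {
        final long tabletId;
        final long destBackendId;

        PendingWarmUp(long tabletId, long destBackendId) {
            this.tabletId = tabletId;
            this.destBackendId = destBackendId;
        }
    }

    /**
     * Returns, for each destination BE, its tablet ids split into chunks of at most batchSize.
     * Each chunk would be sent as a single RPC instead of one RPC per tablet.
     */
    static Map<Long, List<List<Long>>> batchByBackend(List<PendingWarmUp> pending, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Map<Long, List<List<Long>>> batches = new LinkedHashMap<>();
        for (PendingWarmUp p : pending) {
            List<List<Long>> chunks = batches.computeIfAbsent(p.destBackendId, k -> new ArrayList<>());
            // Start a new chunk when there is none yet or the last one is full.
            if (chunks.isEmpty() || chunks.get(chunks.size() - 1).size() >= batchSize) {
                chunks.add(new ArrayList<>());
            }
            chunks.get(chunks.size() - 1).add(p.tabletId);
        }
        return batches;
    }

    /** Number of RPCs needed after batching (one per chunk). */
    static int rpcCount(Map<Long, List<List<Long>>> batches) {
        int count = 0;
        for (List<List<Long>> chunks : batches.values()) {
            count += chunks.size();
        }
        return count;
    }
}
```

With this shape, the number of RPCs per BE drops from one per tablet to the tablet count divided by `batchSize`, rounded up. Delayed sending would sit on top of this by letting pending warm-ups accumulate for a short interval before calling `batchByBackend`.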

I will add UT and RT later.
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)