
Conversation

@deardeng
Contributor

cherry pick from #58962

Currently, when performing tablet warm-up balancing in the cloud, the
sequential execution of individual warm-up tasks leads to a series of
problems, such as:

1. When scaling up a compute group by adding BE nodes, with a large
number of tables (millions of tablets), actual tests showed that scaling
from 1 BE node to 10 BE nodes took more than 6 hours to reach a
balanced state; each warm-up task RPC took about 30 ms (a
back-of-envelope check follows this list). This means that even if a
new node can handle the load, scaling out a new node in the cloud can
still take up to 6 hours in the worst case.

2. Due to the same logic, decommissioning a BE is also relatively slow.
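
(A rough consistency check using only the numbers above: 6 hours ≈ 21,600
seconds, and at ~30 ms per sequential warm-up RPC that is on the order of
720,000 tasks, consistent with rebalancing a tablet population in the
millions one tablet at a time.)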

Fixes:

1. Batch and pipeline warm-up tasks. Each batch can contain multiple
warm-up tasks with the same source and destination (each task represents
migrating one tablet); see the sketch after this list.

2. Move warm-up task finish handling to a separate thread so that the
scheduling logic does not block the logic that updates tablet mappings
(also shown in the sketch below).

3. Asynchronously fetch file cache meta in the warm_up_cache_async logic
and add some bvars.
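
To make fixes 1 and 2 concrete, below is a minimal, hypothetical sketch of
the batching and pipelining idea. It is not the actual Doris scheduler
code; `WarmUpBatchScheduler`, `WarmUpTask`, `sendWarmUpBatchAsync`, and
`onBatchFinished` are illustrative names. It demonstrates two points:
grouping tasks that share a (source BE, destination BE) pair into one
batch RPC, and handing completion processing to a dedicated executor so
that finish handling never blocks scheduling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WarmUpBatchScheduler {
    /** One task = warming up one tablet's cache from srcBe to destBe. */
    record WarmUpTask(long tabletId, long srcBe, long destBe) {}

    /** Batch key: tasks with the same source and destination share one RPC. */
    record Route(long srcBe, long destBe) {}

    // Dedicated executor for finish handling, so completion processing
    // (which updates tablet mappings) never stalls the scheduling thread.
    private final ExecutorService finishExecutor = Executors.newSingleThreadExecutor();

    public void schedule(List<WarmUpTask> pending, int maxBatchSize) {
        // 1. Group pending tasks by (source BE, destination BE).
        Map<Route, List<WarmUpTask>> byRoute = new HashMap<>();
        for (WarmUpTask t : pending) {
            byRoute.computeIfAbsent(new Route(t.srcBe(), t.destBe()),
                    k -> new ArrayList<>()).add(t);
        }
        // 2. Pipeline: send each batch asynchronously instead of paying
        //    ~30 ms of RPC latency per tablet; one RPC covers many tablets.
        for (Map.Entry<Route, List<WarmUpTask>> e : byRoute.entrySet()) {
            List<WarmUpTask> tasks = e.getValue();
            for (int i = 0; i < tasks.size(); i += maxBatchSize) {
                List<WarmUpTask> batch = new ArrayList<>(
                        tasks.subList(i, Math.min(i + maxBatchSize, tasks.size())));
                sendWarmUpBatchAsync(e.getKey(), batch);
            }
        }
    }

    private void sendWarmUpBatchAsync(Route route, List<WarmUpTask> batch) {
        // Stand-in for the real async batch RPC. On completion, the result
        // is handed to finishExecutor instead of being processed inline.
        finishExecutor.submit(() -> onBatchFinished(route, batch));
    }

    private void onBatchFinished(Route route, List<WarmUpTask> batch) {
        // Off the scheduling path: update tablet mappings, release balance
        // slots, and record metrics for the finished batch here.
    }
}
```

Under this scheme the per-tablet RPC round trip is amortized across the
batch, and a slow finish handler can no longer delay dispatch of the next
batch.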

Post-fix testing showed that in a scenario with 10 databases, 10,000
tables, 100,000 partitions, and 1 million tablets, scaling from 3 to 10
BE nodes reached a balanced state within 10 minutes.
@deardeng deardeng requested a review from yiguolei as a code owner December 24, 2025 11:36
@deardeng
Contributor Author

run buildall

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Not sure what to do next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what the possible impacts are.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what the difference is before and after the optimization.

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/12) 🎉

Increment coverage report
Complete coverage report

Category          | Coverage
----------------- | ----------------------
Function Coverage | 53.30% (18603/34902)
Line Coverage     | 39.04% (172225/441175)
Region Coverage   | 33.70% (133088/394916)
Branch Coverage   | 34.72% (57593/165883)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 2.06% (4/194) 🎉
Increment coverage report
Complete coverage report

@yiguolei yiguolei merged commit 97ea173 into apache:branch-4.0 Dec 25, 2025
22 of 25 checks passed
