
Conversation

@deardeng
Contributor

cherry pick from #58962

Currently, when performing tablet warm-up balancing in the cloud, the
sequential execution of individual warm-up tasks leads to a series of
problems, such as:

1. When scaling up a compute group by adding BE nodes, with a large
number of tables (millions of tablets), actual tests showed that scaling
from 1 BE node to 10 BE nodes took more than 6 hours to reach a
balanced state; each warm-up task RPC took about 30 ms (a
back-of-envelope check follows this list). This means that even if a
new node can handle the load, scaling out a new node in the cloud can
still take up to 6 hours in the worst case.

2. Due to the same logic, decommissioning a BE is also relatively slow.
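
(A rough consistency check using only the numbers above: 6 hours ≈ 21,600
seconds, and at ~30 ms per sequential warm-up RPC that is on the order of
720,000 tasks, consistent with rebalancing a tablet population in the
millions one tablet at a time.)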

Fixes:

1. Batch and pipeline warm-up tasks. Each batch can contain multiple
warm-up tasks with the same source and destination (each task represents
migrating one tablet); see the sketch after this list.

2. Move warm-up task finish handling to a separate thread so that the
scheduling logic does not block the logic that updates tablet mappings
(also shown in the sketch below).

3. Asynchronously fetch file cache meta in the warm_up_cache_async logic
and add some bvars.
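
To make fixes 1 and 2 concrete, below is a minimal, hypothetical sketch of
the batching and pipelining idea. It is not the actual Doris scheduler
code; `WarmUpBatchScheduler`, `WarmUpTask`, `sendWarmUpBatchAsync`, and
`onBatchFinished` are illustrative names. It demonstrates two points:
grouping tasks that share a (source BE, destination BE) pair into one
batch RPC, and handing completion processing to a dedicated executor so
that finish handling never blocks scheduling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WarmUpBatchScheduler {
    /** One task = warming up one tablet's cache from srcBe to destBe. */
    record WarmUpTask(long tabletId, long srcBe, long destBe) {}

    /** Batch key: tasks with the same source and destination share one RPC. */
    record Route(long srcBe, long destBe) {}

    // Dedicated executor for finish handling, so completion processing
    // (which updates tablet mappings) never stalls the scheduling thread.
    private final ExecutorService finishExecutor = Executors.newSingleThreadExecutor();

    public void schedule(List<WarmUpTask> pending, int maxBatchSize) {
        // 1. Group pending tasks by (source BE, destination BE).
        Map<Route, List<WarmUpTask>> byRoute = new HashMap<>();
        for (WarmUpTask t : pending) {
            byRoute.computeIfAbsent(new Route(t.srcBe(), t.destBe()),
                    k -> new ArrayList<>()).add(t);
        }
        // 2. Pipeline: send each batch asynchronously instead of paying
        //    ~30 ms of RPC latency per tablet; one RPC covers many tablets.
        for (Map.Entry<Route, List<WarmUpTask>> e : byRoute.entrySet()) {
            List<WarmUpTask> tasks = e.getValue();
            for (int i = 0; i < tasks.size(); i += maxBatchSize) {
                List<WarmUpTask> batch = new ArrayList<>(
                        tasks.subList(i, Math.min(i + maxBatchSize, tasks.size())));
                sendWarmUpBatchAsync(e.getKey(), batch);
            }
        }
    }

    private void sendWarmUpBatchAsync(Route route, List<WarmUpTask> batch) {
        // Stand-in for the real async batch RPC. On completion, the result
        // is handed to finishExecutor instead of being processed inline.
        finishExecutor.submit(() -> onBatchFinished(route, batch));
    }

    private void onBatchFinished(Route route, List<WarmUpTask> batch) {
        // Off the scheduling path: update tablet mappings, release balance
        // slots, and record metrics for the finished batch here.
    }
}
```

Under this scheme the per-tablet RPC round trip is amortized across the
batch, and a slow finish handler can no longer delay dispatch of the next
batch.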

Post-fix testing showed that in a scenario with 10 databases, 10,000
tables, 100,000 partitions, and 1 million tablets, scaling from 3 to 10
BE nodes reached a balanced state within 10 minutes.
@deardeng deardeng requested a review from yiguolei as a code owner December 24, 2025 11:36
@deardeng
Contributor Author

run buildall

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Not sure what to do next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what the possible impacts are.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what the difference is before and after the optimization.

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/12) 🎉

Increment coverage report
Complete coverage report

Category          | Coverage
----------------- | ----------------------
Function Coverage | 53.30% (18603/34902)
Line Coverage     | 39.04% (172225/441175)
Region Coverage   | 33.70% (133088/394916)
Branch Coverage   | 34.72% (57593/165883)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 2.06% (4/194) 🎉
Increment coverage report
Complete coverage report

@yiguolei yiguolei merged commit 97ea173 into apache:branch-4.0 Dec 25, 2025
22 of 25 checks passed
