Conversation

@deardeng
Contributor

Currently, when performing tablet warm-up balancing in the cloud, the sequential execution of single warm-up tasks leads to a series of problems, such as:

  1. When scaling out a compute group with new BE nodes under a large number of tables (millions of tablets), actual tests showed that scaling from 1 BE node to 10 BE nodes took more than 6 hours to reach a balanced state. Each warm-up task RPC took about 30 ms, so even if a new node can handle the load, bringing a new node online in the cloud can still take up to 6 hours in the worst case.

  2. For the same reason, decommissioning a BE node is also slow.
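A back-of-the-envelope check shows why the sequential design hits this wall. This is a minimal illustrative calculation using only the figures quoted above (1 million tablets, ~30 ms per warm-up RPC), not a measurement:

```python
# Back-of-the-envelope estimate: one warm-up RPC per tablet, executed
# strictly sequentially at ~30 ms each (figures from the description above).
tablets = 1_000_000          # illustrative tablet count
rpc_ms = 30                  # observed per-task RPC latency
sequential_hours = tablets * rpc_ms / 1000 / 3600
print(f"~{sequential_hours:.1f} hours")  # ~8.3 hours for a fully sequential pass
```

An ideal fully sequential pass over 1 million tablets already costs over 8 hours of pure RPC latency, consistent with the 6+ hours observed once only a fraction of tablets need to move.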

Fixes:

  1. Batch and pipeline warm-up tasks. Each batch can contain multiple warm-up tasks that share the same source and destination BE (each task migrates one tablet).

  2. Move warm-up task completion handling to a separate thread so that the scheduling logic does not block the logic that updates tablet-to-BE mappings.

  3. Fetch the file cache metadata asynchronously in the warm_up_cache_async logic and add some bvars for observability.
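Fix 1 above can be sketched as follows. This is an illustrative Python sketch, not Doris code: the task tuples, function name, and the 64-task batch cap are assumptions, but it shows the grouping by (source BE, destination BE) that lets a single RPC carry many tablet warm-up tasks instead of one:

```python
from collections import defaultdict

# Hypothetical sketch of batching warm-up tasks: tasks that share the same
# (source BE, destination BE) pair are grouped so each batch can travel in
# one RPC. max_batch_size is an assumed cap, not a Doris constant.
def batch_warmup_tasks(tasks, max_batch_size=64):
    groups = defaultdict(list)
    for tablet_id, src_be, dst_be in tasks:
        groups[(src_be, dst_be)].append(tablet_id)
    batches = []
    for (src_be, dst_be), tablet_ids in groups.items():
        # Split oversized groups so a single RPC stays bounded.
        for i in range(0, len(tablet_ids), max_batch_size):
            batches.append((src_be, dst_be, tablet_ids[i:i + max_batch_size]))
    return batches

# Three single-tablet tasks collapse into two batched RPCs:
batches = batch_warmup_tasks([(1, "be1", "be2"), (2, "be1", "be2"), (3, "be3", "be2")])
print(batches)  # [('be1', 'be2', [1, 2]), ('be3', 'be2', [3])]
```

With batches formed this way, amortizing the ~30 ms round trip over many tablets is what turns hours of sequential RPCs into minutes.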

Post-fix testing showed that, in a scenario with 10 databases, 10,000 tables, 100,000 partitions, and 1 million tablets, scaling from 3 BE nodes to 10 reached a balanced state within 10 minutes.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@deardeng deardeng requested a review from morrySnow as a code owner December 24, 2025 11:32
@Thearas
Contributor

Thearas commented Dec 24, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng
Contributor Author

run buildall

@deardeng
Contributor Author

run feut
