@yiguolei yiguolei commented Sep 4, 2023

Proposed changes

In some cloud-native deployment scenarios, BEs (especially Compute Node BEs) are added to and removed from the cluster very frequently. A user's query fails if one of its fragments is running on a BE that is shutting down. Users can run `stop_be.sh --grace`, in which case the BE waits for all running queries to finish so that they do not fail; if the wait exceeds the time limit, the BE exits directly. During this period, the FE does not send any new queries to the BE and waits for all running queries to finish.
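
A minimal sketch of the wait-with-limit behaviour described above, written in Python for illustration only (the BE itself is not implemented this way). The function name, the `running_query_count` callback, and the polling interval are hypothetical and are not taken from the Doris code base.

```python
import time
from typing import Callable


def wait_for_running_queries(
    running_query_count: Callable[[], int],
    wait_limit_seconds: float,
    poll_interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if all queries drained before the limit, False on timeout.

    `running_query_count` is a hypothetical callback that reports how many
    queries are still running on this node.
    """
    deadline = clock() + wait_limit_seconds
    while running_query_count() > 0:
        if clock() >= deadline:
            # Limit exceeded: the caller exits directly, as the PR describes.
            return False
        sleep(poll_interval_seconds)
    return True
```

The clock and sleep functions are injectable so that the timeout path can be exercised without real waiting.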

github-actions bot commented Sep 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

yiguolei commented Sep 4, 2023

run buildall

@zhiqiang-hhhh

LGTM

@hello-stephen

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.17 seconds
stream load tsv: 532 seconds loaded 74807831229 Bytes, about 134 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17161853812 Bytes

yiguolei commented Sep 5, 2023

run buildall

github-actions bot commented Sep 5, 2023

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.37 seconds
stream load tsv: 535 seconds loaded 74807831229 Bytes, about 133 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162131886 Bytes

@Gabriel39 Gabriel39 left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Sep 5, 2023
github-actions bot commented Sep 5, 2023

PR approved by at least one committer and no changes requested.

github-actions bot commented Sep 5, 2023

PR approved by anyone and no changes requested.

@zhiqiang-hhhh

LGTM

@zhiqiang-hhhh zhiqiang-hhhh left a comment

LGTM

@yiguolei yiguolei merged commit 1d1a9e2 into apache:master Sep 5, 2023
morningman added a commit that referenced this pull request Nov 7, 2025
…d Optimize Query Retry During BE Shutdown (#56601)

### What problem does this PR solve?

Related PR: #23865

This PR includes the following main changes:

#### BE Graceful Shutdown Improvements

1. New BE Parameter: `grace_shutdown_post_delay_seconds`

When using the BE graceful stop feature, after the main process waits
for all currently running tasks to complete, it will continue to wait
for an additional period to ensure that queries still running on other
nodes have also finished.
Since a BE node cannot detect the execution status of tasks on other BE
nodes, this threshold may need to be increased to allow a longer waiting
time.
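
The two-phase stop described in item 1 can be sketched as follows. This is an illustrative Python outline, not the BE implementation; `graceful_stop`, `wait_local_tasks`, and the returned log are hypothetical names, and only the post-delay value corresponds to the `grace_shutdown_post_delay_seconds` parameter.

```python
import time
from typing import Callable, List


def graceful_stop(
    wait_local_tasks: Callable[[], None],
    post_delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run the two shutdown phases in order and return a log of the steps."""
    log: List[str] = []

    # Phase 1: wait for tasks running on this node (hypothetical callback).
    wait_local_tasks()
    log.append("local tasks finished")

    # Phase 2: this node cannot see task state on other nodes, so it waits
    # a fixed extra period before exiting.
    if post_delay_seconds > 0:
        sleep(post_delay_seconds)
        log.append(f"post delay of {post_delay_seconds}s elapsed")

    log.append("exit")
    return log
```

Because the second phase is a fixed sleep rather than a check, the value has to be chosen to cover the longest query that may still be running elsewhere.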

2. Enhanced BE `api/health` Endpoint

   * When the BE has not yet fully started or is in the process of shutting
     down, the endpoint will return:

     * Message: `"Server is not available"`
     * HTTP Code: `200`

   * Under normal circumstances:

     * Message: `"OK"`
     * HTTP Code: `200`
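
Since the description above lists HTTP code `200` for both states, a readiness probe would presumably need to inspect the message instead of the status code. The following Python sketch shows one way to do that. It assumes the response body contains the message text verbatim; the exact body format is not stated here, and `be_is_ready` is a hypothetical helper.

```python
NOT_AVAILABLE_MESSAGE = "Server is not available"
OK_MESSAGE = "OK"


def be_is_ready(status_code: int, body: str) -> bool:
    """Decide readiness from an `api/health` response.

    Assumption: `body` contains the message text verbatim.
    """
    if status_code != 200:
        return False
    # Both states are listed with code 200, so the message decides.
    if NOT_AVAILABLE_MESSAGE in body:
        return False
    return OK_MESSAGE in body
```

For example, `be_is_ready(200, "Server is not available")` evaluates to `False`, so a load balancer using this check would stop routing to a BE that is starting up or shutting down.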

#### Added FE Graceful Shutdown Support

When using `stop_fe.sh --grace`, the FE will wait for currently running
queries to finish before exiting.

Note: currently, only query tasks are waited for; import and other types
of tasks are not yet included.

#### Query Retry Optimization During BE Shutdown

In cloud mode, when encountering the error
`"No backend available as scan node"`, the FE will now internally retry
the query to reassign it to other available BE nodes.
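
The retry behaviour can be sketched as below. This is a hedged Python illustration, not the FE code (which is written in Java): `run_with_retry`, the `execute` callback, the use of `RuntimeError`, and the attempt limit are all assumptions. Only the error text comes from the description above.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

RETRYABLE_ERROR = "No backend available as scan node"


def run_with_retry(execute: Callable[[], T], max_attempts: int = 3) -> T:
    """Re-run `execute` while it fails with the retryable scan-node error.

    `execute` is a hypothetical callback that plans and runs the query, so
    each attempt can pick a different set of available backends.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[RuntimeError] = None
    for _ in range(max_attempts):
        try:
            return execute()
        except RuntimeError as err:
            if RETRYABLE_ERROR not in str(err):
                raise  # unrelated failure: surface it immediately
            last_error = err
    assert last_error is not None
    raise last_error
```

Only the specific scan-node error is retried; any other failure is raised to the caller unchanged.
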
morningman added a commit to morningman/doris that referenced this pull request Nov 7, 2025
…d Optimize Query Retry During BE Shutdown (apache#56601)

wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Nov 18, 2025
…d Optimize Query Retry During BE Shutdown (apache#56601)

morningman added a commit to morningman/doris that referenced this pull request Nov 29, 2025
…d Optimize Query Retry During BE Shutdown (apache#56601)


Labels

approved Indicates a PR has been approved by one committer. reviewed
