[improvement](graceful shutdown) waiting for all query finished when graceful shutdown #23865

yiguolei · 2023-09-04T10:30:54Z

Proposed changes

In some cloud-native deployment scenarios, BEs (especially Compute Node BEs) are added to and removed from the cluster very frequently. A user's query fails if one of its fragments is running on a BE that is shutting down. Users can run stop_be.sh --grace; the BE then waits for all running queries to finish so that they do not fail, but if the wait exceeds the time limit, the BE exits directly. During this period, the FE does not send any new queries to the BE while the running queries finish.
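The wait-then-exit behavior described above can be sketched as a bounded polling loop. This is only an illustration, not the BE's actual C++ implementation: `wait_for_queries`, `running_query_count`, `max_wait_seconds`, and the polling interval are hypothetical names chosen for the example.

```python
import time
from typing import Callable


def wait_for_queries(
    running_query_count: Callable[[], int],
    max_wait_seconds: float,
    poll_interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no queries are running or the wait limit is reached.

    Returns True if all queries finished in time, False if the limit was
    exceeded (the caller would then exit directly). All names here are
    hypothetical; they are not taken from the Doris code base.
    """
    deadline = clock() + max_wait_seconds
    while running_query_count() > 0:
        if clock() >= deadline:
            return False  # limit exceeded: stop waiting
        sleep(poll_interval_seconds)
    return True  # every running query has finished
```

A caller would pass a function that reports the number of in-flight queries and treat a `False` result as the "exit directly" case.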


github-actions · 2023-09-04T10:37:58Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2023-09-04T10:43:39Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2023-09-04T10:50:03Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2023-09-04T11:06:37Z

clang-tidy review says "All clean, LGTM! 👍"

yiguolei · 2023-09-04T11:13:43Z

run buildall

zhiqiang-hhhh · 2023-09-04T11:22:11Z

LGTM

hello-stephen · 2023-09-04T12:35:47Z

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.17 seconds
stream load tsv: 532 seconds loaded 74807831229 Bytes, about 134 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17161853812 Bytes

yiguolei · 2023-09-05T00:33:20Z

run buildall

yiguolei · 2023-09-05T00:36:14Z

run buildall

github-actions · 2023-09-05T00:41:39Z

clang-tidy review says "All clean, LGTM! 👍"

doris-robot · 2023-09-05T01:13:47Z

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.37 seconds
stream load tsv: 535 seconds loaded 74807831229 Bytes, about 133 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162131886 Bytes

Gabriel39

LGTM

github-actions · 2023-09-05T01:47:52Z

PR approved by at least one committer and no changes requested.

github-actions · 2023-09-05T01:47:54Z

PR approved by anyone and no changes requested.

zhiqiang-hhhh · 2023-09-05T01:50:06Z

LGTM

zhiqiang-hhhh

LGTM

…d Optimize Query Retry During BE Shutdown (#56601)

### What problem does this PR solve?

Related PR: #23865

This PR includes the following main changes:

#### BE Graceful Shutdown Improvements

1. New BE Parameter: `grace_shutdown_post_delay_seconds`

   When using the BE graceful stop feature, after the main process waits for all currently running tasks to complete, it will continue to wait for an additional period to ensure that queries still running on other nodes have also finished. Since a BE node cannot detect the execution status of tasks on other BE nodes, this threshold may need to be increased to allow a longer waiting time.

2. Enhanced BE `api/health` Endpoint

   * When the BE has not yet fully started or is in the process of shutting down, the endpoint will return:
     * Message: `"Server is not available"`
     * HTTP Code: `200`
   * Under normal circumstances:
     * Message: `"OK"`
     * HTTP Code: `200`

#### Added FE Graceful Shutdown Support

When using `stop_fe.sh --grace`, the FE will wait for currently running queries to finish before exiting. Note: currently, only query tasks are waited for; import and other types of tasks are not yet included.

#### Query Retry Optimization During BE Shutdown

In cloud mode, when encountering the error `"No backend available as scan node"`, the FE will now internally retry the query to reassign it to other available BE nodes.
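Two of the behaviors in this commit message can be illustrated with a rough sketch. Because `api/health` returns HTTP code `200` in both states as listed above, a caller has to look at the message rather than the status code. The retry helper mirrors the idea of retrying only on the `"No backend available as scan node"` error. The function names, the `RuntimeError` type, and `max_attempts` are assumptions made for this example. How the message is extracted from the response body is not specified here, and the real FE retry logic lives in Java and may differ.

```python
from typing import Callable

# Error text quoted in the commit message above.
NO_BACKEND_ERROR = "No backend available as scan node"


def be_is_available(http_code: int, message: str) -> bool:
    """Treat the BE as available only when the health message is "OK".

    The HTTP code alone is not enough: it is 200 both when the BE is
    healthy and when it is starting up or shutting down.
    """
    return http_code == 200 and message.strip() == "OK"


def run_with_retry(execute: Callable[[], str], max_attempts: int = 3) -> str:
    """Retry `execute` only when it fails with the no-backend error.

    `execute` and `max_attempts` are hypothetical; any other error is
    re-raised immediately without retrying.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return execute()
        except RuntimeError as err:
            if NO_BACKEND_ERROR not in str(err):
                raise
            last_error = err  # remember the failure and try again
    raise last_error
```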



Doris-Extras added 7 commits September 4, 2023 17:48

[improvement](graceful shutdown) waiting for all query finished when …

c3a9f3b

…graceful shutdown

f

b38fa5e

f

ee54ac0

f

8c55d98

f

72605cd

f

8f1822b

f

e02dbc0

f

b8ab18f

f

d2b18fb

yiguolei added the dev/2.0.2 label Sep 4, 2023

Merge branch 'master' into be_graceful_stop

fd1c59f

Gabriel39 approved these changes Sep 5, 2023


github-actions bot added the approved Indicates a PR has been approved by one committer. label Sep 5, 2023

github-actions bot added the reviewed label Sep 5, 2023

zhiqiang-hhhh approved these changes Sep 5, 2023


yiguolei merged commit 1d1a9e2 into apache:master Sep 5, 2023

xiaokang added the merge_conflict label Sep 5, 2023

yiguolei removed dev/2.0.2 merge_conflict labels Sep 5, 2023

morningman mentioned this pull request Sep 28, 2025

[opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown #56601

Merged

16 tasks


7 participants