[improvement](graceful shutdown) waiting for all query finished when graceful shutdown #23865
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
|
LGTM |
|
(From new machine)TeamCity pipeline, clickbench performance test result: |
Gabriel39
left a comment
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
LGTM |
zhiqiang-hhhh
left a comment
LGTM
…d Optimize Query Retry During BE Shutdown (#56601)

### What problem does this PR solve?

Related PR: #23865

This PR includes the following main changes:

#### BE Graceful Shutdown Improvements

1. New BE Parameter: `grace_shutdown_post_delay_seconds`

   When using the BE graceful stop feature, after the main process has waited for all currently running tasks to complete, it continues to wait for an additional period so that queries still running on other nodes can also finish. Because a BE node cannot detect the execution status of tasks on other BE nodes, this threshold may need to be increased to allow a longer waiting time.

2. Enhanced BE `api/health` Endpoint

   * When the BE has not yet fully started or is in the process of shutting down, the endpoint returns:
     * Message: `"Server is not available"`
     * HTTP Code: `200`
   * Under normal circumstances:
     * Message: `"OK"`
     * HTTP Code: `200`

#### Added FE Graceful Shutdown Support

When using `stop_fe.sh --grace`, the FE waits for currently running queries to finish before exiting. Note: currently, only query tasks are waited for; import and other types of tasks are not yet included.

#### Query Retry Optimization During BE Shutdown

In cloud mode, when encountering the error `"No backend available as scan node"`, the FE now retries the query internally to reassign it to other available BE nodes.
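Because the `api/health` description above lists HTTP code `200` for both states, a caller that checks only the status code cannot tell a shutting-down BE from a healthy one. The following is a minimal sketch of a client-side check that looks at the message instead. The function name `be_accepts_queries` is hypothetical, and the sketch assumes the caller has already extracted the message text from the response, since the response format is not given here.

```python
def be_accepts_queries(http_code: int, message: str) -> bool:
    """Decide whether a BE should receive new queries.

    Hypothetical helper: `message` is assumed to be the message text already
    extracted from the api/health response by the caller.
    """
    # Both the healthy and the unavailable state are described as returning
    # HTTP 200, so the message text is what distinguishes them.
    return http_code == 200 and message.strip() == "OK"
```

A load balancer or readiness probe built on the status code alone would keep routing traffic to a BE that is starting up or shutting down, which is why the sketch treats any message other than `"OK"` as unavailable.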
Proposed changes
In some cloud-native deployment scenarios, BEs (especially Compute Node BEs) are added to and removed from the cluster very frequently. A user's query fails if one of its fragments is running on a BE that is shutting down. Users can run `stop_be.sh --grace`; the BE then waits for all running queries to finish so that they do not fail, but if the wait exceeds the limit, the BE exits directly. During this period, the FE does not send any queries to the BE while the running queries finish.
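The wait-then-exit behavior described above can be sketched as a bounded polling loop. This is an illustration under stated assumptions, not the BE implementation (which is not shown here): `wait_for_queries`, `running_queries`, `max_wait_seconds`, and `poll_interval_seconds` are all hypothetical names.

```python
import time
from typing import Callable


def wait_for_queries(
    running_queries: Callable[[], int],
    max_wait_seconds: float,
    poll_interval_seconds: float = 0.1,
) -> bool:
    """Poll until no queries are running or the wait limit is reached.

    Returns True if all queries finished in time, False if the limit was
    exceeded (the case in which the process would exit directly).
    All names here are hypothetical and chosen for illustration.
    """
    deadline = time.monotonic() + max_wait_seconds
    while running_queries() > 0:
        if time.monotonic() >= deadline:
            return False  # limit exceeded: stop waiting
        time.sleep(poll_interval_seconds)
    return True  # every running query finished before the limit
```

A monotonic clock is used for the deadline so that wall-clock adjustments during shutdown cannot shorten or extend the wait.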