[opt](query cancel) cancel query if it has pipeline task leakage #39223
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
5eb9c7c to af8853f
run buildall
clang-tidy made some suggestions
be/src/common/config.cpp (Outdated)
DEFINE_mInt16(topn_agg_limit_multiplier, "2");

DEFINE_mInt64(pipeline_task_leakage_detect_period_sec, "60");
secs
fixed
be/src/runtime/fragment_mgr.cpp (Outdated)
for (const auto& query_id : query_ids_and_rpc_succeed.first) {
    LOG_INFO("Running query id: {}", print_id(query_id));
    result_ref.insert(query_id);
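This probably can't be written this way. If the fetch from one FE fails, we can't assume that no queries are running on that FE; in that case we should treat all of its queries as valid.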
If the FE fetch fails, execution never reaches this point; line 216 returns false directly.
TPC-H: Total hot run time: 39804 ms
TPC-DS: Total hot run time: 202724 ms
ClickBench: Total hot run time: 31.33 s
af8853f to 109c130
run buildall
clang-tidy made some suggestions
TPC-H: Total hot run time: 40396 ms
TPC-DS: Total hot run time: 184586 ms
ClickBench: Total hot run time: 30.89 s
run buildall
TPC-H: Total hot run time: 40116 ms
TPC-DS: Total hot run time: 185556 ms
ClickBench: Total hot run time: 31.25 s
be/src/runtime/fragment_mgr.cpp (Outdated)
const std::map<TNetworkAddress, FrontendInfo>& running_fes =
        ExecEnv::GetInstance()->get_running_frontends();

std::vector<TNetworkAddress> qualified_fes;
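Our return value should not be a set.
It should be a map<feuid, set>.
When checking, a query should be considered invalid only if its FE uid is in this map and the query itself is not in the corresponding set.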
If a query's FE uid cannot be found in this map, the query should not be processed.
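The rule proposed in the two comments above can be sketched as follows. This is a minimal illustration, not the PR's code: `FeProcessUuid`, `QueryId`, `LocalQuery`, and `find_leaked_queries` are hypothetical names, and plain strings stand in for the real query id type.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real Doris types (assumptions for illustration).
using FeProcessUuid = int64_t;
using QueryId = std::string;

struct LocalQuery {
    QueryId id;
    FeProcessUuid fe_process_uuid;
};

// Returns the ids of local queries that look leaked.
// running_on_fe holds one entry per FE that reported its running queries.
std::vector<QueryId> find_leaked_queries(
        const std::vector<LocalQuery>& local_queries,
        const std::map<FeProcessUuid, std::set<QueryId>>& running_on_fe) {
    std::vector<QueryId> leaked;
    for (const auto& q : local_queries) {
        auto it = running_on_fe.find(q.fe_process_uuid);
        if (it == running_on_fe.end()) {
            // The query's FE uid is not in the map: leave the query alone.
            continue;
        }
        if (it->second.count(q.id) == 0) {
            // The FE is known but does not list this query as running.
            leaked.push_back(q.id);
        }
    }
    return leaked;
}
```

With this shape, a query whose FE uid is missing from the map is skipped by the lookup itself, so no separate allow-list is required for that case.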
be/src/runtime/fragment_mgr.cpp (Outdated)
auto future_status = future.wait_for(std::chrono::seconds(3));
if (future_status != std::future_status::ready) {
    LOG_WARNING("Fetch running queries from frontend timeout");
    continue;
Why is this a continue rather than reporting an error and returning false?
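One way to read this comment is that a single timed-out FE should abort the whole detection round instead of being skipped. The sketch below shows that all-or-nothing variant using `std::future::wait_for`; the function name `collect_all_or_nothing`, the `QuerySet` alias, and the out-parameter layout are assumptions for illustration and do not come from the PR.

```cpp
#include <chrono>
#include <future>
#include <set>
#include <string>
#include <vector>

using QuerySet = std::set<std::string>; // hypothetical stand-in for a set of query ids

// Waits for every pending FE response. Returns false as soon as one response
// is not ready within the timeout, so the caller can skip this round entirely
// instead of treating the silent FE as having no running queries.
bool collect_all_or_nothing(std::vector<std::future<QuerySet>>& pending,
                            std::chrono::milliseconds timeout,
                            std::vector<QuerySet>* results) {
    results->clear();
    for (auto& f : pending) {
        if (f.wait_for(timeout) != std::future_status::ready) {
            results->clear();
            return false;
        }
        results->push_back(f.get());
    }
    return true;
}
```

Note that the timeout applies per future here, so the worst-case wait grows with the number of FEs; a shared deadline would be an alternative design.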
be/src/runtime/fragment_mgr.cpp (Outdated)
// 2. the fe is starting, hb has not come yet
// 3. this query does not have coordinator at all (eg. streamload, spark connector)
if (q_ctx->get_fe_process_uuid() == 0) {
    white_list_queries.insert(q_ctx->query_id());
No need for this. In principle, if get running queries returns a map, then any query whose FE uid is not in that map should simply be ignored.
be/src/runtime/fragment_mgr.cpp (Outdated)
// Typically, this means this query is invalid, eg. we have some bugs in pipeline scheduler which
// makes the query can not be closed normally.
// We need to cancel these query to release resources.
LOG_ERROR(
Also log the time interval, for example the time of the first check and of the second check.
run buildall
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
run buildall
TPC-H: Total hot run time: 38033 ms
TPC-DS: Total hot run time: 184060 ms
ClickBench: Total hot run time: 30.94 s
run buildall
TPC-H: Total hot run time: 37700 ms
TPC-DS: Total hot run time: 189947 ms
ClickBench: Total hot run time: 30.25 s
run buildall
TPC-H: Total hot run time: 37966 ms
TPC-DS: Total hot run time: 189933 ms
ClickBench: Total hot run time: 30.85 s
wangbo left a comment
LGTM
* Problem
Pipeline tasks can leak in certain situations. A leak here means that a query has already completed, but its associated data structures still persist on the Backend (BE). As a result, some memory or computational resources on the BE may never be released.
* Fix
The cancel worker thread periodically reconciles queries with the Frontend (FE). When a query is found to have completed on the FE but still exists on the BE, the BE cancels it so that its resources are released promptly. To avoid cancelling queries by mistake, the strategy is conservative: for instance, no query is proactively cancelled if any FE is detected to be in an abnormal state or if there are network conflicts.
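As a rough illustration of the "periodically" part, the sketch below gates an expensive check so that it runs at most once per period even if the surrounding worker loop wakes up more often. How the PR actually schedules the check is not shown in this conversation; `PeriodicGate` and `should_run` are hypothetical names, and the 60-second period in the usage note is only assumed to correspond to the default of `pipeline_task_leakage_detect_period_sec`.

```cpp
#include <chrono>

// Hypothetical helper: lets a frequently ticking loop run a task at most once per period.
class PeriodicGate {
public:
    using Clock = std::chrono::steady_clock;

    explicit PeriodicGate(std::chrono::seconds period) : _period(period) {}

    // Returns true if the task should run at time `now`.
    // The first call always returns true; later calls return true only
    // once at least one full period has elapsed since the last run.
    bool should_run(Clock::time_point now) {
        if (_has_run && now - _last_run < _period) {
            return false;
        }
        _last_run = now;
        _has_run = true;
        return true;
    }

private:
    std::chrono::seconds _period;
    Clock::time_point _last_run{};
    bool _has_run = false;
};
```

A worker loop could construct `PeriodicGate gate(std::chrono::seconds(60))` once and call `gate.should_run(PeriodicGate::Clock::now())` on every iteration, running the reconciliation only when it returns true. Passing the time point in explicitly keeps the gate easy to test.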