-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Search before asking
- I had searched in the issues and found no similar issues.
Version
master
What's Wrong?
Here are two problem
one:
-
Problem phenomenon:
all thread in fragment thread pool inVNodeChannel::close_waitcallstd::this_thread::sleep_forto wait for finished. -
Problem analysis:
In functionFragmentMgr::_exec_actualwill callexec_state->execute(), but not process its return status, if_executor.open()failed, there is no place to call cancel actively, and in functionFragmentMgr::_exec_actualafterexec_state->execute()erase thefragment_instance_idfrom_fragment_map, leads to also cannot cancel through timeout, when deconstruction for exec_state, will callVNodeChannel::close_wait, and in this function callstd::this_thread::sleep_forto wait for finished, but the variable_add_batches_finishedand_cancelledvalue is always false because of executor open failed and not cancel, the thread will hang.
not process error return status, and erase fragment instance id from map directly:
void FragmentMgr::_exec_actual(std::shared_ptr<FragmentExecState> exec_state, FinishCallback cb) {
...
exec_state->execute();
...
// remove exec state after this fragment finished
{
std::lock_guard<std::mutex> lock(_lock);
_fragment_map.erase(exec_state->fragment_instance_id());
...
}
...
}
not process error return status, only print warning log:
Status FragmentExecState::execute() {
...
{
...
WARN_IF_ERROR(_executor.open(),
strings::Substitute("Got error while opening fragment $0, query id: $1",
print_id(_fragment_instance_id), print_id(_query_id)));
...
}
...
return Status::OK();
}
Status VNodeChannel::close_wait(RuntimeState* state) {
...
// waiting for finished, it may take a long time, so we couldn't set a timeout
while (!_add_batches_finished && !_cancelled) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
...
}
two:
-
Problem phenomenon:
bthread workers are exhausted, leads to BE cannot receive new rpc requests, FE send rpc timeout. -
Problem analysis:
When the brpc request reaches BEFragmentMgr::exec_plan_fragment, if the pthread pool is full, submit thread pool failed, will need bthread to destruct the local variablestd::shared_ptr<FragmentExecState> exec_state, and thenVNodeChannel::close_waitwill be called, but inVNodeChannel::close_waitcallstd::this_thread::sleep_forto wait finish, when the variable_add_batches_finishedand_cancelledvalue is always false, the bthread cannot switch out in time, which leads to the exhaustion of the bthread worker, and finally leads to BE cannot receive new rpc requests.
Status VNodeChannel::close_wait(RuntimeState* state) {
...
// waiting for finished, it may take a long time, so we couldn't set a timeout
while (!_add_batches_finished && !_cancelled) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
...
}
What You Expected?
For problem one:
process the return status for exec_state->execute() in function FragmentMgr::_exec_actual
For problem two:
- define member variable
_cancelledin FragmentExecState as atomic - use use bthread_usleep instead of std::this_thread::sleep_for in function
VNodeChannel::close_wait
How to Reproduce?
No response
Anything Else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct