Describe the bug
BE crashes with a coredump like:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x0000000001045667 in release (bytes=8192, this=0x633b7d40) at /var/doris/palo-core/be/src/runtime/mem_tracker.h:221
#2 doris::RowBatch::clear (this=0x8fc6bce0) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:294
#3 0x00000000010459dd in clear (this=0x8fc6bce0) at /var/doris/palo-core/be/src/common/object_pool.h:54
#4 doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:301
#5 0x0000000001045a01 in doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:302
#6 0x00000000014a0299 in operator() (this=<optimized out>, __ptr=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:78
#7 ~unique_ptr (this=0x7070ec98, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:268
#8 doris::stream_load::NodeChannel::~NodeChannel (this=0x7070ec00, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/exec/tablet_sink.cpp:41
#9 0x00000000014a53f2 in ~SpecificElement (this=0x77645500, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:75
#10 doris::ObjectPool::SpecificElement<doris::stream_load::NodeChannel>::~SpecificElement (this=0x77645500, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:76
#11 0x0000000001052efd in clear (this=0x6a213220) at /var/doris/palo-core/be/src/common/object_pool.h:54
#12 ~ObjectPool (this=0x6a213220, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:37
#13 std::_Sp_counted_ptr<doris::ObjectPool*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:376
#14 0x000000000104b591 in _M_release (this=0x6a212fa0) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:154
#15 ~__shared_count (this=0x891de310, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:684
#16 ~__shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:1123
#17 ~shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr.h:93
#18 doris::RuntimeState::~RuntimeState (this=0x891de300, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/runtime_state.cpp:128
#19 0x000000000103dcc6 in checked_delete<doris::RuntimeState> (x=0x891de300) at /var/doris/thirdparty/installed/include/boost/core/checked_delete.hpp:34
#20 ~scoped_ptr (this=0x88082ce0, __in_chrg=<optimized out>) at /var/doris/thirdparty/installed/include/boost/smart_ptr/scoped_ptr.hpp:89
#21 doris::PlanFragmentExecutor::~PlanFragmentExecutor (this=0x88082b70, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/plan_fragment_executor.cpp:62
#22 0x0000000000fd39ed in doris::FragmentExecState::~FragmentExecState (this=0x88082b00, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/fragment_mgr.cpp:175
Debug
In some abnormal situations, performing an Insert load causes BE to crash. The troubleshooting findings are as follows:
- FE side:
During the execution of execState.execRemoteFragmentAsync(), if an RPC error occurs (for example, the corresponding BE is down), an exception is thrown directly instead of the error being returned via Future<PExecPlanFragmentResult>. In this case, the Coordinator does not proceed with the subsequent Cancel operation.
After the exception is thrown, the call stack returns to handleInsertStmt(), which directly returns the error message to the user: the Insert has failed. At this point, FE does no further processing.
- BE side:
BE receives the execution plan of the Insert and calls _sink->open in PlanFragmentExecutor::open_internal() to open the TabletSink.
The Open method of TabletSink opens all related NodeChannels via RPC, but because some BEs have problems, some of the NodeChannels fail to open. An error message appears in the BE log:
tablet open failed, load_id=4e2f4d6076cc4692-b8f26cc924b6be0d, txn_id144942945, node=10.xx.xx.xx:8060, errmsg=failed to open tablet writer
However, because the majority of NodeChannels were opened successfully, the TabletSink Open operation returns success.
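As a rough sketch of this quorum-style open (the names MiniNodeChannel and open_node_channels are illustrative stand-ins, not the actual Doris API):

```cpp
#include <vector>

// Illustrative stand-in for a NodeChannel; `rpc_ok` represents whether the
// open RPC to that BE succeeded.
struct MiniNodeChannel {
    bool rpc_ok = true;
    bool open() const { return rpc_ok; }
};

// Open reports success as long as a majority of channels opened, so a few
// failed channels go unnoticed at this stage and keep holding the RowBatch
// they allocated.
bool open_node_channels(const std::vector<MiniNodeChannel>& channels) {
    int ok = 0;
    for (const auto& ch : channels) {
        if (ch.open()) {
            ++ok;
        }
    }
    return ok > static_cast<int>(channels.size()) / 2;
}
```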
Next, PlanFragmentExecutor::open_internal() calls get_next_internal() to start fetching data. Because the Insert has already failed at this point, it returns a failure here, and then a bug in the subsequent update_status(status) method comes into play:
At line 493, if (_status.ok()) should be if (!_status.ok()). Because of this error the _status variable is never updated, so the NodeChannels end up being closed instead of canceled when the TabletSink is finally closed.
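One plausible shape of the corrected logic, as a minimal sketch with simplified stand-in types (not a verbatim copy of plan_fragment_executor.cpp): the first non-OK status should be recorded so that the later sink close sees a failure and cancels the NodeChannels.

```cpp
#include <mutex>
#include <string>

// Simplified stand-in for doris::Status, just enough for the sketch.
struct Status {
    std::string msg;
    bool ok() const { return msg.empty(); }
};

struct ExecutorStatusHolder {
    std::mutex _status_lock;
    Status _status;                       // overall fragment status, OK by default

    Status update_status(const Status& new_status) {
        if (new_status.ok()) {
            return new_status;            // nothing to record
        }
        std::lock_guard<std::mutex> l(_status_lock);
        if (!_status.ok()) {              // an earlier error is already recorded, keep it;
            return _status;               // the buggy check tested _status.ok() here, so the
        }                                 // first error was never stored
        _status = new_status;
        return _status;
    }
};
```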
A normal NodeChannel close sends the last RowBatch it holds and destroys the RowBatch object. But because some NodeChannels were never opened, they cannot be closed normally, so the RowBatch held by those NodeChannels is not destroyed.
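For illustration, a simplified sketch of the close-vs-cancel decision (hypothetical names such as SketchChannel and close_sink, not the real TabletSink code), showing why the pending batch of a never-opened channel survives until the destructor:

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins, not the real Doris classes.
struct PendingBatch { /* rows waiting to be sent; memory is tracked by the profile */ };

struct SketchChannel {
    bool opened = false;
    std::unique_ptr<PendingBatch> cur_batch = std::make_unique<PendingBatch>();

    void close() {
        if (!opened) {
            return;             // never opened: nothing is sent, nothing is freed
        }
        // ... send cur_batch to the remote BE ...
        cur_batch.reset();      // batch destroyed while the profile is still alive
    }

    void cancel() {
        cur_batch.reset();      // drop the pending batch immediately
    }
};

// Because _status was never updated, the sink is closed with an OK status even
// though the Insert failed, so unopened channels keep their batch until the
// channel destructor runs later inside the ObjectPool.
void close_sink(std::vector<SketchChannel>& channels, bool exec_status_ok) {
    for (auto& ch : channels) {
        if (exec_status_ok) {
            ch.close();
        } else {
            ch.cancel();
        }
    }
}
```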
After the execution of the entire plan completes, the PlanFragmentExecutor destruction process begins. The destructor call chain is as follows:
PlanFragmentExecutor
|--- RuntimeState
     |--- RuntimeProfile
     |--- ObjectPool
          |--- NodeChannel
               |--- RowBatch
                    |--- MemTracker->release()
                         |--- profile->_consumption->add(-bytes)
Note that the RuntimeProfile is destructed first inside RuntimeState. As a result, when the RowBatch in the NodeChannel is destructed later, it calls into the already-destructed _consumption object, which eventually crashes BE.
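A minimal, self-contained illustration of this ordering hazard, using made-up stand-in types rather than the real Doris classes:

```cpp
// The batch keeps a non-owning pointer to a counter owned by the profile; if
// the profile is destroyed first, ~Batch() dereferences freed memory, which
// mirrors MemTracker::release() calling _consumption->add(-bytes) in the
// backtrace above.
struct Counter {
    long value = 0;
    void add(long delta) { value += delta; }
};

struct Profile {
    Counter consumption;            // stands in for RuntimeProfile::_consumption
};

struct Batch {
    Counter* consumption;           // non-owning, like MemTracker -> counter
    long bytes;
    ~Batch() { consumption->add(-bytes); }   // use-after-free if Profile died first
};

struct StateSketch {
    Profile* profile = new Profile();
    Batch* pooled_batch = new Batch{&profile->consumption, 8192};

    ~StateSketch() {
        // Mirrors ~RuntimeState: the profile goes first, then the object pool
        // holding the channel's batch is cleared.
        delete profile;
        delete pooled_batch;        // crash: the counter was freed with the profile
    }
};
```

Constructing and destroying a StateSketch reproduces the same use-after-free pattern that the backtrace shows.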
The whole process has the following problems:
- The bug in update_status(status) caused the NodeChannel not to be canceled correctly, which leads to problems during the final destruction. (If Cancel of the NodeChannel were called in advance, the RowBatch would be destructed in advance.)
- When the FE Coordinator executes execRemoteFragmentAsync() and finds an RPC error, it should return a Future with an error code, continue the process, and actively call Cancel().
- The _status in RuntimeState has no lock protection, which may cause some potential problems.