Skip to content

BE occasionally crashes in DataStreamSender #2466

@gaodayue

Description

@gaodayue

Describe the bug
Recently we found BE occasionally crash inside brpc with the following stacktrace.

#0  dec_ref (this=0x2000000000134) at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/butil/iobuf.cpp:248
#1  butil::IOBuf::clear (this=this@entry=0x7f0e2f176830) at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/butil/iobuf.cpp:723
#2  0x0000000002189662 in ~IOBuf (this=0x7f0e2f176830, __in_chrg=<optimized out>) at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/butil/iobuf.h:108
#3  brpc::Controller::IssueRPC (this=this@entry=0x4d70ba788, start_realtime_us=start_realtime_us@entry=1576230170582720)
    at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/controller.cpp:1096
#4  0x0000000002180977 in brpc::Channel::CallMethod (this=0x692ed700, method=0x66dca380, controller_base=0x4d70ba788, request=0x8571315a0, response=0x4d70ba9d8, done=0x4d70ba780)
    at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/channel.cpp:529
#5  0x000000000115b31f in palo::PInternalService_Stub::transmit_data (this=<optimized out>, controller=0x4d70ba788, request=0x8571315a0, response=0x4d70ba9d8, done=0x4d70ba780)
    at /home/kangkaisen/palo/gensrc/build/gen_cpp/palo_internal_service.pb.cc:299
#6  0x00000000013a4651 in doris::DataStreamSender::Channel::send_batch (this=this@entry=0x8571314a0, batch=batch@entry=0x0, eos=eos@entry=true)
    at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:232
#7  0x00000000013a515c in doris::DataStreamSender::Channel::close_internal (this=0x8571314a0) at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:289
#8  0x00000000013a5309 in close (state=0x39763a600, this=<optimized out>) at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:296
#9  doris::DataStreamSender::close (this=0x7305669c0, state=0x39763a600, exec_status=...) at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:607
#10 0x0000000000e944cc in doris::PlanFragmentExecutor::open_internal (this=this@entry=0x5e61ab670) at /home/kangkaisen/palo/be/src/runtime/plan_fragment_executor.cpp:348
#11 0x0000000000e946b4 in doris::PlanFragmentExecutor::open (this=this@entry=0x5e61ab670) at /home/kangkaisen/palo/be/src/runtime/plan_fragment_executor.cpp:281
#12 0x0000000000e23797 in doris::FragmentExecState::execute (this=0x5e61ab600) at /home/kangkaisen/palo/be/src/runtime/fragment_mgr.cpp:204
#13 0x0000000000e24f9c in doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) (this=0x6ebc000, exec_state=..., cb=...)
    at /home/kangkaisen/palo/be/src/runtime/fragment_mgr.cpp:368
#14 0x0000000000e28364 in operator() (a2=<error reading variable: access outside bounds of object referenced via synthetic pointer>, a1=..., p=<optimized out>, this=<optimized out>)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/bind/mem_fn_template.hpp:280
#15 operator()<boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)> >, boost::_bi::list0> (a=<synthetic pointer>, f=...,
    this=<optimized out>) at /home/kangkaisen/palo/thirdparty/installed/include/boost/bind/bind.hpp:398
#16 operator() (this=<optimized out>) at /home/kangkaisen/palo/thirdparty/installed/include/boost/bind/bind.hpp:1294
#17 boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)> >, boost::_bi::list3<boost::_bi::value<doris::FragmentMgr*>, boost::_bi::value<std::shared_ptr<doris::FragmentExecState> >, boost::_bi::value<std::function<void (doris::PlanFragmentExecutor*)> > > >, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...) at /home/kangkaisen/palo/thirdparty/installed/include/boost/function/function_template.hpp:159
#18 0x0000000000ddf08a in operator() (this=0x7f0e2f177020) at /home/kangkaisen/palo/thirdparty/installed/include/boost/function/function_template.hpp:759
#19 doris::ThreadPool::work_thread (this=0x6ebc080, thread_id=<optimized out>) at /home/kangkaisen/palo/be/src/util/thread_pool.hpp:120
#20 0x00000000016f194d in thread_proxy ()
#21 0x00007f0e74130dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f0e73535ead in clone () from /lib64/libc.so.6

Content of the problematic butil::IOBuf is pasted below, both 0x2000000000134 and 0x30000000002a6 are invalid memory locations causing the segment fault we saw.

{_bv = {magic = 0, start = 400949248, refs = 0x2000000000134, nref = 0, cap_mask = 2281504768,
      nbytes = 844424930132646}, _sv = {refs = {{offset = 0, length = 400949248, block = 0x2000000000134}, {offset = 0, length = 2281504768, block = 0x30000000002a6}}}}

The context of DataStreamSender::Channel::send_batch is that _packet_seq == 1 which means the DataStreamSender::Channel is closed right after init.

What's also strange is that _closure->_refs == 1, does is mean the response of the RPC has been received?

(gdb) p _closure->_refs
$13 = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 1}, <No data fields>}
(gdb) p *_closure
$14 = {<google::protobuf::Closure> = {<No data fields>}, cntl = {<google::protobuf::RpcController> = {<No data fields>}, static FLAGS_IGNORE_EOVERCROWDED = 1, static FLAGS_SECURITY_MODE = 2,
    static FLAGS_ADDED_CONCURRENCY = 4, static FLAGS_READ_PROGRESSIVELY = 8, static FLAGS_PROGRESSIVE_READER = 16, static FLAGS_BACKUP_REQUEST = 32, static FLAGS_DESTROY_CID_IN_DONE = 128,
    static FLAGS_CLOSE_CONNECTION = 256, static FLAGS_LOG_ID = 512, static FLAGS_REQUEST_CODE = 1024, static FLAGS_PB_BYTES_TO_BASE64 = 2048, static FLAGS_ALLOW_DONE_TO_RUN_IN_PLACE = 4096,
    static FLAGS_USED_BY_RPC = 8192, static FLAGS_REQUEST_WITH_AUTH = 32768, static FLAGS_PB_JSONIFY_EMPTY_ARRAY = 65536, static FLAGS_ENABLED_CIRCUIT_BREAKER = 131072, _span = 0x0, _flags = 10240,
    _error_code = 0, _error_text = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
        _M_p = 0x4d70ba7b0 ""}, _M_string_length = 0, {_M_local_buf = "\000\377\377\377\377\377\377\377\000\000\000\000\000\000\000", _M_allocated_capacity = 18446744073709551360}}, _remote_side = {ip = {
        s_addr = 713365002}, port = 8060}, _local_side = {ip = {s_addr = 361312522}, port = 54256}, _session_local_data = 0x0, _server = 0x0, _oncancel_id = {value = 0}, _auth_context = 0x0,
    _mongo_session_data = {px = 0x0}, _rpc_dump_meta = 0x0, _request_protocol = brpc::PROTOCOL_BAIDU_STD, _max_retry = 3, _retry_policy = 0x0, _correlation_id = {value = 501484676562425},
    _connection_type = brpc::CONNECTION_TYPE_SINGLE, _fail_limit = -123456789, _pipelined_count = 0, _timeout_ms = 20000, _connect_timeout_ms = 200, _backup_request_ms = -1, _abstime_us = 1576230190582720,
    _timeout_id = 0, _begin_time_us = 1576230170582720, _end_time_us = 1576230170584451, _tos = 0, _preferred_index = 1, _request_compress_type = brpc::COMPRESS_TYPE_NONE,
    _response_compress_type = brpc::COMPRESS_TYPE_NONE, _log_id = 0, _pchan_sub_count = 0, _response = 0x4d70ba9d8, _done = 0x4d70ba780, _sender = 0x0, _request_code = 0, _single_server_id = 640, _lb = {
      px = 0x0}, _tmp_completion_info = {id = {value = 0}, responded = false}, _current_call = {nretry = 0, need_feedback = false, enable_circuit_breaker = false, touched_by_stream_creator = false,
      peer_id = 640, begin_time_us = 1576230170582720, sending_sock = {_M_t = {
          _M_t = {<std::_Tuple_impl<0, brpc::Socket*, brpc::SocketDeleter>> = {<std::_Tuple_impl<1, brpc::SocketDeleter>> = {<std::_Head_base<1, brpc::SocketDeleter, true>> = {<brpc::SocketDeleter> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, brpc::Socket*, false>> = {_M_head_impl = 0x0}, <No data fields>}, <No data fields>}}}, stream_user_data = 0x0}, _unfinished_call = 0x0,
    _accessed = 0x0, _stream_creator = 0x0,
    _pack_request = 0x21b69c0 <brpc::policy::PackRpcRequest(butil::IOBuf*, brpc::SocketMessage**, unsigned long, google::protobuf::MethodDescriptor const*, brpc::Controller*, butil::IOBuf const&, brpc::Authenticator const*)>, _method = 0x66dca380, _auth = 0x0, _request_buf = {static DEFAULT_BLOCK_SIZE = 8192, static INITIAL_CAP = 32, static INVALID_AREA = 0, {_bv = {magic = 6100, start = 34, refs = 0x4dae30000,
          nref = 0, cap_mask = 0, nbytes = 0}, _sv = {refs = {{offset = 6100, length = 34, block = 0x4dae30000}, {offset = 0, length = 0, block = 0x0}}}}}, _idl_names = {request_name = 0x2f5d3f0 "req",
      response_name = 0x2aed848 "res"}, _idl_result = 12345678987654321, _http_request = 0x0, _http_response = 0x0, _request_attachment = {static DEFAULT_BLOCK_SIZE = 8192, static INITIAL_CAP = 32,
      static INVALID_AREA = 0, {_bv = {magic = 0, start = 0, refs = 0x0, nref = 0, cap_mask = 0, nbytes = 0}, _sv = {refs = {{offset = 0, length = 0, block = 0x0}, {offset = 0, length = 0, block = 0x0}}}}},
    _response_attachment = {static DEFAULT_BLOCK_SIZE = 8192, static INITIAL_CAP = 32, static INVALID_AREA = 0, {_bv = {magic = 0, start = 0, refs = 0x0, nref = 0, cap_mask = 0, nbytes = 0}, _sv = {refs = {{
              offset = 0, length = 0, block = 0x0}, {offset = 0, length = 0, block = 0x0}}}}}, _wpa = {px = 0x0}, _rpa = {px = 0x0}, _request_stream = 18446744073709551615,
    _response_stream = 18446744073709551615, _remote_stream_settings = 0x0, _thrift_method_name = {static npos = 18446744073709551615,
      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x4d70ba9c8 ""}, _M_string_length = 0, {
        _M_local_buf = "\000\000\000\000\000\000\000\000`\005\000\000\000\000\000", _M_allocated_capacity = 0}}}, result = {<google::protobuf::Message> = {<No data fields>}, static kIndexInFileMessages = 1,
    static kStatusFieldNumber = 1,
    _internal_metadata_ = {<google::protobuf::internal::InternalMetadataWithArenaBase<google::protobuf::UnknownFieldSet, google::protobuf::internal::InternalMetadataWithArena>> = {ptr_ = 0x0,
        static kPtrTagMask = 1, static kPtrValueMask = -2}, <No data fields>}, _has_bits_ = {has_bits_ = {0}}, _cached_size_ = 0, status_ = 0x0}, _refs = {<std::__atomic_base<int>> = {
      static _S_alignment = 4, _M_i = 1}, <No data fields>}}

To Reproduce
Right row I can't find a stable way to reproduce the bug. The version we use is 0.10 with #1830 patched.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions