Description
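Describe the bug
When memory surges (100GB+), I captured the bthread stacks and observed that the argument values of the sched_to function had been unexpectedly modified. Memory also does not drop after the surge, which looks like a bthread leak.
A typical pair of stacks is shown below: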
Bthread 50683:
#0 bthread::TaskGroup::sched_to (pg=0x7ffc42cb5000, next_meta=0x0) at /home/project/third_party/brpc/src/bthread/task_group.cpp:662
#1 0x00005611f605ba89 in bthread::TaskGroup::sched_to (next_tid=, pg=0x7fb4b94cb888)
at /home/project/third_party/brpc/src/bthread/task_group_inl.h:79
#2 bthread::TaskGroup::sched (pg=pg@entry=0x7fb4b94cb888) at /home/project/third_party/brpc/src/bthread/task_group.cpp:600
#3 0x00005611f6226f7e in bthread::butex_wait (arg=arg@entry=0x7fc7b2d11d80, expected_value=expected_value@entry=257, abstime=abstime@entry=0x0, prepend=prepend@entry=false)
at /home/project/third_party/brpc/src/bthread/butex.cpp:705
#4 0x00005611f60518c5 in bthread::mutex_lock_contended_impl (abstime=0x0, m=) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1068
#5 bthread_mutex_lock_impl (m=, abstime=0x0) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1220
#6 0x00005611f5dadde6 in bthread::Mutex::lock (this=) at /home/project/deps/brpc/include/bthread/mutex.h:59
#7 0x00005611f5dabc54 in comm::LockGuard::LockGuard (mtx=..., this=) at /home/project/src/comm/mutex.h:14
#8 extnode::ECExtent::<lambda(comm::ErrorCode, uint32_t, s2s::dn::ReadAtExtentResponse*, bool)>::operator()(comm::ErrorCode, uint32_t, s2s::dn::ReadAtExtentResponse*, bool) const (__closure=0x7fcb58d73cd0, error_code=comm::REPLICA_PEER_CALL_ERR, shard_id=6, read_at_rsp=0x7fc89056ea00, reissue=)
at /home/project/src/extentnode/redundant/ec_extent.cc:568
#9 0x00005611f5da0c9a in std::function<void (comm::ErrorCode, unsigned int, s2s::dn::ReadAtExtentResponse*, bool)>::operator()(comm::ErrorCode, unsigned int, s2s::dn::ReadAtExtentResponse*, bool) const (__args#3=, __args#2=, __args#1=, __args#0=, this=0x7fcb5847a498)
at /usr/include/c++/9/bits/std_function.h:683
#10 extnode::ECExtent::<lambda(comm::ErrorCode)>::operator()(comm::ErrorCode) const (__closure=0x7fcb5847a450, error_code=comm::REPLICA_PEER_CALL_ERR)
at /home/project/src/extentnode/redundant/ec_extent.cc:875
#11 0x00005611f5d3f3e8 in std::function<void (comm::ErrorCode)>::operator()(comm::ErrorCode) const (__args#0=, this=)
at /usr/include/c++/9/bits/std_function.h:683
#12 extnode::BrpcShardCaller::<lambda()>::operator() (__closure=, __closure=)
at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:162
#13 std::_Function_handler<void(), extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, comm::Callback)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#14 0x00005611f5d40ddc in std::function<void ()>::operator()() const (this=0x7fb4b94cbda0) at /usr/include/c++/9/bits/std_function.h:683
#15 comm::DeferHelper::~DeferHelper (this=0x7fb4b94cbda0, __in_chrg=) at /home/project/src/comm/defer.h:12
#16 extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)>) (this=,
ctx=, cntl=0x7f874bebdd00, rpc_rsp=, cb=...) at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:181
#17 0x00005611f5d445f1 in brpc::internal::MethodClosure4<extnode::BrpcShardCaller, extnode::BrpcShardCaller*, comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)> >::Run() (this=0x7fc55ff2eb70) at /usr/include/c++/9/bits/std_function.h:564
#18 0x00005611f6072c1b in brpc::Controller::EndRPC (this=0x7f874bebdd00, info=...) at /home/project/third_party/brpc/src/brpc/controller.cpp:968
#19 0x00005611f6072ef4 in brpc::Controller::RunEndRPC (arg=) at /home/project/third_party/brpc/src/brpc/controller.cpp:757
#20 0x00005611f605bdc7 in bthread::TaskGroup::task_runner (skip_remained=) at /home/project/third_party/brpc/src/bthread/task_group.cpp:305
#21 0x00005611f62288b1 in bthread_make_fcontext ()
#22 0x0000000000000000 in ?? ()
Bthread 50684:
#0 bthread::TaskGroup::sched_to (pg=0x7ffc42cb5000, next_meta=0x0) at /home/project/third_party/brpc/src/bthread/task_group.cpp:662
#1 0x00005611f605ba89 in bthread::TaskGroup::sched_to (next_tid=, pg=0x7fc118bcb7c8)
at /home/project/third_party/brpc/src/bthread/task_group_inl.h:79
#2 bthread::TaskGroup::sched (pg=pg@entry=0x7fc118bcb7c8) at /home/project/third_party/brpc/src/bthread/task_group.cpp:600
#3 0x00005611f6226f7e in bthread::butex_wait (arg=arg@entry=0x7fc6552c6d80, expected_value=expected_value@entry=257, abstime=abstime@entry=0x0, prepend=prepend@entry=false)
at /home/project/third_party/brpc/src/bthread/butex.cpp:705
#4 0x00005611f60518c5 in bthread::mutex_lock_contended_impl (abstime=0x0, m=) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1068
#5 bthread_mutex_lock_impl (m=, abstime=0x0) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1220
#6 0x00005611f5dadde6 in bthread::Mutex::lock (this=) at /home/project/deps/brpc/include/bthread/mutex.h:59
#7 0x00005611f5da0de5 in comm::LockGuard::LockGuard (mtx=..., this=) at /home/project/src/comm/mutex.h:14
#8 extnode::ECExtent::<lambda(comm::ErrorCode)>::operator()(comm::ErrorCode) const (__closure=0x7fc96cb98ad0, error_code=)
at /home/project/src/extentnode/redundant/ec_extent.cc:890
#9 0x00005611f5d3f3e8 in std::function<void (comm::ErrorCode)>::operator()(comm::ErrorCode) const (__args#0=, this=)
at /usr/include/c++/9/bits/std_function.h:683
#10 extnode::BrpcShardCaller::<lambda()>::operator() (__closure=, __closure=)
at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:162
#11 std::_Function_handler<void(), extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, comm::Callback)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#12 0x00005611f5d40ddc in std::function<void ()>::operator()() const (this=0x7fc118bcbab0) at /usr/include/c++/9/bits/std_function.h:683
#13 comm::DeferHelper::~DeferHelper (this=0x7fc118bcbab0, __in_chrg=) at /home/project/src/comm/defer.h:12
#14 extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)>) (this=,
ctx=, cntl=0x7fc9df812620, rpc_rsp=, cb=...) at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:181
#15 0x00005611f5d445f1 in brpc::internal::MethodClosure4<extnode::BrpcShardCaller, extnode::BrpcShardCaller*, comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)> >::Run() (this=0x7fc9df812970) at /usr/include/c++/9/bits/std_function.h:564
#16 0x00005611f6072c1b in brpc::Controller::EndRPC (this=0x7fc9df812620, info=...) at /home/project/third_party/brpc/src/brpc/controller.cpp:968
#17 0x00005611f607436e in brpc::Controller::OnVersionedRPCReturned (this=this@entry=0x7fc9df812620, info=..., new_bthread=new_bthread@entry=false,
saved_error=saved_error@entry=0) at /home/project/third_party/brpc/src/brpc/controller.cpp:751
#18 0x00005611f60a76ab in brpc::ControllerPrivateAccessor::OnResponse (this=, saved_error=0, id=...)
at /home/project/third_party/brpc/src/brpc/details/controller_private_accessor.h:48
#19 brpc::policy::ProcessRpcResponse (msg_base=0x7fcb90062040) at /home/project/third_party/brpc/src/brpc/policy/baidu_rpc_protocol.cpp:819
#20 0x00005611f609d2db in brpc::ProcessInputMessage (void_arg=) at /home/project/third_party/brpc/src/brpc/input_messenger.cpp:184
#21 0x00005611f609f94b in brpc::InputMessenger::OnNewMessages (m=0x7fcb2cb32440) at /usr/include/c++/9/bits/atomic_base.h:493
#22 0x00005611f6180f95 in brpc::Socket::ProcessEvent (arg=0x7fcb2cb32440) at /home/project/third_party/brpc/src/brpc/socket.cpp:1196
#23 0x00005611f605bdc7 in bthread::TaskGroup::task_runner (skip_remained=) at /home/project/third_party/brpc/src/bthread/task_group.cpp:305
#24 0x00005611f62288b1 in bthread_make_fcontext ()
#25 0x0000000000000000 in ?? ()
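Two things are puzzling:
1. When the sched_to overload that takes next_tid calls the sched_to overload that takes next_meta, the value of pg changes. In every stack, the changed value is 0x7ffc42cb5000.
2. The value of next_meta is 0x0 in every stack.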
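To Reproduce
In my service environment, trigger a large volume of read requests (for example 33K QPS) so that roughly 30 Gb/s of data is read from the NIC, and sustain this for 60~120 s. The data is processed locally, and the processing logic uses mutexes and condition variables, so the service's memory spikes to 100GB+ within a short time.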
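To make the scenario above easier to reason about, here is a minimal, self-contained sketch of the pattern. It is not the service's code: std::thread and std::mutex stand in for bthreads and bthread::Mutex, and the names BurstResult and simulate_blocked_callbacks are hypothetical. It only illustrates that, while the lock is unavailable, every blocked callback keeps its response payload alive, so the memory held grows with the number of waiting callbacks.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Result of one simulated burst of callbacks (hypothetical type).
struct BurstResult {
    std::size_t peak_bytes;   // bytes held by callbacks while the lock was unavailable
    std::size_t final_bytes;  // bytes still accounted for after every callback finished
};

// Assumption: std::thread / std::mutex model bthreads / bthread::Mutex.
// Each "callback" owns a payload, then blocks on a mutex held by a slow owner.
BurstResult simulate_blocked_callbacks(std::size_t callbacks, std::size_t bytes_each) {
    std::mutex mtx;
    std::atomic<std::size_t> held_bytes{0};
    std::atomic<std::size_t> arrived{0};
    std::vector<std::thread> workers;
    workers.reserve(callbacks);

    std::unique_lock<std::mutex> holder(mtx);  // the slow owner keeps the lock
    for (std::size_t i = 0; i < callbacks; ++i) {
        workers.emplace_back([&, bytes_each] {
            std::vector<char> response(bytes_each, 0);  // payload owned by this callback
            held_bytes.fetch_add(response.size());
            arrived.fetch_add(1);
            std::lock_guard<std::mutex> guard(mtx);     // blocks until the owner unlocks
            held_bytes.fetch_sub(response.size());      // payload is consumed under the lock
        });
    }

    // Wait until every callback has allocated its payload and is about to block.
    while (arrived.load() < callbacks) {
        std::this_thread::yield();
    }
    const std::size_t peak = held_bytes.load();
    assert(peak == callbacks * bytes_each);  // nobody can release before the unlock

    holder.unlock();
    for (auto& t : workers) {
        t.join();
    }
    return BurstResult{peak, held_bytes.load()};
}
```

Calling simulate_blocked_callbacks(8, 1024) reports a peak of 8 * 1024 bytes and zero bytes once all callbacks have run. In this model the memory is released as soon as the lock is handed over; in the real service it is not, which is what makes the stacks above suspicious.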
Expected behavior
Versions
OS: 5.15.0
Compiler: c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
brpc: v1.12.1
protobuf:
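Additional context/screenshots
Output of the two different pg structures
Code location of the top sched_to stack frame
Monitoring: the bthread count surges and does not drop afterwards (consistent with memory not dropping)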
