Under a memory surge, the pg argument of sched_to is unexpectedly modified #2925

@cbsheng

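Describe the bug
During a memory surge (100 GB+), I captured the bthread stacks and saw that the argument values of sched_to had been unexpectedly modified. Memory also does not drop after the surge, which looks like a bthread leak.
Typical stacks look like this: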

Bthread 50683:
#0 bthread::TaskGroup::sched_to (pg=0x7ffc42cb5000, next_meta=0x0) at /home/project/third_party/brpc/src/bthread/task_group.cpp:662
#1 0x00005611f605ba89 in bthread::TaskGroup::sched_to (next_tid=, pg=0x7fb4b94cb888)
at /home/project/third_party/brpc/src/bthread/task_group_inl.h:79
#2 bthread::TaskGroup::sched (pg=pg@entry=0x7fb4b94cb888) at /home/project/third_party/brpc/src/bthread/task_group.cpp:600
#3 0x00005611f6226f7e in bthread::butex_wait (arg=arg@entry=0x7fc7b2d11d80, expected_value=expected_value@entry=257, abstime=abstime@entry=0x0, prepend=prepend@entry=false)
at /home/project/third_party/brpc/src/bthread/butex.cpp:705
#4 0x00005611f60518c5 in bthread::mutex_lock_contended_impl (abstime=0x0, m=) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1068
#5 bthread_mutex_lock_impl (m=, abstime=0x0) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1220
#6 0x00005611f5dadde6 in bthread::Mutex::lock (this=) at /home/project/deps/brpc/include/bthread/mutex.h:59
#7 0x00005611f5dabc54 in comm::LockGuard::LockGuard (mtx=..., this=) at /home/project/src/comm/mutex.h:14
#8 extnode::ECExtent::<lambda(comm::ErrorCode, uint32_t, s2s::dn::ReadAtExtentResponse*, bool)>::operator()(comm::ErrorCode, uint32_t, s2s::dn::ReadAtExtentResponse*, bool) const (__closure=0x7fcb58d73cd0, error_code=comm::REPLICA_PEER_CALL_ERR, shard_id=6, read_at_rsp=0x7fc89056ea00, reissue=)
at /home/project/src/extentnode/redundant/ec_extent.cc:568
#9 0x00005611f5da0c9a in std::function<void (comm::ErrorCode, unsigned int, s2s::dn::ReadAtExtentResponse*, bool)>::operator()(comm::ErrorCode, unsigned int, s2s::dn::ReadAtExtentResponse*, bool) const (__args#3=, __args#2=, __args#1=, __args#0=, this=0x7fcb5847a498)
at /usr/include/c++/9/bits/std_function.h:683
#10 extnode::ECExtent::<lambda(comm::ErrorCode)>::operator()(comm::ErrorCode) const (__closure=0x7fcb5847a450, error_code=comm::REPLICA_PEER_CALL_ERR)
at /home/project/src/extentnode/redundant/ec_extent.cc:875
#11 0x00005611f5d3f3e8 in std::function<void (comm::ErrorCode)>::operator()(comm::ErrorCode) const (__args#0=, this=)
at /usr/include/c++/9/bits/std_function.h:683
#12 extnode::BrpcShardCaller::<lambda()>::operator() (__closure=, __closure=)
at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:162
#13 std::_Function_handler<void(), extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, comm::Callback)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#14 0x00005611f5d40ddc in std::function<void ()>::operator()() const (this=0x7fb4b94cbda0) at /usr/include/c++/9/bits/std_function.h:683
#15 comm::DeferHelper::~DeferHelper (this=0x7fb4b94cbda0, __in_chrg=) at /home/project/src/comm/defer.h:12
#16 extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)>) (this=,
ctx=, cntl=0x7f874bebdd00, rpc_rsp=, cb=...) at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:181
#17 0x00005611f5d445f1 in brpc::internal::MethodClosure4<extnode::BrpcShardCaller, extnode::BrpcShardCaller*, comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)> >::Run() (this=0x7fc55ff2eb70) at /usr/include/c++/9/bits/std_function.h:564
#18 0x00005611f6072c1b in brpc::Controller::EndRPC (this=0x7f874bebdd00, info=...) at /home/project/third_party/brpc/src/brpc/controller.cpp:968
#19 0x00005611f6072ef4 in brpc::Controller::RunEndRPC (arg=) at /home/project/third_party/brpc/src/brpc/controller.cpp:757
#20 0x00005611f605bdc7 in bthread::TaskGroup::task_runner (skip_remained=) at /home/project/third_party/brpc/src/bthread/task_group.cpp:305
#21 0x00005611f62288b1 in bthread_make_fcontext ()
#22 0x0000000000000000 in ?? ()

Bthread 50684:
#0 bthread::TaskGroup::sched_to (pg=0x7ffc42cb5000, next_meta=0x0) at /home/project/third_party/brpc/src/bthread/task_group.cpp:662
#1 0x00005611f605ba89 in bthread::TaskGroup::sched_to (next_tid=, pg=0x7fc118bcb7c8)
at /home/project/third_party/brpc/src/bthread/task_group_inl.h:79
#2 bthread::TaskGroup::sched (pg=pg@entry=0x7fc118bcb7c8) at /home/project/third_party/brpc/src/bthread/task_group.cpp:600
#3 0x00005611f6226f7e in bthread::butex_wait (arg=arg@entry=0x7fc6552c6d80, expected_value=expected_value@entry=257, abstime=abstime@entry=0x0, prepend=prepend@entry=false)
at /home/project/third_party/brpc/src/bthread/butex.cpp:705
#4 0x00005611f60518c5 in bthread::mutex_lock_contended_impl (abstime=0x0, m=) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1068
#5 bthread_mutex_lock_impl (m=, abstime=0x0) at /home/project/third_party/brpc/src/bthread/mutex.cpp:1220
#6 0x00005611f5dadde6 in bthread::Mutex::lock (this=) at /home/project/deps/brpc/include/bthread/mutex.h:59
#7 0x00005611f5da0de5 in comm::LockGuard::LockGuard (mtx=..., this=) at /home/project/src/comm/mutex.h:14
#8 extnode::ECExtent::<lambda(comm::ErrorCode)>::operator()(comm::ErrorCode) const (__closure=0x7fc96cb98ad0, error_code=)
at /home/project/src/extentnode/redundant/ec_extent.cc:890
#9 0x00005611f5d3f3e8 in std::function<void (comm::ErrorCode)>::operator()(comm::ErrorCode) const (__args#0=, this=)
at /usr/include/c++/9/bits/std_function.h:683
#10 extnode::BrpcShardCaller::<lambda()>::operator() (__closure=, __closure=)
at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:162
#11 std::_Function_handler<void(), extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, comm::Callback)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#12 0x00005611f5d40ddc in std::function<void ()>::operator()() const (this=0x7fc118bcbab0) at /usr/include/c++/9/bits/std_function.h:683
#13 comm::DeferHelper::~DeferHelper (this=0x7fc118bcbab0, __in_chrg=) at /home/project/src/comm/defer.h:12
#14 extnode::BrpcShardCaller::readAtRpcCb(comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)>) (this=,
ctx=, cntl=0x7fc9df812620, rpc_rsp=, cb=...) at /home/project/src/extentnode/rpc_caller/brpc_extnode_caller.cc:181
#15 0x00005611f5d445f1 in brpc::internal::MethodClosure4<extnode::BrpcShardCaller, extnode::BrpcShardCaller*, comm::Ctx*, brpc::Controller*, s2s::dn::ReadAtExtentResponse*, std::function<void (comm::ErrorCode)> >::Run() (this=0x7fc9df812970) at /usr/include/c++/9/bits/std_function.h:564
#16 0x00005611f6072c1b in brpc::Controller::EndRPC (this=0x7fc9df812620, info=...) at /home/project/third_party/brpc/src/brpc/controller.cpp:968
#17 0x00005611f607436e in brpc::Controller::OnVersionedRPCReturned (this=this@entry=0x7fc9df812620, info=..., new_bthread=new_bthread@entry=false,
saved_error=saved_error@entry=0) at /home/project/third_party/brpc/src/brpc/controller.cpp:751
#18 0x00005611f60a76ab in brpc::ControllerPrivateAccessor::OnResponse (this=, saved_error=0, id=...)
at /home/project/third_party/brpc/src/brpc/details/controller_private_accessor.h:48
#19 brpc::policy::ProcessRpcResponse (msg_base=0x7fcb90062040) at /home/project/third_party/brpc/src/brpc/policy/baidu_rpc_protocol.cpp:819
#20 0x00005611f609d2db in brpc::ProcessInputMessage (void_arg=) at /home/project/third_party/brpc/src/brpc/input_messenger.cpp:184
#21 0x00005611f609f94b in brpc::InputMessenger::OnNewMessages (m=0x7fcb2cb32440) at /usr/include/c++/9/bits/atomic_base.h:493
#22 0x00005611f6180f95 in brpc::Socket::ProcessEvent (arg=0x7fcb2cb32440) at /home/project/third_party/brpc/src/brpc/socket.cpp:1196
#23 0x00005611f605bdc7 in bthread::TaskGroup::task_runner (skip_remained=) at /home/project/third_party/brpc/src/bthread/task_group.cpp:305
#24 0x00005611f62288b1 in bthread_make_fcontext ()
#25 0x0000000000000000 in ?? ()

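Two things puzzle me:
1. When the sched_to overload that takes next_tid calls the sched_to overload that takes next_meta, the value of pg changes. In every stack, the changed value is 0x7ffc42cb5000.
2. next_meta is 0x0 in every stack.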
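One small observation that may help when reading these frames: the changed pg value is a multiple of 4096, while the pg values in the caller frames are not. The sketch below only checks that arithmetic. The helper name is_page_aligned and the 4 KiB page size are assumptions for illustration, and the result by itself does not identify the cause.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when addr is a multiple of page_size.
// page_size is assumed to be a power of two (4 KiB by default).
constexpr bool is_page_aligned(std::uint64_t addr,
                               std::uint64_t page_size = 4096) {
    return (addr & (page_size - 1)) == 0;
}

// pg as shown in frame #0 of both backtraces.
static_assert(is_page_aligned(0x7ffc42cb5000ULL),
              "frame #0 pg is a multiple of 4096");

// pg as shown in frames #1 and #2 of the two backtraces.
static_assert(!is_page_aligned(0x7fb4b94cb888ULL),
              "caller pg of Bthread 50683 is not a multiple of 4096");
static_assert(!is_page_aligned(0x7fc118bcb7c8ULL),
              "caller pg of Bthread 50684 is not a multiple of 4096");
```

The same check can be applied to the pg values in other captured stacks to see whether the pattern holds for all of them.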

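To Reproduce
In my service environment, trigger a large volume of read requests (for example 33K QPS), reading about 30 Gb/s from the NIC, sustained for 60 to 120 s. The data that is read is processed locally, and the processing logic uses mutexes and condition variables, so the service's memory surges to 100 GB+ within a short time.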
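For readers without access to the service, the locking shape in the backtraces (many completion callbacks taking the same lock, with a waiter on a condition variable) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses std::mutex, std::condition_variable and std::thread as stand-ins for the bthread primitives, the names SharedState, on_read_done and run_contention are hypothetical, and it is not expected to reproduce the problem.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the state that the callbacks guard with one lock.
struct SharedState {
    std::mutex mtx;
    std::condition_variable cv;
    std::size_t completed = 0;
};

// Hypothetical completion callback: every call takes the same lock,
// mirroring the LockGuard frames in the backtraces.
void on_read_done(SharedState& st) {
    {
        std::lock_guard<std::mutex> guard(st.mtx);
        ++st.completed;
    }
    st.cv.notify_one();
}

// Starts `workers` threads that each fire `per_worker` callbacks, waits on
// the condition variable until all callbacks have run, and returns the count.
std::size_t run_contention(std::size_t workers, std::size_t per_worker) {
    SharedState st;
    std::vector<std::thread> threads;
    threads.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&st, per_worker] {
            for (std::size_t i = 0; i < per_worker; ++i) {
                on_read_done(st);
            }
        });
    }
    const std::size_t expected = workers * per_worker;
    {
        std::unique_lock<std::mutex> lock(st.mtx);
        st.cv.wait(lock, [&st, expected] { return st.completed == expected; });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return st.completed;
}
```

In the real service the callbacks run on bthreads and are triggered by RPC responses, so the contention and memory behavior will differ from this thread-based sketch.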

Expected behavior

Versions
OS: 5.15.0
Compiler: c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
brpc: v1.12.1
protobuf:

Additional context/screenshots
Output of the two different pg structures:

Image

Code location of the topmost sched_to stack frame:

Image

Monitoring: the bthread count surges and then does not drop (consistent with memory not dropping):

Image
