Skip to content

test: fix the race of release read filter in FakeRawConnection#26099

Merged
kyessenov merged 4 commits intoenvoyproxy:mainfrom
soulxu:fix_fake_upstream
Mar 24, 2023
Merged

test: fix the race of release read filter in FakeRawConnection#26099
kyessenov merged 4 commits intoenvoyproxy:mainfrom
soulxu:fix_fake_upstream

Conversation

@soulxu
Copy link
Copy Markdown
Member

@soulxu soulxu commented Mar 15, 2023

Commit Message: test: fix the race of release read filter in FakeRawConnection
Additional Description:
There is a race when the destructor of FakeRawConnection is invoked in the test main thread.

The shared ptr of FakeRawConnection::read_filter_ will be released both in the test main thread and upstream connection's own thread by calling removeReadFilter.

This PR just use the shared ptr for FakeRawConnection::read_filter_, then after the read_filter_ move into the lambda, there will be no release work for the read_filter in the test main thread anymore.

Risk Level: low
Testing: integration test
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes part of #26082

Signed-off-by: He Jie Xu <hejie.xu@intel.com>
@soulxu
Copy link
Copy Markdown
Member Author

soulxu commented Mar 15, 2023

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #26099 (comment) was created by @soulxu.

see: more, trace.

@soulxu
Copy link
Copy Markdown
Member Author

soulxu commented Mar 15, 2023

/wait

emm... I don't think I got the right fix.

@soulxu
Copy link
Copy Markdown
Member Author

soulxu commented Mar 16, 2023

The race happened between the destructor of shared_ptr and weak_ptr

__on_zero_shared will be called by shared_ptr. __shared_ptr_pointer will be called by weak_ptr

template <class _Tp, class _Dp, class _Alloc>
void
__shared_ptr_pointer<_Tp, _Dp, _Alloc>::__on_zero_shared() _NOEXCEPT
{
    __data_.first().second()(__data_.first().first());
    __data_.first().second().~_Dp();
}

template <class _Tp, class _Dp, class _Alloc>
void
__shared_ptr_pointer<_Tp, _Dp, _Alloc>::__on_zero_shared_weak() _NOEXCEPT
{
    typedef typename __allocator_traits_rebind<_Alloc, __shared_ptr_pointer>::type _Al;
    typedef allocator_traits<_Al> _ATraits;
    typedef pointer_traits<typename _ATraits::pointer> _PTraits;

    _Al __a(__data_.second());
    __data_.second().~_Alloc();
    __a.deallocate(_PTraits::pointer_to(*this), 1);
}

But seems like the shared_ptr race is at line of calling on_zero_shared, not inside the on_zero_shared

#0 std::__1::__shared_count::__release_shared() /home/hejiexu/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04/bin/../include/c++/v1/__memory/shared_ptr.h:174:9 (listener_lds_integration_test+0x32ceb30)

the weak_ptr part is at line of deallocate

__a.deallocate(_PTraits::pointer_to(*this), 1);
#5 std::__1::__shared_ptr_pointer<Envoy::FakeRawConnection::ReadFilter*, std::__1::shared_ptr<Envoy::Network::ReadFilter>::__shared_ptr_default_delete<Envoy::Network::ReadFilter, Envoy::FakeRawConnection::ReadFilter>, std::__1::allocator<Envoy::FakeRawConnection::ReadFilter> >::__on_zero_shared_weak() /home/hejiexu/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04/bin/../include/c++/v1/__memory/shared_ptr.h:278:9 (listener_lds_integration_test+0x36f6aa0

The shared_ptr and weak_ptr share the same control block __shared_ptr_pointer. So that 8 size race data is pointed to the control block __shared_ptr_pointer itself.

template <class _Tp, class _Dp, class _Alloc>
class __shared_ptr_pointer
    : public __shared_weak_count
``

@soulxu
Copy link
Copy Markdown
Member Author

soulxu commented Mar 16, 2023

finally I think this is a way to fix it. This is ready for review.

@soulxu
Copy link
Copy Markdown
Member Author

soulxu commented Mar 16, 2023

one more note:

From https://en.cppreference.com/w/cpp/memory/shared_ptr

The destructor of shared_ptr decrements the number of shared owners of the control block. If that counter reaches zero, the control block calls the destructor of the managed object. The control block does not deallocate itself until the [std::weak_ptr](https://en.cppreference.com/w/cpp/memory/weak_ptr) counter reaches zero as well.
``

And then it isn't thread safety even for shared_ptr itself.

@phlax phlax mentioned this pull request Mar 16, 2023

std::string data_ ABSL_GUARDED_BY(lock_);
std::weak_ptr<Network::ReadFilter> read_filter_;
std::shared_ptr<Network::ReadFilter> read_filter_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connections don't own the filters in the production code. This seems like it'll reduce coverage for cases when filters get deleted earlier than before.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kyessenov thanks for the review!

Sorry, I wasn't sure I understand your comment. Do you mean if we remove the filters before the end of connection, the read_filter won't be released as expected since it is shared ptr?

The FakeRawConnection has a shorter lifetime than the shared connection, so the read filter will be removed early than the connection end. And the read filter at here is just for removeReadFilter to query the correct filter.

Is there any case the filter will be removed and not by the FakeRawConnection end?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but this comment makes me concerned:

  // If the filter was already deleted, it means the shared_connection_ was too, so don't try to
  // access it.

But then shared connection is not a weak ptr, so is the comment out of date?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comes from another flake #14067

Copy link
Copy Markdown
Member Author

@soulxu soulxu Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but this comment makes me concerned:

  // If the filter was already deleted, it means the shared_connection_ was too, so don't try to
  // access it.

But then shared connection is not a weak ptr, so is the comment out of date?

I updated the check if (read_filter_ != nullptr && read_filter_.use_count() > 1) {, it should be safer.

But I still didn't see the shared_connection is gonna be free before the FakeRawConnection, since FakeRawConnection is a reference to the SharedConnectionWrapper, and SharedConnectionWrapper is a reference to Network::Connection. So if the Network::Connection is released early, all those references are broken also.

But the Network::Connection can end early indeed, I think it protected by a flag in the SharedConnectionWrapper

disconnected_ = true;

When you still want to execute something on the disconnect connection, that flag protect that

if (disconnected_) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use disconnected_ flag in the fake raw connection destructor, too?
I think the idea is that we only need to remove the filter if the connection is alive.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I can use the disconnected_ and it protected by lock also.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm... actually, executeOnDispatcher check the disconnected_

testing::AssertionResult
executeOnDispatcher(std::function<void(Network::Connection&)> f,
std::chrono::milliseconds timeout = TestUtility::DefaultTimeout,
bool allow_disconnects = true) {
absl::MutexLock lock(&lock_);
if (disconnected_) {
return testing::AssertionSuccess();
}

So it should be safe just call executeOnDispatcher directly.

soulxu added 2 commits March 21, 2023 00:16
Signed-off-by: He Jie Xu <hejie.xu@intel.com>
Signed-off-by: He Jie Xu <hejie.xu@intel.com>
Comment thread test/integration/fake_upstream.cc Outdated
// If the filter was already deleted, it means the shared_connection_ was too, so don't try to
// access it.
if (auto filter = read_filter_.lock(); filter != nullptr) {
if (read_filter_ != nullptr && read_filter_.use_count() > 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use_count is not safe for multi-threaded environments as far as I remember. weak_ptr is safe.

Signed-off-by: He Jie Xu <hejie.xu@intel.com>
@soulxu
Copy link
Copy Markdown
Member Author

soulxu commented Mar 24, 2023

@kyessenov gentle ping :)

Copy link
Copy Markdown
Contributor

@kyessenov kyessenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix!

@kyessenov
Copy link
Copy Markdown
Contributor

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #26099 (comment) was created by @kyessenov.

see: more, trace.

@kyessenov kyessenov enabled auto-merge (squash) March 24, 2023 05:50
@kyessenov kyessenov merged commit 81b674c into envoyproxy:main Mar 24, 2023
ashishb-90 pushed a commit to ashishb-90/envoy that referenced this pull request Mar 24, 2023
…proxy#26099)

Commit Message: test: fix the race of release read filter in FakeRawConnection
Additional Description:
There is a race when the destructor of FakeRawConnection is invoked in the test main thread.

The shared ptr of FakeRawConnection::read_filter_ will be released both in the test main thread and upstream connection's own thread by calling removeReadFilter.

This PR just use the shared ptr for FakeRawConnection::read_filter_, then after the read_filter_ move into the lambda, there will be no release work for the read_filter in the test main thread anymore.

Risk Level: low
Testing: integration test
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes part of envoyproxy#26082
Signed-off-by: Ashish Banerjee <ashish.banerjee@solo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants