
upstream: fix for race condition in tcp conn management#30807

Merged
alyssawilk merged 5 commits into envoyproxy:main from cpakulski:tcp_pool_race
Nov 21, 2023

Conversation

@cpakulski
Contributor

@cpakulski cpakulski commented Nov 9, 2023

Commit Message:
fix for race condition in tcp conn management

Additional Description:
I discovered this race in a very simple setup:
2 clients -> [listener->tcp_proxy->cluster->endpoint] -> server
The clients run in a loop and open short-lived TCP sessions.
An Envoy debug build triggers an assertion. A release build does not crash but, as explained below, it may lead to unexpected behaviour.

Here is an analysis of the problem. As with any race condition, it is hard to describe without going into details.
The root cause of the race is that when an upstream TCP session terminates, ConnPoolImplBase::checkForIdleAndNotify is run twice.

It is run the first time when a Local/RemoteClose happens in https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/conn_pool/conn_pool_base.cc#L549.
ConnPoolImplBase::checkForIdleAndNotify iterates through the registered idle_callbacks_. The callback is a lambda, with captured host and hash_key, which removes the TCP connection pool: source/common/upstream/cluster_manager_impl.cc.

The second time, ConnPoolImplBase::checkForIdleAndNotify is called from Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient via ConnPoolImplBase::checkForIdleAndCloseIdleConnsIfDraining. It again goes through the registered idle callbacks and tries to remove the TCP connection pool based on hash_key; normally the pool was already removed during the first call to ConnPoolImplBase::checkForIdleAndNotify, so this is a no-op.

The problem happens when a new connection arrives between the first and second calls to ConnPoolImplBase::checkForIdleAndNotify. The new connection allocates a new TCP connection pool with the same hash_key as the one just deleted in the first call. Then the second call to ConnPoolImplBase::checkForIdleAndNotify invokes the registered callbacks a second time. The callback again tries to delete the connection pool indexed by hash_key, and this time it finds the newly allocated pool that was created between the two calls.

There are a few ways to make sure that the callbacks run only once when the upstream TCP session terminates: a boolean flag is one possibility, or clearing the list of callbacks after they have been invoked.

@alyssawilk @ggreenway

Risk Level: Low
Testing: Manual. Before the fix, the assertion triggered within 5 minutes. After the fix, the crash does not happen.
Docs Changes: No
Release Notes: No
Platform Specific Features: No
Fixes: #22583

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski cpakulski marked this pull request as draft November 9, 2023 15:47
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski cpakulski marked this pull request as ready for review November 9, 2023 18:40
@adisuissa
Contributor

@ggreenway can you PTAL?

Member

@ggreenway ggreenway left a comment


Thank you for the detailed description of what's happening!

I think the fix you have here makes sense and is straightforward.

Is it possible to create a test to validate that the idle callback cannot be called multiple times?

Also, can you add a comment with your analysis of this bug next to the fix so a future reader understands the issue? I don't want this knowledge to be lost.

/wait

@ggreenway
Member

@alyssawilk can you also take a look at this fix and double-check that it is the correct fix?

@cpakulski
Contributor Author

@ggreenway thanks for reviewing. I will try to add a unit test to make sure that callbacks are called exactly once regardless of the path taken (Local/RemoteClose or destructor).

alyssawilk
alyssawilk previously approved these changes Nov 16, 2023
Contributor

@alyssawilk alyssawilk left a comment


LGTM though yeah it'd be good to have a unit test especially as I'm not sure if there's a non-bug situation where it should get called twice so we may just be masking a larger issue.

@alyssawilk
Contributor

/wait on test

@kyessenov
Contributor

kyessenov commented Nov 17, 2023

The root cause is the same for #22583 and istio/proxy#5151.

@cpakulski
Contributor Author

I will finish this PR by Monday and we can do final review.

@cpakulski
Contributor Author

Updated PR description with "Fixes ...". Thanks for linking those issues @kyessenov.

cb();
}
// Clear callbacks, so they are not executed if checkForIdleAndNotify is called again.
idle_callbacks_.clear();
Contributor


My fix was to put

if (deferred_deleting_) {
  return;
}

since it also protects against future accidents. Either way SGTM. Can I suggest changing the debug log to indicate whether it's firing or not? It's useful for debugging.

Contributor Author


Yeah, makes sense. I will change the log.

Updated comments.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Contributor Author

/retest

@cpakulski
Contributor Author

I have added a new unit test to verify that callbacks are called exactly once, regardless of whether the upstream connection has been fully established (Connected, Local/Remote Reset events) or only partially created (only the ActiveTcpClient constructor and destructor have been called).

I think this PR is finished and ready for the final review.

concurrent_streams_, false);

testing::MockFunction<void()> idle_pool_callback;
EXPECT_CALL(idle_pool_callback, Call());
Contributor


Just to check: did this fail before? I'm not sure whether, with nice mocks, calling twice would pass. If so, LGTM.

Contributor Author


No, it did not fail before. This test verifies that callbacks are called when a client does not go through the full connect cycle (Connected, Local/Remote Reset). If the callbacks were not called here, it would leave a "hanging" conn pool. This is just a safeguard in case the logic is modified in the future and, for some reason, callbacks are not called for partially created upstream clients.
Other tests failed because the callbacks were called twice; those tests have been corrected: https://github.com/envoyproxy/envoy/pull/30807/files/a1e37c378b6d16d1323bba6ec61d8fd06800adb5#diff-66377346b7f86ba261b6008401df93fbc31e0036d98f225f7ce1a0ee2198afebL500

Contributor

@alyssawilk alyssawilk left a comment


hm, it's a clear improvement and I guess as long as it's tested

@alyssawilk
Contributor

@ggreenway has my LGTM, but auto-merge is blocked by your requested changes. Resolve when you have a few?

@alyssawilk alyssawilk dismissed ggreenway’s stale review November 21, 2023 13:31

1 day timeout =P

@alyssawilk alyssawilk merged commit fb7598b into envoyproxy:main Nov 21, 2023
@zirain
Member

zirain commented Nov 21, 2023

@alyssawilk need backport?

@alyssawilk
Contributor

I'd say if this fixes Istio's issue, then yes. They said they'd confirm this or next week.



Development

Successfully merging this pull request may close these issues.

debug assertion failure: assert failure: isIdleImpl()

6 participants