upstream: fix for race condition in tcp conn management#30807
upstream: fix for race condition in tcp conn management#30807alyssawilk merged 5 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
@ggreenway can you PTAL? |
ggreenway
left a comment
There was a problem hiding this comment.
Thank you for the detailed description of what's happening!
I think the fix you have here makes sense and is straightforward.
Is it possible to create a test to validate that the idle callback cannot be called multiple times?
Also, can you add a comment with your analysis of this bug next to the fix so a future reader understands the issue? I don't want this knowledge to be lost.
/wait
|
@alyssawilk can you also take a look at this fix and double-check that it is the correct fix? |
|
@ggreenway thanks for reviewing. I will try add a unit test to make sure that callbacks are called exactly once regardless of path taken (Local/RemoteClose) or destructor. |
alyssawilk
left a comment
There was a problem hiding this comment.
LGTM though yeah it'd be good to have a unit test especially as I'm not sure if there's a non-bug situation where it should get called twice so we may just be masking a larger issue.
|
/wait on test |
|
The root cause is the same for #22583 and istio/proxy#5151. |
|
I will finish this PR by Monday and we can do final review. |
|
Updated PR description with "Fixes ...". Thanks for linking those issues @kyessenov. |
| cb(); | ||
| } | ||
| // Clear callbacks, so they are not executed if checkForIdleAndNotify is called again. | ||
| idle_callbacks_.clear(); |
There was a problem hiding this comment.
My fix was to put
if (deferred_deleting_) {
return;
}since it was protecting against future accidents. Either way SGTM. Can I suggest to change the debug log to indicate whether it's firing or not? It's useful for debugging.
There was a problem hiding this comment.
Yeah, makes sense. I will change the log.
Updated comments. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
|
I have added new unit test to verify that callbacks are called exactly once regardless whether upstream connection has been fully established (Connected, Local/Remote Reset events) or partially created (ActiveTcp client constructor and destructor have been called only). I think this PR is finished and ready for the final review. |
| concurrent_streams_, false); | ||
|
|
||
| testing::MockFunction<void()> idle_pool_callback; | ||
| EXPECT_CALL(idle_pool_callback, Call()); |
There was a problem hiding this comment.
Just to check this failed befre? not sure if with nice mocks calling twice would pass. If so, LGTM
There was a problem hiding this comment.
No, it did not failed before. This test verifies that callbacks are called when a client does not go through full connect cycle (Connected, Local/Remote Reset). If callbacks have not been called here, it would leave "hanging" conn pool. This is just a safeguard if logic is modified in the future and for some reason callbacks are not called for partially created upstream clients.
Other tests failed because callbacks have been called twice and those tests have been corrected: https://github.com/envoyproxy/envoy/pull/30807/files/a1e37c378b6d16d1323bba6ec61d8fd06800adb5#diff-66377346b7f86ba261b6008401df93fbc31e0036d98f225f7ce1a0ee2198afebL500
alyssawilk
left a comment
There was a problem hiding this comment.
hm, it's a clear improvement and I guess as long as it's tested
|
@ggreenway has my LGTM but auto-merge blocked by your requested changes. resolve when you have a few? |
|
@alyssawilk need backport? |
|
I'd say if this fixes istio's issue then yes. They said they'd confirm this/next week |
Commit Message:
fix for race condition in tcp conn management
Additional Description:
I discovered this race in very simple setup:
2 clients -> [listener->tcp_proxy->cluster->endpoint] -> serverclients runs in a loop and open short lived tcp sessions.
Envoy debug build triggers assertion. Release build does not crash, but as explained below, it may lead to unexpected behaviour.
Here is analysis of the problem. As with any race condition it is hard to describe without going to details.
The root cause of the race is that when upstream tcp session terminates, ConnPoolImplBase::checkForIdleAndNotify is run twice.
It is run first time when Local/RemoteClose happens in https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/conn_pool/conn_pool_base.cc#L549.
ConnPoolImplBase::checkForIdleAndNotify iterates through registered idle_callbacks_. The callback is a lambda with captured host and hash_key which removes the tcp connection pool: source/common/upstream/cluster_manager_impl.cc.
The second time ConnPoolImplBase::checkForIdleAndNotify is called from Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient via ConnPoolImplBase::checkForIdleAndCloseIdleConnsIfDraining. It again will go through registered idle callbacks. It will try to remove the tcp connection pool based on hash_key, but it was already removed when the ConnPoolImplBase::checkForIdleAndNotify was called the first time, so it is no-op.
The problem happens when a new connection comes between first and second call to ConnPoolImplBase::checkForIdleAndNotify. The new connection will allocate a new tcp connection pool with the same hash_key as the one just deleted in the first call to ConnPoolImplBase::checkForIdleAndNotify. Then, the second call to ConnPoolImplBase::checkForIdleAndNotify invokes registered callbacks the second time. The callback basically tries to delete a connection pool indexed by hash_key. It finds the newly allocated session which was created between calls to ConnPoolImplBase::checkForIdleAndNotify.
There are few ways to make sure that callbacks are run only once when upstream tcp session terminates. A boolean flag is one possibility or clearing the list of callbacks after they have been called.
@alyssawilk @ggreenway
Risk Level: Low
Testing: Manual. Before the fix assertion triggered within 5 minutes. After the fix crash does not happen.
Docs Changes: No
Release Notes: No
Platform Specific Features: No
Fixes: #22583