Skip to content

ZOOKEEPER-4293: Lock Contention in ClientCnxnSocketNetty (3.5.x)#1713

Closed
MikeEdgar wants to merge 2 commits intoapache:branch-3.5from
MikeEdgar:ZOOKEEPER-4293-3.5
Closed

ZOOKEEPER-4293: Lock Contention in ClientCnxnSocketNetty (3.5.x)#1713
MikeEdgar wants to merge 2 commits intoapache:branch-3.5from
MikeEdgar:ZOOKEEPER-4293-3.5

Conversation

@MikeEdgar
Copy link
Contributor

Check for failed/cancelled ChannelFutures before acquiring connectLock. This prevents lock contention where cleanup is unable to complete.

Signed-off-by: Michael Edgar medgar@redhat.com

@MikeEdgar
Copy link
Contributor Author

@maoling, would you mind doing a review of this change?

It resolves a lock contention situation where a slow connection allows the client's SendThread to enter cleanUp in ClientCnxnSocketNetty just as the epollEventLoopGroup thread is calling the ChannelFutureListener. The send thread is able to cancel the connectFuture, however it is unable to proceed beyond closing the channel with the epollEventLoopGroup thread waiting on the lock.

The fix is to allow the epollEventLoopGroup thread to detect the cancellation without acquiring the lock.

@maoling maoling requested a review from anmolnar June 17, 2021 11:33
@maoling
Copy link
Member

maoling commented Jun 17, 2021

@MikeEdgar I will take a look. Thanks for your contribution.

@MikeEdgar
Copy link
Contributor Author

Thanks @maoling, @anmolnar. I would be grateful for a sanity check on the approach taken in this PR.

@MikeEdgar
Copy link
Contributor Author

Hi @maoling, @anmolnar - I'm curious if you've had an opportunity to review these changes.

@maoling
Copy link
Member

maoling commented Jul 23, 2021

@MikeEdgar Sorry for our late, I'll take a look at this weekend. Thanks for your reminder

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a way to test this fix with an unit test ? (even with Mockito)

@MikeEdgar
Copy link
Contributor Author

do we have a way to test this fix with an unit test ? (even with Mockito)

I am open to suggestions on an approach... Since this is dealing with thread timing, my solution to test it during development was with the debugger and breakpoints to simulate latency, which I understand is not ideal. Are there any other test cases that might deal with similar concepts?

MikeEdgar and others added 2 commits July 13, 2022 06:41
Check for failed/cancelled ChannelFutures before acquiring `connectLock`

Signed-off-by: Michael Edgar <medgar@redhat.com>
@MikeEdgar MikeEdgar force-pushed the ZOOKEEPER-4293-3.5 branch from 931d28a to 99b7b00 Compare July 13, 2022 10:42
@MikeEdgar MikeEdgar requested a review from eolivelli July 13, 2022 10:43
@MikeEdgar
Copy link
Contributor Author

@eolivelli , @maoling - I've finally added some tests with mocked dependencies. Please take a look. Also, this is for the 3.5 branch - should I instead replace the PR with one for master? What is the process for back-ports to 3.6, 3.7, etc.?

@MikeEdgar
Copy link
Contributor Author

@maoling - please let me know if this PR is OK as-is or if it should target master (or another release version) instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants