fix: concurrency safe net.UnixConn
#4878
Conversation
d0502a2 to f5f8f86 (Compare)
Yeah, we should probably look at whether we indeed need the for-loop and, if we do, whether we can terminate it early, or under what conditions it should be terminated.
thaJeztah left a comment
Just quick blurbs; I would have to take a closer look at this, but perhaps @laurazard is able to 🤗
Implementation looks good, but I do think we might be able to just remove the loop if that's what's causing the issue. IIRC (as discussed a bit in #4872 (comment)) it was a choice at the time (it sounded more predictable) to let plugins redial the socket. However, in the meantime we've also made some general changes to how we handle the sockets, and have a different implementation on macOS/FreeBSD vs. other platforms, so I think it might just be simpler to remove it if we can. We could try to remove it, build the CLI, and then run the Compose tests against it to see whether we break anything and get some signal back.
Also, I'm wondering if we could be running tests with the race detector in CI too; maybe that could have caught the other issue earlier. Wdyt @thaJeztah?
f5f8f86 to 561eef0 (Compare)
Benehiko left a comment
I'm still busy with the PR. I found some of the tests to still be flaky (10ms timeouts being reached).
I think it's okay to leave in the loop, since we now have the […]

I don't know this bit of code at all.
561eef0 to 504dacc (Compare)
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4878      +/-   ##
==========================================
- Coverage   61.32%   61.30%   -0.02%
==========================================
  Files         287      287
  Lines       20063    20068       +5
==========================================
  Hits        12303    12303
- Misses       6867     6870       +3
- Partials      893      895       +2
So it seems the reason why there is an infinite loop is to accept reconnects. I refactored this to return a receive-only channel. This forces the consumer to wait for the connection to actually be made before it tries to access the net.UnixConn instance.
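For context, a minimal sketch of the channel shape described here, assuming a plain net.UnixListener; the function and channel names are illustrative, not the PR's actual code:

```go
package sketch

import "net"

// acceptAsync accepts on a separate goroutine and hands the result to the
// consumer through a receive-only channel, so the consumer cannot touch the
// *net.UnixConn before AcceptUnix has actually returned.
func acceptAsync(l *net.UnixListener) <-chan *net.UnixConn {
	connCh := make(chan *net.UnixConn, 1)
	go func() {
		defer close(connCh)
		conn, err := l.AcceptUnix()
		if err != nil {
			return // channel is closed without a value when the accept fails
		}
		connCh <- conn
	}()
	return connCh
}
```

A consumer then blocks on `conn, ok := <-connCh` instead of polling a shared pointer, which is what removes the unsynchronized access.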
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
504dacc to e531096 (Compare)
laurazard left a comment
Overall, I think we're getting caught up in these discussions over two different things: one is the issue with the loop around accepting the connection, which is important to fix but also orthogonal to the race conditions these changes try to fix (which we should also address).
Re: the race conditions and addressing them, both of the approaches (the one with the lock and the one with the channel) would do, but I think the channel approach will lead to more complex code (timeouts, etc.) that we should avoid. We could easily do the same with a wrapper around net.UnixConn with a mutex (your other approach), and probably also use TryLock in the hot path when we need to check whether we can use the connection to signal the plugin to exit.
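As a rough illustration of the mutex-wrapper alternative mentioned above (the type and method names are invented for this sketch, and sync.Mutex.TryLock requires Go 1.18+):

```go
package sketch

import (
	"net"
	"sync"
)

// guardedConn wraps the connection with a mutex instead of a channel.
type guardedConn struct {
	mu   sync.Mutex
	conn *net.UnixConn
}

// set is called by the accept goroutine once a connection exists.
func (g *guardedConn) set(c *net.UnixConn) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.conn = c
}

// tryClose is the hot-path check: if the lock is contended, skip this round
// instead of blocking; otherwise close the connection to signal the plugin
// to exit.
func (g *guardedConn) tryClose() bool {
	if !g.mu.TryLock() {
		return false
	}
	defer g.mu.Unlock()
	if g.conn == nil {
		return false
	}
	return g.conn.Close() == nil
}
```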
The issue with the loop, however, is still there, since in both of these approaches you're trying to preserve the "allow reconnects" behavior. The loop is dangerous in its current form because if anything causes the accept to continuously fail, we'll enter a tight loop there, which we really don't want. To solve that, we either need to:
- only retry accepting the connection when the first accept succeeded
- only retry when the first accept succeeded or there was an error in some defined list of "okay errors that we know of"
- set some retry limit
- remove the loop altogether
I'm in favour of either the first or the last option.
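A sketch of what the first option could look like: retries only happen while accepts keep succeeding, so a persistent failure can never turn into a tight loop. This is illustrative, not the code in this PR:

```go
package sketch

import "net"

// acceptWhileHealthy fails fast if the very first accept errors; after a
// successful accept it keeps accepting (allowing redials) until the first
// error, at which point it stops instead of spinning.
func acceptWhileHealthy(l *net.UnixListener, handle func(*net.UnixConn)) error {
	first, err := l.AcceptUnix()
	if err != nil {
		return err // first accept failed: never enter the loop
	}
	go handle(first)
	for {
		conn, err := l.AcceptUnix()
		if err != nil {
			return err // stop retrying on the first error after a success
		}
		go handle(conn)
	}
}
```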
44328fd to 1c44e04 (Compare)
neersighted left a comment
Overall LGTM with the one nit @laurazard pointed out.
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
1c44e04 to c920d2d (Compare)
laurazard left a comment
We're changing the contract here from "always allow redialing" to "allow redialing only when the first attempt failed", which should have been caught by our tests but wasn't.
The test wasn't checking that we could actually use the connection after redialing, so when this behavior changed, we missed it. Before, we stayed in the loop, and if we accepted another connection, we'd update the pointer to point at the new connection.
Now, if we wanted to keep accepting connections, we'd have to a) keep the loop open and b) send on that channel again so the consumer can make sure they're using the "latest" connection.
This just makes me lean harder towards removing the loop/redialing behavior: this piece of code/functionality is obviously not easy to get right, and the benefits we are currently getting out of it (none, afaict) do not outweigh the complexity it adds to the code.
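For illustration, keeping the redial behaviour on top of the channel approach would require roughly the following (a sketch only; the names are invented), which is the extra consumer-side complexity being argued against:

```go
package sketch

import "net"

// republishConns keeps the accept loop open and sends every newly accepted
// connection to the consumer, who must keep receiving from the channel to
// be sure it is holding the "latest" connection.
func republishConns(l *net.UnixListener) <-chan *net.UnixConn {
	connCh := make(chan *net.UnixConn)
	go func() {
		defer close(connCh)
		for {
			conn, err := l.AcceptUnix()
			if err != nil {
				return
			}
			connCh <- conn // blocks until the consumer picks it up
		}
	}()
	return connCh
}
```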
Why do we treat "redial" as anything different than "a new connection that needs to be handled" and implement the handler per connection?
This PR currently implements redial in this way: we would have redial support only when the dial errors for some reason (e.g. a timeout); any new connections won't be able to communicate with the […]. I'll just remove this behavior and remove the reconnect tests, as @laurazard suggested.
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
The implementation has been refactored.
@Benehiko My point is that it is not behaving this way.
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
I suppose so; right now I think that would be out of scope for this PR.
krissetto left a comment
For now, LGTM. If you want, we can create a tracking issue for making the general behavior more HTTP-handler-like, as @cpuguy83 was mentioning.
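If that tracking issue happens, the handler-per-connection shape @cpuguy83 described would presumably be a plain accept-and-serve loop, similar to how net/http serves each incoming connection. A sketch with assumed names, not an agreed design:

```go
package sketch

import "net"

// serve treats every accepted connection the same way: there is no special
// "redial" case, each connection simply gets its own handler goroutine.
func serve(l *net.UnixListener, handler func(*net.UnixConn)) error {
	for {
		conn, err := l.AcceptUnix()
		if err != nil {
			return err
		}
		go handler(conn) // one handler per connection; redialing is not special
	}
}
```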
neersighted left a comment
One nit, overall LGTM
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
cef877e to e1a028c (Compare)
Closing this in favor of #4905
- What I did
Create a *net.UnixConn receive-only channel, forcing the consumer to wait for listener.AcceptUnix() to return a new connection before it can be used. We don't actually use the returned connection much outside of the tests, only when the CLI receives a termination signal.
The race condition also seemed to only affect tests that were accessing the net.UnixConn instance before it was set, or maybe even while it was being set. Since there was no proper mechanism for waiting for the connection to be set, we ended up with race conditions. See below for some Go code explaining this.
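(The Go snippet this sentence refers to did not survive extraction; the sketch below is a reconstruction of the general idea only, not the author's exact code.)

```go
package sketch

import "net"

// waitForConn illustrates why the channel removes the race: the consumer can
// only obtain the *net.UnixConn by receiving it, and that receive happens
// after the accept goroutine has sent it, so there is no unsynchronized read
// of a shared field while it is still being set.
func waitForConn(listener *net.UnixListener) (*net.UnixConn, error) {
	type result struct {
		conn *net.UnixConn
		err  error
	}
	resCh := make(chan result, 1)
	go func() {
		conn, err := listener.AcceptUnix()
		resCh <- result{conn: conn, err: err}
	}()
	res := <-resCh // blocks until AcceptUnix has actually returned
	return res.conn, res.err
}
```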
- How I did it
Return a receive-only channel containing the net.UnixConn instance.
- How to verify it
CI should no longer fail on these tests:
--- FAIL: TestSetupConn (0.00s)
    --- FAIL: TestSetupConn/allows_reconnects (0.00s)
--- FAIL: TestConnectAndWait (0.00s)
    --- FAIL: TestConnectAndWait/connect_goroutine_exits_after_EOF (0.00s)
Locally I can no longer reproduce a race condition for these tests.
- Description for the changelog
The socket.go implementation introduced race conditions on the net.UnixConn instance passed to the SetupConn function; this is now resolved.
- A picture of a cute animal (not mandatory but encouraged)
(https://www.pickpik.com/kitten-cat-cute-kitten-kitty-cute-cat-curious-135563)