
Conversation

@Benehiko
Member

@Benehiko Benehiko commented Feb 15, 2024

- What I did

Create a receive-only channel of *net.UnixConn, forcing the consumer to wait for listener.AcceptUnix() to return a new connection before it can be used. We don't actually use the returned connection much outside of the tests, only when the CLI receives a termination signal.

The race condition also seemed to affect only tests that were accessing the net.UnixConn instance before it was set, or possibly while it was being set. Since there was no proper mechanism for waiting for the connection to be set, we ended up with data races.

See the Go code below for an illustration.

```go
var conn *net.UnixConn
// sets up an infinite accept loop in the background
listener, err := SetupConn(&conn)

// poll the conn for non-nil
pollConnNotNil(t, conn) // <-- can succeed or fail here

// closes the connection
conn.Close() // <-- can succeed or fail here
```

The `Close` that races is the one in the Go standard library (`net/net.go`):

```go
// Close closes the connection.
func (c *conn) Close() error {
	if !c.ok() {
		return syscall.EINVAL
	}
	err := c.fd.Close() // <--- fails
	if err != nil {
		err = &OpError{Op: "close", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}
	}
	return err
}
```
```
go test -v -race .
---
WARNING: DATA RACE
Read at 0x00c00013322c by goroutine 16:
  internal/poll.(*FD).Close()
      /usr/lib/go/src/internal/poll/fd_unix.go:112 +0x98
  net.(*netFD).Close()
      /usr/lib/go/src/net/fd_posix.go:37 +0x3d
  net.(*conn).Close()
      /usr/lib/go/src/net/net.go:203 +0x6a
  github.com/docker/cli/cli-plugins/socket.TestConnectAndWait.func1()
      /home/benehiko/go/src/github.com/docker/cli/cli-plugins/socket/socket_test.go:107 +0x2dd
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1595 +0x261
  testing.(*T).Run.func1()
      /usr/lib/go/src/testing/testing.go:1648 +0x44
```

- How I did it
Return a receive-only channel that delivers the net.UnixConn instance once the listener has accepted a connection.
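
A minimal sketch of the idea, with hypothetical names (SetupConn here is illustrative, not the exact docker/cli signature):

```go
package socketsketch

import "net"

// SetupConn is a hypothetical sketch: it accepts connections in the
// background and returns a receive-only channel, so the consumer can only
// obtain the *net.UnixConn after AcceptUnix has actually returned.
func SetupConn(l *net.UnixListener) <-chan *net.UnixConn {
	ch := make(chan *net.UnixConn)
	go func() {
		defer close(ch)
		for {
			conn, err := l.AcceptUnix()
			if err != nil {
				return // stop accepting on error
			}
			ch <- conn // the consumer receives only after the accept succeeded
		}
	}()
	return ch
}
```

The consumer then blocks on `conn := <-connCh`, so it can never observe a nil or half-initialized connection.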

- How to verify it
CI should no longer fail on these tests:

```
--- FAIL: TestSetupConn (0.00s)
    --- FAIL: TestSetupConn/allows_reconnects (0.00s)
--- FAIL: TestConnectAndWait (0.00s)
    --- FAIL: TestConnectAndWait/connect_goroutine_exits_after_EOF (0.00s)
```

Locally I can no longer reproduce a race condition for these tests:

```
> ~/g/s/g/d/c/c/socket on fix-socket-concurrency ⨯ go test -v -race ./...  11:55:12
=== RUN   TestSetupConn
=== RUN   TestSetupConn/updates_conn_when_connected
=== RUN   TestSetupConn/allows_reconnects
=== RUN   TestSetupConn/does_not_leak_sockets_to_local_directory
--- PASS: TestSetupConn (0.00s)
    --- PASS: TestSetupConn/updates_conn_when_connected (0.00s)
    --- PASS: TestSetupConn/allows_reconnects (0.00s)
    --- PASS: TestSetupConn/does_not_leak_sockets_to_local_directory (0.00s)
=== RUN   TestConnectAndWait
=== RUN   TestConnectAndWait/calls_cancel_func_on_EOF
=== RUN   TestConnectAndWait/connect_goroutine_exits_after_EOF
--- PASS: TestConnectAndWait (0.00s)
    --- PASS: TestConnectAndWait/calls_cancel_func_on_EOF (0.00s)
    --- PASS: TestConnectAndWait/connect_goroutine_exits_after_EOF (0.00s)
PASS
ok  	github.com/docker/cli/cli-plugins/socket	1.019s
```

- Description for the changelog

The socket.go implementation introduced race conditions on the net.UnixConn instance passed to the SetupConn function; these are now resolved.

- A picture of a cute animal (not mandatory but encouraged)

(A picture of a kitten: https://www.pickpik.com/kitten-cat-cute-kitten-kitty-cute-cat-curious-135563)

@thaJeztah
Member

Yeah, we should probably look into whether we indeed need the for-loop and, if we do, whether we can terminate it early, or under what conditions it should be terminated.

Member

@thaJeztah thaJeztah left a comment


just quick blurbs; would have to take a closer look at this, but perhaps @laurazard is able to 🤗

@laurazard
Collaborator

Implementation looks good, but I do think we might be able to just remove the loop if that's what's causing the issue. IIRC (as discussed a bit in #4872 (comment)) it was a deliberate choice at the time (it sounded more predictable) to let plugins redial the socket. However, in the meantime we've also made some general changes to how we handle the sockets, and have a different implementation on macOS/FreeBSD vs. other platforms, etc., so I think it might just be simpler to remove it if we can.

We could try to remove it, build the CLI and then run the Compose tests against it/test the changes to see if we break anything, to get some signal back.


Also, I'm wondering if we could run the tests with the race detector in CI too; it might have caught the other issue earlier – wdyt @thaJeztah?

@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from f5f8f86 to 561eef0 Compare February 15, 2024 16:35
Member Author

@Benehiko Benehiko left a comment


I'm still busy with the PR. I found that some of the tests are still flaky (10ms timeouts being reached).

@Benehiko
Member Author

> We could try to remove it, build the CLI and then run the Compose tests against it/test the changes to see if we break anything, to get some signal back.

I think it's okay to leave the loop in; since we now have the net.UnixConn wrapped and properly do locking/unlocking, it should be fine :)

@cpuguy83
Collaborator

I don't know this bit of code at all.
But this for loop looks really bad.
I would be in favor of doing this right rather than adding more stuff to deal with the bad implementation.

@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from 561eef0 to 504dacc Compare February 16, 2024 09:38
@codecov-commenter

codecov-commenter commented Feb 16, 2024

Codecov Report

Merging #4878 (13c6b68) into master (3b5e814) will decrease coverage by 0.02%.
Report is 36 commits behind head on master.
The diff coverage is 57.89%.

❗ Current head 13c6b68 differs from pull request most recent head e1a028c. Consider uploading reports for the commit e1a028c to get more accurate results.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #4878      +/-   ##
==========================================
- Coverage   61.32%   61.30%   -0.02%     
==========================================
  Files         287      287              
  Lines       20063    20068       +5     
==========================================
  Hits        12303    12303              
- Misses       6867     6870       +3     
- Partials      893      895       +2     
```

@Benehiko
Member Author

> I don't know this bit of code at all. But this for loop looks really bad. I would be in favor of doing this right rather than adding more stuff to deal with the bad implementation.

So it seems the reason for the infinite loop is to accept reconnects. I refactored this to return a receive-only channel, which forces the consumer to wait for the connection to actually be made before it tries to access the net.UnixConn instance.
https://github.com/docker/cli/blob/master/cli-plugins/socket/socket_test.go#L31-L46

@Benehiko Benehiko requested a review from thaJeztah February 16, 2024 09:55
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from 504dacc to e531096 Compare February 16, 2024 11:56
Collaborator

@laurazard laurazard left a comment


Overall, I think we're getting caught up in these discussions over two different things – one is the issue with the loop around accepting the connection, which is important to fix but also orthogonal to the race conditions these changes try to fix (which we should also try to fix).

Re: the race condition and addressing them, both of the approaches (the one with the lock and the one with the channel) would do, but I think the channel approach will lead to more complex code (and timeouts, etc.) that we should avoid. We could easily do the same with a wrapper around net.UnixConn with a mutex (your other approach), and also probably use TryLock in the hot path when we need to check whether we can use the connection or not to signal the plugin to exit.
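
For reference, a rough sketch of what the mutex-wrapper idea could look like (guardedConn and its methods are hypothetical names, not the actual implementation):

```go
package socketsketch

import (
	"net"
	"sync"
)

// guardedConn is a hypothetical wrapper: the connection is only reachable
// while holding the mutex.
type guardedConn struct {
	mu   sync.Mutex
	conn *net.UnixConn
}

func (g *guardedConn) set(c *net.UnixConn) {
	g.mu.Lock()
	g.conn = c
	g.mu.Unlock()
}

// tryClose uses TryLock in the hot path: if the lock is currently contended,
// it skips instead of blocking.
func (g *guardedConn) tryClose() bool {
	if !g.mu.TryLock() {
		return false
	}
	defer g.mu.Unlock()
	if g.conn != nil {
		_ = g.conn.Close()
	}
	return true
}
```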

The issue with the loop, however, is still there, since in both of these approaches you're trying to preserve the "allow reconnects" behavior. The loop is dangerous in its current form because if anything causes the accept to continuously fail, we'll enter a tight loop there, which we really don't want. To solve that, we either need to:

  • only retry accepting the connection when the first accept succeeded
  • only retry when the first accept succeeded or there was an error in some defined list of "okay errors that we know of"
  • some retry limit
  • remove the loop altogether

I'm in favour of either the first or last option (see the sketch of the first option below).
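
A possible shape of the first option, with illustrative names rather than the actual docker/cli code:

```go
package socketsketch

import "net"

// acceptLoop keeps accepting new connections only while accepts keep
// succeeding; the first accept error terminates the loop instead of
// letting it spin as a tight loop.
func acceptLoop(l *net.UnixListener, handle func(*net.UnixConn)) {
	for {
		conn, err := l.AcceptUnix()
		if err != nil {
			return
		}
		handle(conn)
	}
}
```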

@Benehiko Benehiko force-pushed the fix-socket-concurrency branch 2 times, most recently from 44328fd to 1c44e04 Compare February 20, 2024 13:59
@Benehiko Benehiko requested a review from laurazard February 20, 2024 14:08
neersighted
neersighted previously approved these changes Feb 20, 2024
Member

@neersighted neersighted left a comment


Overall LGTM with the one nit @laurazard pointed out.

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from 1c44e04 to c920d2d Compare February 20, 2024 17:16
Collaborator

@laurazard laurazard left a comment


We're changing the contract here from "always allow redialing" to "allow redialing only when the first attempt failed", which should have been caught by our tests but wasn't.

The test wasn't testing that we could actually use the connection after redialing, so when this changed, we missed it. Before, we stayed in the loop, and if we accepted another connection, we'd update the pointer to the new connection.
Now, if we wanted to keep accepting connections, we'd have to a) keep the loop open and b) send on that channel again so the consumer can make sure they're using the "latest" connection.
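
A possible shape of that consumer side, sketched with hypothetical names (connCh is assumed to be the receive-only channel from the setup code):

```go
package socketsketch

import "net"

// trackLatest keeps receiving from the channel so the consumer always holds
// the most recently accepted connection, closing the previous one when a
// new connection arrives.
func trackLatest(connCh <-chan *net.UnixConn) {
	var current *net.UnixConn
	for conn := range connCh {
		if current != nil {
			_ = current.Close()
		}
		current = conn
	}
}
```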

This just makes me lean harder towards removing the loop/"allow redialing" behavior – this piece of code/functionality is obviously not easy to get right, and the benefits we are currently getting out of it (none, afaict) do not outweigh the complexity it adds to the code.

@cpuguy83
Collaborator

Why do we treat "redial" as anything different than "a new connection that needs to be handled" and implement the handler per connection?

@Benehiko
Member Author

> Why do we treat "redial" as anything different than "a new connection that needs to be handled" and implement the handler per connection?

This PR currently implements redial in this way. With this implementation we would have redial support only when the dial errors for some reason (e.g. a timeout). Any new connections won't be able to communicate with the CLI, since the channel closes and the goroutine returns, which requires a new connection setup.

Right now the tests, and @laurazard, are implying that reconnect means the same connection instance is being reused.

For example:
The CLI has a listener with a single connection instance, conn.
On a dial to the CLI, conn can now communicate with this connection.
On a second dial to the CLI, conn is now communicating with the second connection, replacing the first, and so on.

I'll just remove this behavior and remove the reconnect tests, as @laurazard suggested.

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko dismissed laurazard’s stale review February 22, 2024 10:40

refactored the implementation

@Benehiko Benehiko requested a review from neersighted February 22, 2024 10:40
@Benehiko Benehiko dismissed neersighted’s stale review February 22, 2024 10:43

the implementation has been refactored

@cpuguy83
Collaborator

@Benehiko My point is it is not behaving this way.
Even before this change it is treating a redial as some special condition.
But this should be handled not much differently than an http handler.

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko
Member Author

> @Benehiko My point is it is not behaving this way. Even before this change it is treating a redial as some special condition. But this should be handled not much differently than an http handler.

I suppose so; right now I think that would be out of the scope of this PR.

Contributor

@krissetto krissetto left a comment


For now LGTM. If you want, we can create a tracking issue for making the general behavior more HTTP-handler-like, as @cpuguy83 was mentioning.

Member

@neersighted neersighted left a comment


One nit, overall LGTM

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from cef877e to e1a028c Compare March 1, 2024 12:06
@Benehiko Benehiko requested a review from neersighted March 1, 2024 13:55
@Benehiko
Member Author

Benehiko commented Mar 4, 2024

Closing this in favor of #4905

@Benehiko Benehiko closed this Mar 4, 2024