
Conversation

@Benehiko
Member

@Benehiko Benehiko commented Feb 15, 2024

- What I did

Create a receive-only channel of *net.UnixConn, forcing the consumer to wait for listener.AcceptUnix() to return a new connection before it can be used. We don't actually use the returned connection much outside of the tests, only when the CLI receives a termination signal.

The race condition also seemed to affect only tests that were accessing the net.UnixConn instance before it was set, or possibly while it was being set. Since there was no proper mechanism for waiting for the connection to be set, we ended up with data races.

See the Go code below for an illustration.

```go
var conn *net.UnixConn
// sets up an infinite accept loop in the background
listener, err := SetupConn(&conn)

// poll the conn for non-nil
pollConnNotNil(t, conn) // <-- can succeed or fail here

// closes the connection
conn.Close() // <-- can succeed or fail here
```

The `Close` that races is the one in the Go standard library (`net/net.go`):

```go
// Close closes the connection.
func (c *conn) Close() error {
	if !c.ok() {
		return syscall.EINVAL
	}
	err := c.fd.Close() // <--- fails
	if err != nil {
		err = &OpError{Op: "close", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}
	}
	return err
}
```
```
go test -v -race .
---
WARNING: DATA RACE
Read at 0x00c00013322c by goroutine 16:
  internal/poll.(*FD).Close()
      /usr/lib/go/src/internal/poll/fd_unix.go:112 +0x98
  net.(*netFD).Close()
      /usr/lib/go/src/net/fd_posix.go:37 +0x3d
  net.(*conn).Close()
      /usr/lib/go/src/net/net.go:203 +0x6a
  github.com/docker/cli/cli-plugins/socket.TestConnectAndWait.func1()
      /home/benehiko/go/src/github.com/docker/cli/cli-plugins/socket/socket_test.go:107 +0x2dd
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1595 +0x261
  testing.(*T).Run.func1()
      /usr/lib/go/src/testing/testing.go:1648 +0x44
```

- How I did it
Return a receive-only channel that delivers the net.UnixConn instance once the listener has accepted a connection.
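
A minimal sketch of the idea, with hypothetical names (SetupConn here is illustrative, not the exact docker/cli signature):

```go
package socketsketch

import "net"

// SetupConn is a hypothetical sketch: it accepts connections in the
// background and returns a receive-only channel, so the consumer can only
// obtain the *net.UnixConn after AcceptUnix has actually returned.
func SetupConn(l *net.UnixListener) <-chan *net.UnixConn {
	ch := make(chan *net.UnixConn)
	go func() {
		defer close(ch)
		for {
			conn, err := l.AcceptUnix()
			if err != nil {
				return // stop accepting on error
			}
			ch <- conn // the consumer receives only after the accept succeeded
		}
	}()
	return ch
}
```

The consumer then blocks on `conn := <-connCh`, so it can never observe a nil or half-initialized connection.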

- How to verify it
CI should no longer fail on these tests:

```
--- FAIL: TestSetupConn (0.00s)
    --- FAIL: TestSetupConn/allows_reconnects (0.00s)
--- FAIL: TestConnectAndWait (0.00s)
    --- FAIL: TestConnectAndWait/connect_goroutine_exits_after_EOF (0.00s)
```

Locally I can no longer reproduce a race condition for these tests:

```
> ~/g/s/g/d/c/c/socket on fix-socket-concurrency ⨯ go test -v -race ./...  11:55:12
=== RUN   TestSetupConn
=== RUN   TestSetupConn/updates_conn_when_connected
=== RUN   TestSetupConn/allows_reconnects
=== RUN   TestSetupConn/does_not_leak_sockets_to_local_directory
--- PASS: TestSetupConn (0.00s)
    --- PASS: TestSetupConn/updates_conn_when_connected (0.00s)
    --- PASS: TestSetupConn/allows_reconnects (0.00s)
    --- PASS: TestSetupConn/does_not_leak_sockets_to_local_directory (0.00s)
=== RUN   TestConnectAndWait
=== RUN   TestConnectAndWait/calls_cancel_func_on_EOF
=== RUN   TestConnectAndWait/connect_goroutine_exits_after_EOF
--- PASS: TestConnectAndWait (0.00s)
    --- PASS: TestConnectAndWait/calls_cancel_func_on_EOF (0.00s)
    --- PASS: TestConnectAndWait/connect_goroutine_exits_after_EOF (0.00s)
PASS
ok  	github.com/docker/cli/cli-plugins/socket	1.019s
```

- Description for the changelog

The socket.go implementation introduced race conditions on the net.UnixConn instance passed to the SetupConn function; these are now resolved.

- A picture of a cute animal (not mandatory but encouraged)

(A picture of a kitten: https://www.pickpik.com/kitten-cat-cute-kitten-kitty-cute-cat-curious-135563)

@thaJeztah
Member

Yeah, we should probably look into whether we indeed need the for-loop and, if we do, whether we can terminate it early, or under what conditions it should be terminated.

Member

@thaJeztah thaJeztah left a comment


just quick blurbs; would have to take a closer look at this, but perhaps @laurazard is able to 🤗

@laurazard
Collaborator

Implementation looks good, but I do think we might be able to just remove the loop if that's what's causing the issue. IIRC (as discussed a bit in #4872 (comment)) it was a deliberate choice at the time (it sounded more predictable) to let plugins redial the socket. However, in the meantime we've also made some general changes to how we handle the sockets, and have a different implementation on macOS/FreeBSD vs. other platforms, etc., so I think it might just be simpler to remove it if we can.

We could try to remove it, build the CLI and then run the Compose tests against it/test the changes to see if we break anything, to get some signal back.


Also, I'm wondering if we could run the tests with the race detector in CI too; it might have caught the other issue earlier – wdyt @thaJeztah?

@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from f5f8f86 to 561eef0 Compare February 15, 2024 16:35
Member Author

@Benehiko Benehiko left a comment


I'm still busy with the PR. I found that some of the tests are still flaky (10ms timeouts being reached).

@Benehiko
Member Author

> We could try to remove it, build the CLI and then run the Compose tests against it/test the changes to see if we break anything, to get some signal back.

I think it's okay to leave the loop in; since we now have the net.UnixConn wrapped and properly do locking/unlocking, it should be fine :)

@cpuguy83
Collaborator

I don't know this bit of code at all.
But this for loop looks really bad.
I would be in favor of doing this right rather than adding more stuff to deal with the bad implementation.

@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from 561eef0 to 504dacc Compare February 16, 2024 09:38
@codecov-commenter

codecov-commenter commented Feb 16, 2024

Codecov Report

Merging #4878 (13c6b68) into master (3b5e814) will decrease coverage by 0.02%.
Report is 36 commits behind head on master.
The diff coverage is 57.89%.

❗ Current head 13c6b68 differs from pull request most recent head e1a028c. Consider uploading reports for the commit e1a028c to get more accurate results.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #4878      +/-   ##
==========================================
- Coverage   61.32%   61.30%   -0.02%     
==========================================
  Files         287      287              
  Lines       20063    20068       +5     
==========================================
  Hits        12303    12303              
- Misses       6867     6870       +3     
- Partials      893      895       +2     
```

@Benehiko
Member Author

> I don't know this bit of code at all. But this for loop looks really bad. I would be in favor of doing this right rather than adding more stuff to deal with the bad implementation.

So it seems the reason for the infinite loop is to accept reconnects. I refactored this to return a receive-only channel, which forces the consumer to wait for the connection to actually be made before it tries to access the net.UnixConn instance.
https://github.com/docker/cli/blob/master/cli-plugins/socket/socket_test.go#L31-L46

@Benehiko Benehiko requested a review from thaJeztah February 16, 2024 09:55
Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from 504dacc to e531096 Compare February 16, 2024 11:56
Collaborator

@laurazard laurazard left a comment


Overall, I think we're getting caught up in these discussions over two different things – one is the issue with the loop around accepting the connection, which is important to fix but also orthogonal to the race conditions these changes try to fix (which we should also try to fix).

Re: the race condition and addressing them, both of the approaches (the one with the lock and the one with the channel) would do, but I think the channel approach will lead to more complex code (and timeouts, etc.) that we should avoid. We could easily do the same with a wrapper around net.UnixConn with a mutex (your other approach), and also probably use TryLock in the hot path when we need to check whether we can use the connection or not to signal the plugin to exit.
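
For reference, a rough sketch of what the mutex-wrapper idea could look like (guardedConn and its methods are hypothetical names, not the actual implementation):

```go
package socketsketch

import (
	"net"
	"sync"
)

// guardedConn is a hypothetical wrapper: the connection is only reachable
// while holding the mutex.
type guardedConn struct {
	mu   sync.Mutex
	conn *net.UnixConn
}

func (g *guardedConn) set(c *net.UnixConn) {
	g.mu.Lock()
	g.conn = c
	g.mu.Unlock()
}

// tryClose uses TryLock in the hot path: if the lock is currently contended,
// it skips instead of blocking.
func (g *guardedConn) tryClose() bool {
	if !g.mu.TryLock() {
		return false
	}
	defer g.mu.Unlock()
	if g.conn != nil {
		_ = g.conn.Close()
	}
	return true
}
```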

The issue with the loop, however, is still there, since in both of these approaches you're trying to preserve the "allow reconnects" behavior. The loop is dangerous in its current form because if anything causes the accept to continuously fail, we'll enter a tight loop there, which we really don't want. To solve that, we either need to:

  • only retry accepting the connection when the first accept succeeded
  • only retry when the first accept succeeded or there was an error in some defined list of "okay errors that we know of"
  • some retry limit
  • remove the loop altogether

I'm in favour of either the first or last option (see the sketch of the first option below).
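
A possible shape of the first option, with illustrative names rather than the actual docker/cli code:

```go
package socketsketch

import "net"

// acceptLoop keeps accepting new connections only while accepts keep
// succeeding; the first accept error terminates the loop instead of
// letting it spin as a tight loop.
func acceptLoop(l *net.UnixListener, handle func(*net.UnixConn)) {
	for {
		conn, err := l.AcceptUnix()
		if err != nil {
			return
		}
		handle(conn)
	}
}
```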

@Benehiko Benehiko force-pushed the fix-socket-concurrency branch 2 times, most recently from 44328fd to 1c44e04 Compare February 20, 2024 13:59
@Benehiko Benehiko requested a review from laurazard February 20, 2024 14:08
neersighted
neersighted previously approved these changes Feb 20, 2024
Member

@neersighted neersighted left a comment


Overall LGTM with the one nit @laurazard pointed out.

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from 1c44e04 to c920d2d Compare February 20, 2024 17:16
Collaborator

@laurazard laurazard left a comment


We're changing the contract here from "always allow redialing" to "allow redialing only when the first attempt failed", which should have been caught by our tests but wasn't.

The test wasn't testing that we could actually use the connection after redialing, so when this changed, we missed it. Before, we stayed in the loop, and if we accepted another connection, we'd update the pointer to the new connection.
Now, if we wanted to keep accepting connections, we'd have to a) keep the loop open and b) send on that channel again so the consumer can make sure they're using the "latest" connection.
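
A possible shape of that consumer side, sketched with hypothetical names (connCh is assumed to be the receive-only channel from the setup code):

```go
package socketsketch

import "net"

// trackLatest keeps receiving from the channel so the consumer always holds
// the most recently accepted connection, closing the previous one when a
// new connection arrives.
func trackLatest(connCh <-chan *net.UnixConn) {
	var current *net.UnixConn
	for conn := range connCh {
		if current != nil {
			_ = current.Close()
		}
		current = conn
	}
}
```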

This just makes me lean harder towards removing the loop/"allow redialing" behavior – this piece of code/functionality is obviously not easy to get right, and the benefits we are currently getting out of it (none, afaict) do not outweigh the complexity it adds to the code.

@cpuguy83
Collaborator

Why do we treat "redial" as anything different than "a new connection that needs to be handled" and implement the handler per connection?

@Benehiko
Member Author

> Why do we treat "redial" as anything different than "a new connection that needs to be handled" and implement the handler per connection?

This PR currently implements redial in this way. With this implementation we would have redial support only when the dial errors for some reason (e.g. a timeout). Any new connections won't be able to communicate with the CLI, since the channel closes and the goroutine returns, which requires a new connection setup.

Right now the tests, and @laurazard, are implying that reconnect means the same connection instance is being reused.

For example:
The CLI has a listener with a single connection instance, conn.
On a dial to the CLI, conn can now communicate with this connection.
On a second dial to the CLI, conn is now communicating with the second connection, replacing the first, and so on.

I'll just remove this behavior and remove the reconnect tests, as @laurazard suggested.

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko dismissed laurazard’s stale review February 22, 2024 10:40

refactored the implementation

@Benehiko Benehiko requested a review from neersighted February 22, 2024 10:40
@Benehiko Benehiko dismissed neersighted’s stale review February 22, 2024 10:43

the implementation has been refactored

@cpuguy83
Collaborator

@Benehiko My point is it is not behaving this way.
Even before this change it is treating a redial as some special condition.
But this should be handled not much differently than an http handler.

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko
Member Author

> @Benehiko My point is it is not behaving this way. Even before this change it is treating a redial as some special condition. But this should be handled not much differently than an http handler.

I suppose so; right now I think that would be out of the scope of this PR.

Contributor

@krissetto krissetto left a comment


For now LGTM. If you want, we can create a tracking issue for making the general behavior more HTTP-handler-like, as @cpuguy83 was mentioning.

Member

@neersighted neersighted left a comment


One nit, overall LGTM

Signed-off-by: Alano Terblanche <18033717+Benehiko@users.noreply.github.com>
@Benehiko Benehiko force-pushed the fix-socket-concurrency branch from cef877e to e1a028c Compare March 1, 2024 12:06
@Benehiko Benehiko requested a review from neersighted March 1, 2024 13:55
@Benehiko
Member Author

Benehiko commented Mar 4, 2024

Closing this in favor of #4905

@Benehiko Benehiko closed this Mar 4, 2024