[agent] Fix session rebuilding against a local dispatcher #2134
Conversation
Codecov Report
    @@            Coverage Diff             @@
    ##           master    #2134      +/-   ##
    ==========================================
    + Coverage   59.78%   60.01%   +0.23%
    ==========================================
      Files         119      119
      Lines       19665    19668       +3
    ==========================================
    + Hits        11756    11804      +48
    + Misses       6576     6532      -44
    + Partials     1333     1332       -1
    // of event loop.
    func (s *session) close() error {
        s.closeOnce.Do(func() {
            if s.cancel != nil {
I don't believe it is possible for s.cancel to be nil. It is filled in when the session is created in newSession.
if we do away with the nil check, it needs to be well-documented in the code that this field can never be nil.
Specifically by well-documented I mean commented.
Removed the check and added the comment to the struct definition.
    subscriptions: make(chan *api.SubscriptionMessage),
    registered:    make(chan struct{}),
    closed:        make(chan struct{}),
    cancel:        sessionCancel,
Maybe (*session).start should use this cancel function instead of forking the context again.
Using this cancel func instead of the forked context's cancel func in start seems like it muddles responsibilities. Unless there's some overhead concern I'm not aware of, this seems fine to me as-is.
I think either would work. If the session fails or times out, the session needs to be closed and restarted anyway, so I think the same session + cancellation can be applicable. I'm happy to go either way.
I went with the same session+cancellation because it seems like newSession is the only thing that calls session.run, which is contingent upon session.start succeeding. Again though, happy to go either way.
Actually @dperny was correct - we do need to fork the context again. If we don't, and we use the same cancel function, then in the case where a session times out:
The select loop in run (https://github.com/docker/swarmkit/pull/2134/files#diff-15ffe95a2da45f70696ffa3c01949601R89) may select ctx.Done() first and not write the error to s.err, in which case the session is not closed and rebuilt.
Is there any way that weird case can be documented in a comment somewhere? It's one of those cases where moving parts in very disparate parts of the system affect each other, and it'll be really non-obvious what's going on here in the future.
@dperny there's quite a long comment in start :)
Oh, I suppose I should look at the diff before I say things. I'm sorry, today is not a 10/10 day for me.
    var localDispatcher = false
    // TestMain runs every test in this file twice - once with a local dispatcher, and
Are there separate code paths for local and remote dispatchers?
I think we are forking the context twice for the same reason.
> Are there separate code paths for local and remote dispatchers?
Sort of, although it's not here. We are adding the extra context so we can cancel it, because closing the connection doesn't work on local connections: connectionbroker ignores closes on local connections, so if we just close the connection, a local session can't actually be restarted. This is sort of a regression test - it fails without the context changes in the rest of this PR.
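The "run every test twice" pattern the comment describes can be sketched as follows. This is a hypothetical shape (a plain main standing in for testing's TestMain, with an illustrative suiteMode helper), not swarmkit's actual test harness; only the localDispatcher flag comes from the diff:

```go
package main

import "fmt"

// localDispatcher mirrors the package-level flag from the diff: the
// whole suite is run once per value.
var localDispatcher = false

// suiteMode stands in for the test suite; it reports which dispatcher
// the current pass exercises.
func suiteMode() string {
	if localDispatcher {
		return "local dispatcher"
	}
	return "remote dispatcher"
}

func main() {
	// TestMain-style double pass: flip the flag, rerun everything,
	// so both the local and remote code paths get covered.
	for _, local := range []bool{false, true} {
		localDispatcher = local
		fmt.Println("running tests against", suiteMode())
	}
}
```

In a real test file this would live in TestMain, calling m.Run() once per flag value and exiting on the first nonzero code.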
LGTM, regardless of the answer to
Force-pushed from c231906 to e86620e.
LGTM
(there seem to be some non-spurious integration test failures - am tracking them down)
Force-pushed from e86620e to 6b6a8e4.
(am going to try to write an agent test to make sure that on failure it always reconnects, after that this should be ready to go)
Force-pushed from 6b6a8e4 to 8849f8a.
I've added a test that tends to fail with the bug that @dperny found. PTAL
    defer anotherDispatcher.SetSessionHandler(nil)
    select {
    case <-stream.Context().Done():
    }
nit: replace the select with <-stream.Context().Done()
Force-pushed from 8849f8a to 51309e8.
Just checking to see if this is mergeable?
Thank you!
Connections to a local dispatcher can’t really be closed, so a session can’t really be restarted because closing a session just closes the connection. When this happens, it just starts up another session without closing the previous one. Since we need to restart a session to push new TLS data up to the dispatcher from the agent, change "closing" a session to mean first shutting down all the clients with a context.cancel before closing the connection.
Signed-off-by: cyli <ying.li@docker.com>