[agent] Fix session rebuilding against a local dispatcher #2134

Merged: aaronlehmann merged 1 commit into moby:master from cyli:fix-reconnecting-local-sessions on May 4, 2017

cyli (Contributor) commented Apr 22, 2017

Connections to a local dispatcher can’t really be closed, so a session can’t really be restarted because closing a session just closes the connection. When this happens, it just starts up another session without closing the previous one.

Since we need to restart a session to push new TLS data up to the dispatcher from the agent, change "closing" a session to mean first shutting down all the clients with a context.cancel before closing the connection.

Signed-off-by: cyli <ying.li@docker.com>

codecov Bot commented Apr 22, 2017

Codecov Report

Merging #2134 into master will increase coverage by 0.23%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2134      +/-   ##
==========================================
+ Coverage   59.78%   60.01%   +0.23%     
==========================================
  Files         119      119              
  Lines       19665    19668       +3     
==========================================
+ Hits        11756    11804      +48     
+ Misses       6576     6532      -44     
+ Partials     1333     1332       -1

Comment thread agent/session.go Outdated
// of event loop.
func (s *session) close() error {
	s.closeOnce.Do(func() {
		if s.cancel != nil {

Collaborator

I don't believe it is possible for s.cancel to be nil. It is filled in when the session is created in newSession.

Collaborator

If we do away with the nil check, it needs to be well-documented in the code that this field can never be nil.

Collaborator

Specifically, by "well-documented" I mean commented.

Contributor Author

Removed the check and added the comment to the struct definition.

Comment thread agent/session.go
subscriptions: make(chan *api.SubscriptionMessage),
registered: make(chan struct{}),
closed: make(chan struct{}),
cancel: sessionCancel,

Collaborator

Maybe (*session).start should use this cancel function instead of forking the context again.

Collaborator

Using this cancelfunc instead of the forked context's cancel func in start seems like it muddles responsibilities. Unless there's some overhead concern I'm not aware of, this seems fine to me as-is.

Contributor Author

I think either would work. If the session fails or times out, the session needs to be closed and restarted anyway, so I think the same session + cancellation can be applicable. I'm happy to go either way.

Contributor Author

I went with the same session+cancellation because it seems like newSession is the only thing that calls session.run, which is contingent upon session.start succeeding. Again though, happy to go either way.

Contributor Author

Actually @dperny was correct - we do need to fork the context again. If we don't, and we use the same cancel function, then in the case where a session times out, the select loop in run (https://github.com/docker/swarmkit/pull/2134/files#diff-15ffe95a2da45f70696ffa3c01949601R89) may select ctx.Done() first and not write the error to s.err, in which case the session is not closed and rebuilt.
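
The race described above can be sketched in isolation. This is a hedged toy model, not the swarmkit code: `failingStart`, `run`, `observedErrors`, and `errSessionTimeout` are hypothetical names, and the real `run` loop handles more cases. It relies only on the documented behavior that a Go `select` with several ready cases picks one pseudo-randomly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errSessionTimeout is a hypothetical error standing in for a session timeout.
var errSessionTimeout = errors.New("session timed out")

// failingStart simulates a start that times out: it reports the error and
// then cancels whichever context it was given a cancel function for.
func failingStart(cancel context.CancelFunc, errs chan<- error) {
	errs <- errSessionTimeout // errs is buffered, so this does not block
	cancel()
}

// run mimics a select loop watching both the error channel and the session
// context. It returns nil if it observes ctx.Done() instead of the error.
func run(ctx context.Context, errs <-chan error) error {
	select {
	case err := <-errs:
		return err
	case <-ctx.Done():
		return nil
	}
}

// observedErrors runs n trials and counts how often run sees the error.
// With shareCancel, start cancels the session context itself; otherwise it
// cancels a forked child context and leaves the session context alone.
func observedErrors(n int, shareCancel bool) int {
	seen := 0
	for i := 0; i < n; i++ {
		sessionCtx, sessionCancel := context.WithCancel(context.Background())
		errs := make(chan error, 1)
		startCancel := sessionCancel
		if !shareCancel {
			_, startCancel = context.WithCancel(sessionCtx)
		}
		failingStart(startCancel, errs)
		if run(sessionCtx, errs) != nil {
			seen++
		}
		sessionCancel()
	}
	return seen
}

func main() {
	fmt.Println("forked context, errors seen:", observedErrors(1000, false))
	fmt.Println("shared cancel, errors seen: ", observedErrors(1000, true))
}
```

With the forked context every trial observes the error. With the shared cancel function both cases are ready at once, so some trials take the `ctx.Done()` branch and the error is lost, which in this model corresponds to the session not being rebuilt.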

Collaborator

Is there any way that weird case can be documented in a comment somewhere? It's one of those cases where moving parts in very disparate parts of the system affect each other, and it'll be really nonobvious what's going on here in the future.

Collaborator

@dperny there's quite a long comment in start :)

Collaborator

Oh, I suppose I should look at the diff before I say things. I'm sorry, today is not a 10/10 day for me.

Comment thread agent/agent_test.go

var localDispatcher = false

// TestMain runs every test in this file twice - once with a local dispatcher, and

Collaborator

Are there separate code paths for local and remote dispatchers?

Collaborator

I think we are forking the context twice for the same reason.

Contributor Author

> Are there separate code paths for local and remote dispatchers?

Sort of, although it's not here. We are adding the extra context so we can cancel it, because closing the connection doesn't work on local connections. connectionbroker ignores closes on local connections, so if we just close the connection, a session on a local connection can't actually be restarted. This is sort of a regression test - it fails without the context changes in the rest of this PR.
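
For readers unfamiliar with the "run every test twice" pattern mentioned in this thread, here is a hedged sketch. The `runBothModes` helper, the `runner` interface, and `recordingRunner` are hypothetical; only the `localDispatcher` flag appears in the diff. The sketch assumes the test binary tolerates `Run` being called twice.

```go
package main

import "fmt"

// localDispatcher mirrors the package-level flag shown in the diff.
var localDispatcher = false

// runner is the subset of *testing.M this sketch needs; *testing.M
// satisfies it through its Run() int method.
type runner interface {
	Run() int
}

// runBothModes runs the suite once with a remote dispatcher and, if that
// pass succeeds, once more with a local dispatcher.
func runBothModes(m runner) int {
	localDispatcher = false
	if code := m.Run(); code != 0 {
		return code
	}
	localDispatcher = true
	return m.Run()
}

// recordingRunner is a hypothetical fake that records which mode was
// active on each run, so the sketch works outside a test binary.
type recordingRunner struct {
	modes []bool
}

func (r *recordingRunner) Run() int {
	r.modes = append(r.modes, localDispatcher)
	return 0
}

func main() {
	r := &recordingRunner{}
	code := runBothModes(r)
	fmt.Println("exit code:", code, "modes:", r.modes)
}
```

In a real test file this would presumably be wired up as `func TestMain(m *testing.M) { os.Exit(runBothModes(m)) }`, with each test reading `localDispatcher` when it builds its dispatcher connection.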

dperny (Collaborator) commented Apr 24, 2017

LGTM, regardless of the answer to "Are there separate code paths for local and remote dispatchers?"

cyli force-pushed the fix-reconnecting-local-sessions branch from c231906 to e86620e on April 25, 2017 22:41

aaronlehmann (Collaborator)

LGTM

cyli (Contributor Author) commented Apr 26, 2017

(there seem to be some non-spurious integration test failures - am tracking them down)

cyli force-pushed the fix-reconnecting-local-sessions branch from e86620e to 6b6a8e4 on April 27, 2017 00:45

cyli mentioned this pull request Apr 27, 2017

cyli (Contributor Author) commented Apr 27, 2017

(am going to try to write an agent test to make sure that on failure it always reconnects, after that this should be ready to go)

cyli force-pushed the fix-reconnecting-local-sessions branch from 6b6a8e4 to 8849f8a on April 29, 2017 07:25

cyli (Contributor Author) commented May 1, 2017

I've added a test that tends to fail with the bug that @dperny found. PTAL

Comment thread agent/agent_test.go Outdated
defer anotherDispatcher.SetSessionHandler(nil)
select {
case <-stream.Context().Done():
}

Collaborator

nit: replace the select with <-stream.Context().Done()

cyli force-pushed the fix-reconnecting-local-sessions branch from 8849f8a to 51309e8 on May 3, 2017 01:23

cyli (Contributor Author) commented May 4, 2017

Just checking to see if this is mergeable?

aaronlehmann merged commit 0c09e6d into moby:master on May 4, 2017

cyli (Contributor Author) commented May 4, 2017

Thank you!

cyli deleted the fix-reconnecting-local-sessions branch on May 4, 2017 23:47