-
Notifications
You must be signed in to change notification settings - Fork 656
[agent] Fix session rebuilding against a local dispatcher #2134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,9 +14,8 @@ import ( | |
| "google.golang.org/grpc/codes" | ||
| ) | ||
|
|
||
| const dispatcherRPCTimeout = 5 * time.Second | ||
|
|
||
| var ( | ||
| dispatcherRPCTimeout = 5 * time.Second | ||
| errSessionDisconnect = errors.New("agent: session disconnect") // instructed to disconnect | ||
| errSessionClosed = errors.New("agent: session closed") | ||
| ) | ||
|
|
@@ -39,12 +38,14 @@ type session struct { | |
| assignments chan *api.AssignmentsMessage | ||
| subscriptions chan *api.SubscriptionMessage | ||
|
|
||
| cancel func() // this is assumed to be never nil, and set whenever a session is created | ||
| registered chan struct{} // closed registration | ||
| closed chan struct{} | ||
| closeOnce sync.Once | ||
| } | ||
|
|
||
| func newSession(ctx context.Context, agent *Agent, delay time.Duration, sessionID string, description *api.NodeDescription) *session { | ||
| sessionCtx, sessionCancel := context.WithCancel(ctx) | ||
| s := &session{ | ||
| agent: agent, | ||
| sessionID: sessionID, | ||
|
|
@@ -54,6 +55,7 @@ func newSession(ctx context.Context, agent *Agent, delay time.Duration, sessionI | |
| subscriptions: make(chan *api.SubscriptionMessage), | ||
| registered: make(chan struct{}), | ||
| closed: make(chan struct{}), | ||
| cancel: sessionCancel, | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using this cancelfunc instead of the forked context's cancel func in start seems like it muddles responsibilities. Unless there's some overhead concern I'm not aware of, this seems fine to me as-is.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think either would work. If the session fails or times out, the session needs to be closed and restarted anyway, so I think the same session + cancellation can be applicable. I'm happy to go either way.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I went with the same session+cancellation because it seems like
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually @dperny was correct - we do need to fork the context again. If we don't, and we use the same cancel function, then in the case were a session times out: The select loop in
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any way that weird case can be documented in a comment somewhere? It's one of those cases where moving parts in very disparate parts of the system affect each other and it'll be really nonobvious what's going on here in the future
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dperny there's quite a long comment in
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh I suppose I should look at the diff before I say things I'm sorry today is not a 10/10 day for me. |
||
| } | ||
|
|
||
| // TODO(stevvooe): Need to move connection management up a level or create | ||
|
|
@@ -69,7 +71,7 @@ func newSession(ctx context.Context, agent *Agent, delay time.Duration, sessionI | |
| } | ||
| s.conn = cc | ||
|
|
||
| go s.run(ctx, delay, description) | ||
| go s.run(sessionCtx, delay, description) | ||
| return s | ||
| } | ||
|
|
||
|
|
@@ -114,6 +116,14 @@ func (s *session) start(ctx context.Context, description *api.NodeDescription) e | |
| // Note: we don't defer cancellation of this context, because the | ||
| // streaming RPC is used after this function returned. We only cancel | ||
| // it in the timeout case to make sure the goroutine completes. | ||
|
|
||
| // We also fork this context again from the `run` context, because on | ||
| // `dispatcherRPCTimeout`, we want to cancel establishing a session and | ||
| // return an error. If we cancel the `run` context instead of forking, | ||
| // then in `run` it's possible that we just terminate the function because | ||
| // `ctx` is done and hence fail to propagate the timeout error to the agent. | ||
| // If the error is not propogated to the agent, the agent will not close | ||
| // the session or rebuild a new sesssion. | ||
| sessionCtx, cancelSession := context.WithCancel(ctx) | ||
|
|
||
| // Need to run Session in a goroutine since there's no way to set a | ||
|
|
@@ -402,10 +412,10 @@ func (s *session) sendError(err error) { | |
| // of event loop. | ||
| func (s *session) close() error { | ||
| s.closeOnce.Do(func() { | ||
| s.cancel() | ||
| if s.conn != nil { | ||
| s.conn.Close(false) | ||
| } | ||
|
|
||
| close(s.closed) | ||
| }) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there separate code paths for local and remote dispatchers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are forking the context twice for the same reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort of, although it's not here. We are adding the extra context so we can cancel it because closing the connection doesn't work on local connections.
connectionbrokerignores closes on local connections, so the session can't actually be restarted if it's a local connection, if we just close the connection. This is sort of a regression test - it fails without the context changes in the rest of this PR.