[agent/dispatcher] Keep track of a node's TLS info and provide root CA changes for an agent#2077
Conversation
| "sync" | ||
| "time" | ||
|
|
||
| "github.com/docker/distribution/digest" |
There was a problem hiding this comment.
Renamed to "github.com/opencontainers/go-digest"
| // AssignmentChange is a set of changes to apply on this node. | ||
| repeated AssignmentChange changes = 4; | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
The last newline was removed from the file.
| @@ -0,0 +1,90 @@ | |||
| package main | |||
There was a problem hiding this comment.
Looks like this was added by accident
|
Also note I am probably going to pull the security config changes out into a separate PR, if this approach generally makes sense |
| @@ -0,0 +1,10 @@ | |||
| -----BEGIN CERTIFICATE----- | |||
There was a problem hiding this comment.
This looks unwanted as well
There was a problem hiding this comment.
Yep, sorry, these two have been removed.
| CertIssuerPublicKey: newIssuer.PublicKey, | ||
| CertIssuerSubject: newIssuer.Subject, | ||
| TrustRootHash: newRootCA.Digest.String(), | ||
| } |
There was a problem hiding this comment.
I get worried about blocking writes like this that aren't wrapped in a select. We've had a few bugs on shutdown where writes to channels can hang. In this case, I think there will be problems if the agent shuts down at the moment we try to write on this channel. Let's make sure there's a strategy to ensure that either a channel write is guaranteed to have a goroutine receiving it, or there's another case in a select that handles the receiver going away.
| NotifyTLSChange <-chan *api.NodeTLSInfo | ||
|
|
||
| // NotifyRootChange channel receives new trust root certificate changes from session assignments | ||
| NotifyRootChange chan<- []byte |
There was a problem hiding this comment.
I don't see any code that initializes this. Not sure if it's an oversight or if this part is just not written yet.
There was a problem hiding this comment.
Now I do see it. Maybe it wasn't in the version I was looking at a few seconds ago.
There was a problem hiding this comment.
(this should be in node.go - apologies, I had some of the changes partially stashed - should be pushed now)
| } | ||
| } | ||
| if len(message.RootCA) > 0 && digest.FromBytes(message.RootCA).String() != currentTLSInfo.TrustRootHash { | ||
| a.config.NotifyRootChange <- message.RootCA |
There was a problem hiding this comment.
I believe this channel send is okay. The goroutine in node that receives from it won't return until the agent shuts down.
| keyMgrUpdates, keyMgrCancel := d.keyMgrQueue.Watch() | ||
| defer keyMgrCancel() | ||
| rootCAUpdates, rootCACancel := d.rootCAQueue.Watch() | ||
| defer rootCACancel() |
There was a problem hiding this comment.
The dispatcher changes look fine overall, but I'm wondering if we might combine these three channels into one. Each queue spawns its own goroutine. We could send a structure over the single channel that has fields for managers, the node object, and the root CA. Having such a structure would also make it clearer what types are involved.
| manager *manager.Manager | ||
| notifyNodeChange chan *api.Node // used to send role updates from the dispatcher api on promotion/demotion | ||
| notifyTLSChange chan *api.NodeTLSInfo // used to send tls info to the agent on TLS credential when the node changes its trust root | ||
| notifyRootCAChange chan []byte // used to send role updates from the dispatcher api on root CA change |
There was a problem hiding this comment.
This doesn't seem to be initialized. I think it should be created in New like notifyNodeChange.
| renewCert() | ||
| case newRoot := <-n.notifyRootCAChange: | ||
| n.Lock() | ||
| role := n.role |
There was a problem hiding this comment.
notifyNodeChange is a buffered channel. If the role is changed at the same time as the root CA, this case and the one above might happen in either order. It's possible to skip this root CA change even if we are no longer a manager, or to handle it even if we have just become a manager.
I feel like it would be safer to combine these two channels into a single channel, that passes a struct with a node pointer (which may be nil) and a new root (which may be empty).
I think that on managers, we should only skip the call to UpdateRootCA, and not the send on n.notifyTLSChange. Otherwise, the agent will continue to report to the dispatcher that it is using the old TLS root. I think there will be less potential for bugs if all nodes report up-to-date information to the dispatcher. It might simplify our logic for deciding when the root rotation is complete.
|
(still in progress) |
| return | ||
| } | ||
| // get the current node description | ||
| newNodeDescription, err := a.nodeDescriptionWithHostname(ctx, nodeTLSInfo) |
There was a problem hiding this comment.
so, a.nodeDescriptionWithHostname can return nil, nil? Shouldn't we simply return inside of the if err != nil?
There was a problem hiding this comment.
Yes, I'm not entirely sure about this one...
| if !reflect.DeepEqual(nodeDescription, newNodeDescription) { | ||
| nodeDescription = newNodeDescription | ||
| // close the session | ||
| log.G(ctx).Info("agent: found node update") |
There was a problem hiding this comment.
found sounds weird. Received?
There was a problem hiding this comment.
It's actually sending the node update
| log.G(ctx).WithError(err).Error("node configure failed") | ||
| } | ||
| } | ||
| if len(message.RootCA) > 0 && (nti == nil || digest.FromBytes(message.RootCA).String() != nti.TrustRootHash) { |
There was a problem hiding this comment.
Didn't we talk about not using hashes here, and instead compare the bytes of the pub key itself?
There was a problem hiding this comment.
The agent doesn't actually have access to the root bytes - node is going to do that.
| repeated EncryptionKey network_bootstrap_keys = 4; | ||
|
|
||
| // Which root certificates to trust | ||
| bytes RootCA = 5; |
There was a problem hiding this comment.
Why does the SessionMessage need a new RootCA field? Don't the nodes know exactly who to trust from the RootCA inside of the message that comes from the renewal of the certificate?
There was a problem hiding this comment.
The NodeCertificateStatusResponse does not actually contain the root cert bytes, only the TLS certificate. We could add it to the response, but it seemed less expensive to push the trusted root cert down this way, rather than make the all the nodes have to contact the CA and all renew their TLS certificates a second time just to get the root cert.
|
|
||
| message NodeTLSInfo { | ||
| // Information about which root certs the node trusts | ||
| string trust_root_hash = 1; |
There was a problem hiding this comment.
Why the inconsistency of using bytes for the current TLS cert, but a hash for the root hash?
There was a problem hiding this comment.
Do you mean bytes on the SessionMessage vs the hash here? No particular reason to use the hash instead - I can just send the bytes. But on the SessionMessage it has to be bytes because that's how the node gets the new root certificate to trust.
| } | ||
|
|
||
| // SetWatch sets a queue on which changes to security config (roots and certificate issuer info) are published | ||
| func (s *SecurityConfig) SetWatch(q *watch.Queue) { |
There was a problem hiding this comment.
Why would we like to override the internal queue?
There was a problem hiding this comment.
It seemed easier to set watch on the security config, so the setter can clean it up, as opposed to adding a cleanup function to the security config so the watch can be closed when it's done. If we prefer that I can add it instead, but it means a lot more changes to the tests, etc.
There was a problem hiding this comment.
I don't see this used anywhere yet. I guess that's coming later?
There was a problem hiding this comment.
Yes, sorry, this PR is still in progress
| defer s.mu.Unlock() | ||
| return s.updateTLSCredentials(tlsKeyPair, issuerInfo) | ||
| err = s.updateTLSCredentials(tlsKeyPair, issuerInfo) | ||
| if err != nil && s.queue != nil { |
There was a problem hiding this comment.
Can we create a constructor for SecurityConfigs and always ensure there is a queue? (instead of sprinkling checks for nil queues throughout the code?
There was a problem hiding this comment.
(see above comment w.r.t. closing the queue)
| agent *agent.Agent | ||
| manager *manager.Manager | ||
| notifyNodeChange chan *api.Node // used to send role updates from the dispatcher api on promotion/demotion | ||
| notifyNodeChange chan *agent.NodeChanges // used to send role updates from the dispatcher api on promotion/demotion |
There was a problem hiding this comment.
Are we doing this (having node changes and TLSChanges separate) so that nodes don't rotate their certificates twice for a root rotation?
There was a problem hiding this comment.
The nodeTLSChange channel is so that the agent can report a TLS change to the dispatcher after the security config has changed. The notifyNodeChange channel is for the dispatcher to notify the node that it needs to change something about it's TLS configuration (either renew the cert, because the node's status has changed, or replace its root certificate).
There was a problem hiding this comment.
Maybe the comment could say "used by the agent to relay node updates from the dispatcher Session stream to (*Node).run"
| "github.com/docker/swarmkit/ca" | ||
| "github.com/docker/swarmkit/ca/testutils" | ||
| "github.com/docker/swarmkit/connectionbroker" | ||
| raftutils "github.com/docker/swarmkit/manager/state/raft/testutils" |
There was a problem hiding this comment.
Non-blocking: at some point we should reorganize packages so that PollFuncWithTimeout is not in raftutils.
| require.FailNow(t, fmt.Sprintf("starting agent errored with: %v", err)) | ||
| case <-agent.Ready(): | ||
| case <-time.After(5 * time.Second): | ||
| require.FailNow(t, "agent not ready withou 500 milliseconds") |
| }() | ||
| select { | ||
| case err := <-getErr: | ||
| require.FailNow(t, fmt.Sprintf("starting agent errored with: %v", err)) |
There was a problem hiding this comment.
I believe the Sprintf is not necessary. It looks like FailNow can accept printf-style formatting arguments.
There was a problem hiding this comment.
It can, but for some reason go vet does not like it :|
There was a problem hiding this comment.
Hm... maybe I just didn't update my go vet - it seems to pass now.
| return s.updateTLSCredentials(s.certificate, s.issuerInfo) | ||
| err := s.updateTLSCredentials(s.certificate, s.issuerInfo) | ||
| if err != nil && s.queue != nil { | ||
| s.queue.Publish(rootCA) |
There was a problem hiding this comment.
rootCA is only published on error?
There was a problem hiding this comment.
Nope, that's an error in the logic. It will be fixed in the next commit.
| manager *manager.Manager | ||
| notifyNodeChange chan *api.Node // used to send role updates from the dispatcher api on promotion/demotion | ||
| notifyNodeChange chan *agent.NodeChanges // used to send role updates from the dispatcher api on promotion/demotion | ||
| notifyTLSChange chan *api.NodeTLSInfo // used to send tls info to the agent on TLS credential when the node changes its trust root |
There was a problem hiding this comment.
Where does this get initialized?
There was a problem hiding this comment.
Sorry the node/security config parts of the PR aren't quite done - I was going to set this to be the watch channel instead.
| CertIssuerPublicKey: newIssuer.PublicKey, | ||
| CertIssuerSubject: newIssuer.Subject, | ||
| TrustRootHash: newRootCA.Digest.String(), | ||
| } |
There was a problem hiding this comment.
This needs to be in a select along with <-agentDone to avoid a deadlock.
| CertIssuerPublicKey: tlsIssuerInfo.PublicKey, | ||
| CertIssuerSubject: tlsIssuerInfo.Subject, | ||
| TrustRootHash: securityConfig.RootCA().Digest.String(), | ||
| } |
There was a problem hiding this comment.
There's a potential deadlock here if the agent exits before this goroutine does. Put this in a select along with <-agentDone.
Codecov Report
@@ Coverage Diff @@
## master #2077 +/- ##
==========================================
+ Coverage 54.71% 56.98% +2.26%
==========================================
Files 114 113 -1
Lines 19749 19783 +34
==========================================
+ Hits 10806 11273 +467
+ Misses 7659 7201 -458
- Partials 1284 1309 +25Continue to review full report at Codecov.
|
1ad684b to
f9af531
Compare
| ) | ||
|
|
||
| // TestExecutor is executor for integration tests | ||
| type TestExecutor struct { |
There was a problem hiding this comment.
Note: TestExecutor and TestController are just the functions from integration/exec.go, modified so that we can inject arbitrary node descriptions.
|
Apologies for taking so long - @aaronlehmann @diogomonica I think this is finally done, if you wouldn't mind taking another look. |
| require.FailNow(t, "starting agent errored with: %v", err) | ||
| case <-agent.Ready(): | ||
| case <-time.After(5 * time.Second): | ||
| require.FailNow(t, "agent not ready without 500 milliseconds") |
There was a problem hiding this comment.
The failure message looks questionable
| // updateTLSCredentials updates the client, server, and TLS credentials on a security config. This function expects | ||
| // something else to have taken out a lock on the SecurityConfig. | ||
| // UpdateTLSCredentials updates the client, server, and TLS credentials on a security config. | ||
| func (s *SecurityConfig) UpdateTLSCredentials(certificate *tls.Certificate, issuerInfo *IssuerInfo) error { |
There was a problem hiding this comment.
Is this meant to be used from other packages?
| CertIssuerPublicKey: s.issuerInfo.PublicKey, | ||
| CertIssuerSubject: s.issuerInfo.Subject, | ||
| }) | ||
| } |
There was a problem hiding this comment.
It seems like both places that call updateTLSCredentials have this code just afterwards. Can it be folded into updateTLSCredentials?
There was a problem hiding this comment.
Sure - I just updated it because just seemed a little weird to have the update be handled from a completely different function as well, but that's probably fine.
There was a problem hiding this comment.
Hm... no, nm, that was weird... Fixing I guess again? :)
There was a problem hiding this comment.
Oh wait, no, this comment just got moved to a weird place.
| CertIssuerPublicKey: issuer.PublicKey, | ||
| CertIssuerSubject: issuer.Subject, | ||
| }, nodeTLSInfo) | ||
| case <-time.After(200 * time.Millisecond): |
There was a problem hiding this comment.
I recommend a slightly higher timeout, perhaps 1 second.
| CertIssuerPublicKey: issuer.PublicKey, | ||
| CertIssuerSubject: issuer.Subject, | ||
| }, nodeTLSInfo) | ||
| case <-time.After(200 * time.Millisecond): |
There was a problem hiding this comment.
I recommend a slightly higher timeout, perhaps 1 second.
|
|
||
| "golang.org/x/net/context" | ||
|
|
||
| "os" |
There was a problem hiding this comment.
os can go with the other standard library imports above.
| target: 0 | ||
| changes: false | ||
| ignore: | ||
| -**/testutils No newline at end of file |
There was a problem hiding this comment.
This file is missing a final newline.
There was a problem hiding this comment.
Have added the newline to the end
| // If the server is sending us a ForceRenewal State, renew | ||
| if nodeChanges.Node.Certificate.Status.State == api.IssuanceStateRotate { | ||
| renewCert() | ||
| continue |
There was a problem hiding this comment.
continue doesn't seem right here, as that would ignore a change to the root certificate.
Splitting this code into a separate functions for handling node changes and root certificate changes may help make this clearer, but I fully admit I probably wouldn't bother with this myself.
| } | ||
| if n.role == role { | ||
| n.Unlock() | ||
| continue |
There was a problem hiding this comment.
continue doesn't seem right here.
| n.Unlock() | ||
| continue | ||
|
|
||
| if role == ca.ManagerRole || bytes.Equal(nodeChanges.RootCert, securityConfig.RootCA().Certs) { |
There was a problem hiding this comment.
Might be worth a comment explaining that managers pick up root certificate changes through raft.
There was a problem hiding this comment.
Suppose the manager stops running because it is demoted, but the role change is not seen yet (this is the <-m.RemovedFromRaft() case in runManager). Would we miss a root certificate update that comes just after this?
There was a problem hiding this comment.
We would miss the immediate one, but the next session message that changes the role will also come with the desired root cert, and since it is different than the one currently in security config, it should update. E.g. it should eventually get the root certificate update, although perhaps not immediately.
There was a problem hiding this comment.
But couldn't this potentially block the root rotation indefinitely?
There was a problem hiding this comment.
Apologies, could you elaborate on what would block it? Is it because there might not be a new session message, even if the role has changed?
There was a problem hiding this comment.
I misread what you said before. This sounds good.
e8c01ab to
45d8105
Compare
|
LGTM |
…TLS state, namely what roots the node trusts and the signer info of its current TLS certificate. Also update the session message to include the desired root CA for the node. Signed-off-by: cyli <ying.li@docker.com>
… certificate to all connected agents. Signed-off-by: cyli <ying.li@docker.com>
… can be notified of any changes. Signed-off-by: cyli <ying.li@docker.com>
…cription, and to restart the session if the information has changed. Signed-off-by: cyli <ying.li@docker.com>
…ave the root CA certs. Signed-off-by: cyli <ying.li@docker.com>
sets up a security config watch queue and passes the watch channel to the agent, so the agent can be notified of TLS updates. Also, when the agent notifies the node that the root CA has changed, node updates the TLS config if the node is a worker. Signed-off-by: cyli <ying.li@docker.com>
|
(rebased to make sure there are no semantic conflicts) |
This PR does the following:
Whenever a session is established with a dispatcher, the agent sends the current node's TLS info (the issuer of its TLS certificate, which root certs it trusts) as part of the node description to the dispatcher, which can update it in the store.
Whenever a node renews its TLS certificate and the cert's issuer has changed, or when the node changes the roots that it trusts, the session is restarted and the updated node TLS info sent to the dispatcher.
Whenever a dispatcher notices a root CA change, it pushes the new root certs down to the agent as part of the session message. The agent, if it detects a change from the root it trusts, will notify the node, which (if the node is a worker) will trust the new root instead.
If it's a manager, the manager will update the root CA on its own and doesn't need info from the dispatcher - however, this doesn't cause the agent to immediately re-establish a session. The periodic node update should pick it up, however.
Note that the tests will probably fail - also, the agent/dispatcher code is completely untested.
Part of #1990.