ca+node: Allow expired certs when forcing a new cluster #1919
Conversation
…ot certificate so it's more obvious whether a swarm token is mistyped (for instance, copy-and-pasted wrong). Signed-off-by: cyli <ying.li@docker.com>
… with an expiry error. Signed-off-by: cyli <ying.li@docker.com>
…eck to be bypassed (although the NotBefore time is still checked - this just bypasses the NotAfter check). This is needed in order to load TLS certificates that may be expired when forcing a new cluster. Signed-off-by: cyli <ying.li@docker.com>
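The expiry-bypass behavior described in that last commit can be pictured with a minimal sketch. This is illustrative only, not the swarmkit implementation; `checkCertDates` and its signature are made up for the example:

```go
package certutil

import (
	"crypto/x509"
	"fmt"
	"time"
)

// checkCertDates enforces the validity window of a certificate. When
// allowExpired is true (the ForceNewCluster case), only the NotBefore
// bound is enforced and the NotAfter (expiry) check is skipped, matching
// the behavior described in the commit message above.
func checkCertDates(cert *x509.Certificate, allowExpired bool) error {
	now := time.Now()
	if now.Before(cert.NotBefore) {
		return fmt.Errorf("certificate is not valid until %s", cert.NotBefore)
	}
	if !allowExpired && now.After(cert.NotAfter) {
		return fmt.Errorf("certificate expired at %s", cert.NotAfter)
	}
	return nil
}
```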
Force-pushed from e8b97c4 to 31bc84d.
@cyli Do you think we should target this for a 1.13.x release?

@aluzzardi If you mean 1.13.2 - sure, I think there's no reason not to. I'm not sure we can make 1.13.1, though, and this is probably not important enough to block it, unless a lot of folks are encountering this issue.
Force-pushed from 31bc84d to 0effc89.
```diff
 // Attempt to load certificate from disk
-securityConfig, err = ca.LoadSecurityConfig(ctx, rootCA, krw)
+securityConfig, err = ca.LoadSecurityConfig(ctx, rootCA, krw, n.config.ForceNewCluster)
```
There doesn't seem to be a test case for this, but this second call to `LoadSecurityConfig` may be for the case where the root CA file has been deleted (but the TLS keys are still on disk) and a join address is provided, so we can re-download the root CA file.
Question:
- Should we actually just download the root CA file to recover, or should this configuration just be an error?
- What should happen if the root CA is on disk but corrupt? Should it also try to generate/download a new CA and download/load the TLS certs?
This is inside `if securityConfig == nil`. My understanding is that the second call is for the case where we didn't previously have the root CA, and we just either bootstrapped one or downloaded the cert.
As to the questions, I don't have a specific opinion on the behavior in this case.
This is really key. Problems have come up before with single-node swarms whose certificates expire, which makes swarm mode features inoperable. Single-node swarms should absolutely be able to recover from this.
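For reference, the load path under discussion looks roughly like the following paraphrase. The helper names (`loadRootCA`, `downloadRootCA`, `loadTLSCreds`) are stand-ins for illustration, not swarmkit's API:

```go
package nodeinit

import "errors"

type rootCA struct{} // stand-in for the parsed root CA material

func loadRootCA() (*rootCA, error)                     { return nil, errors.New("root CA not on disk") }
func downloadRootCA(joinAddr string) (*rootCA, error)  { return &rootCA{}, nil }
func loadTLSCreds(ca *rootCA, allowExpired bool) error { return nil }

// setupSecurity paraphrases the flow: if the root CA is missing from disk
// but a join address is available, re-download it, then make the second
// LoadSecurityConfig-style call to load the TLS certs from disk
// (tolerating expiry when forcing a new cluster).
func setupSecurity(joinAddr string, forceNewCluster bool) error {
	ca, err := loadRootCA()
	if err != nil {
		if joinAddr == "" {
			return err // no way to recover the root CA
		}
		if ca, err = downloadRootCA(joinAddr); err != nil {
			return err
		}
	}
	return loadTLSCreds(ca, forceNewCluster)
}
```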
The next comment thread is on this log line:

```go
log.Warnf("TLS certificate is expired - must renew ASAP")
```
Use `log.Warn` when there is no string formatting taking place.
I'd suggest rephrasing the log message to make it clear that this renewal attempt is something that will happen internally, not an action the user needs to do.
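For instance, assuming a logrus-style logger (the exact message wording here is only a suggestion):

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// No format verbs, so Warn (not Warnf) is the right call; the wording
	// makes clear the renewal happens automatically, not by the user.
	log.Warn("TLS certificate is expired; it will be renewed automatically")

	// Warnf is appropriate once values are actually interpolated.
	log.Warnf("TLS certificate expired at %s; renewing automatically", "2017-02-01T00:00:00Z")
}
```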
```diff
 } else {
-	// If we have an expired certificate, we let's stick with the starting default in
-	// the hope that this is a temporary clock skew.
+	// If we have an expired certificate, try to renew immediately: the hope that this is a temporary clock skew, or
```
If this is clock skew, I see two cases, both of which have potential problems:
- If a different manager is issuing certs, it looks like we'd go into a renewal loop, since we reset the exponential backoff after each attempt when the certificate comes back expired.
- If this node is issuing its own certs, it would be issuing certs with validity periods based on its skewed clock, which means the rest of the world probably won't be able to communicate with it (since we're seeing an expired cert, the clock would be skewed towards the future, meaning `NotBefore` would become a timestamp in the future).
I think we should address the renewal loop issue by not retrying if we receive a certificate that appears expired.
The self-issuing case is an inherent consequence of reissuing certs if they expire. If we decide to behave this way (which I think is the best choice, overall), we probably have to live with it. The only alternatives I can think of are:
- Require manual intervention to renew a certificate after it has expired (but provide some kind of flag for this, so it doesn't require starting over, like today)
- Only automatically renew expired certificates in a single-node swarm
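A sketch of the proposed guard might look like the following. `renew` and the `backoff` type are stand-ins for swarmkit's renewer internals, and the growth policy here is a placeholder:

```go
package renewer

import (
	"context"
	"crypto/x509"
	"errors"
	"time"
)

// backoff is a minimal stand-in for an exponential backoff policy.
type backoff struct{ failures int }

func (b *backoff) Reset() { b.failures = 0 }

// Next grows linearly here just to keep the sketch short; the policy
// discussed below is actually exponential with jitter.
func (b *backoff) Next() time.Duration {
	b.failures++
	d := time.Duration(b.failures) * time.Minute
	if d > time.Hour {
		d = time.Hour
	}
	return d
}

// renewLoop sketches the guard proposed above: if a freshly renewed
// certificate still looks expired (i.e. clock skew), treat the attempt
// as a failure so the backoff keeps growing instead of being reset,
// avoiding a tight renewal loop against the CA.
func renewLoop(ctx context.Context, renew func(context.Context) (*x509.Certificate, error)) {
	var b backoff
	for {
		cert, err := renew(ctx)
		if err == nil && time.Now().After(cert.NotAfter) {
			err = errors.New("renewed certificate appears expired; suspecting clock skew")
		}
		if err == nil {
			b.Reset() // success: restart the schedule (simplified)
		}
		select {
		case <-time.After(b.Next()):
		case <-ctx.Done():
			return
		}
	}
}
```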
> I think we should address the renewal loop issue by not retrying if we receive a certificate that appears expired.
Are we assuming clock skew can eventually fix itself (for instance, maybe it can't contact the NTP server for now)? If so, rather than terminate the loop entirely, would it make sense to set the retry interval to something really long?
But if it's really clock skew, is there any need to keep renewing? Presumably the certificate will become valid once the clock skew is fixed, unless we're talking about really short expiry intervals.
But if we terminate the renewal loop, even if the clock skew goes away, we won't be able to renew when the certificate actually nears expiry. Let's say the cert expires in a month, but our clock skew is 4 months off. If we kill the renew loop because the cert looks expired, then even if the clock skew fixes itself, the cert will eventually expire because it won't be renewed unless the node is restarted.
Maybe that's ok, though, and our position is if there's a clock skew so drastic that it makes the certificate look expired, the only way to recover is to fix the clock skew and restart the node? (One has to restart, otherwise even if the clock skew fixes itself the node won't auto-renew certificates as it nears expiry)
That seems reasonable. Or we could renew every few minutes. But doing it every few seconds seems excessive.
I originally set it to renew every few seconds because it could take a little while for the cluster to start up in the ForceNewCluster case (since we have to wait for the cluster data to load).
But it renews based on exponential backoff, so given the current configuration, it should be:
0s -> 35s -> 1m5s -> 2m5s -> 4m5s
and then maxes out at 5 minutes (so it renews every 5 minutes). Although looking at the actual exponential backoff code, the wait seems to be chosen from a uniform distribution between 0 and those times, so it could be every few seconds if unlucky.
We could max it out at 1 hour instead.
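The quoted progression is consistent with a delay of Base + Factor·2^(n−1), capped at Max, with Base 5s, Factor 30s, Max 5m. That formula is inferred from the numbers in this thread, not quoted from swarmkit's backoff code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// delayAfter computes the n-th retry delay, assuming a policy of
// Base + Factor*2^(n-1), capped at Max.
func delayAfter(failures int, base, factor, max time.Duration) time.Duration {
	if failures == 0 {
		return 0
	}
	d := base + factor*time.Duration(1<<uint(failures-1))
	if d > max || d < 0 {
		d = max
	}
	return d
}

func main() {
	// Original config: base 5s, factor 30s, max 5m
	// => 0s, 35s, 1m5s, 2m5s, 4m5s, then 5m forever.
	for n := 0; n <= 5; n++ {
		ceil := delayAfter(n, 5*time.Second, 30*time.Second, 5*time.Minute)
		// The actual wait is jittered: uniform in [0, ceil], hence
		// "could be every few seconds if unlucky".
		wait := time.Duration(rand.Int63n(int64(ceil) + 1))
		fmt.Printf("attempt %d: cap %v, jittered wait %v\n", n, ceil, wait)
	}
}
```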
Oh wait, sorry for the silly enumeration of the backoff, I think I misunderstood your comment:
> it looks like we'd go into a renewal loop, since we reset the exponential backoff after each attempt when the certificate comes back expired.
If the certificate expires, then the renewal fails, since `RenewTLSConfigNow` calls `RequestAndSaveNewCertificates`, which validates the cert (including expiry) before returning. If the expiry validation fails, `RequestAndSaveNewCertificates` and thus `RenewTLSConfigNow` fail, which means the exponential backoff doesn't get reset to 0.
I've updated the backoff config to use a factor of a minute and max out at an hour. So the progression would be:
0s -> 1m5s -> 2m5s -> 4m5s -> 8m5s -> 16m5s -> 32m5s -> 1h0m0s
This may make the integration test take longer though.
Pausing 65 seconds in the integration test doesn't sound good to me. Let's make this backoff configurable by the test if necessary.
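One way to make the backoff test-configurable, as suggested. The names here are illustrative, not necessarily what swarmkit ended up with:

```go
package ca

import "time"

// backoffConfig is an illustrative stand-in for the renewer's backoff knobs.
type backoffConfig struct {
	Base, Factor, Max time.Duration
}

// renewBackoff is a package-level default that a test can override before
// starting the node, so integration tests never sit through a 65s pause:
//
//	renewBackoff = backoffConfig{Factor: 10 * time.Millisecond, Max: 50 * time.Millisecond}
var renewBackoff = backoffConfig{
	Base:   5 * time.Second,
	Factor: time.Minute,
	Max:    time.Hour,
}
```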
I misunderstood the original comment to mean that a single-node swarm would be able to start up normally and recover from an expired certificate, but I see now that this only applies if you use `ForceNewCluster`.
Force-pushed from 0effc89 to 2bcfbf2.
Can we tell it's a single node swarm by a check like the one sketched below? Or would we have to wait until the raft store was loaded to check raft members?
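A hypothetical version of the check being asked about. `joinAddr` and `memberCount` are stand-ins, not swarmkit internals, and the member count is only available once the raft store has loaded, hence the question:

```go
package nodeinit

// looksLikeSingleNodeSwarm is a hypothetical sketch of the proposed check:
// no join address means this node bootstrapped its own cluster, and a raft
// membership of one means nobody else ever joined.
func looksLikeSingleNodeSwarm(joinAddr string, memberCount int) bool {
	return joinAddr == "" && memberCount == 1
}
```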
LGTM
In that case would it make sense to do the following instead:
This way, we can get both single-node swarms and multi-node swarms with (note this logic is just on initializing the node, not in the renewal loop)
Yes, I think that would be great. I'm not sure how much complexity it would add. It could be a followup if that makes sense.
@aaronlehmann sounds good - agree a followup would probably be better |
…ith expired TLS certificates. This also changes the certificate renewal loop to continue trying to renew if the TLS certificates are expired, since the renewal will still fail unless the node happens to be the CA server (since other nodes must connect to the CA in order to renew). Signed-off-by: cyli <ying.li@docker.com>
Force-pushed from 2bcfbf2 to d6df02b.
LGTM
This is based on #1912.
This PR makes the following changes:
- Expired TLS certificates on disk can be loaded when `ForceNewCluster` is set (the `NotBefore` time is still checked; only the `NotAfter` check is bypassed).
- The certificate renewal loop keeps retrying when the certificates are expired, which allows the `ForceNewCluster`'d node to get new TLS certificates soon. This means that a leader can also recover from expired certificates. No other node can recover, however, since renewing TLS certs would mean that they have to request a new cert from the CA, and they would not be able to establish an mTLS connection to the CA due to expired certs. This means that we keep making requests to the CA even when there's no chance of recovery, which may not be desirable.

cc @aaronlehmann @LK4D4