ca+node: Allow expired certs when forcing a new cluster #1919
Conversation
…ot certificate so it's more obvious whether a swarm token is mistyped (for instance, copy-and-pasted wrong). Signed-off-by: cyli <ying.li@docker.com>
… with an expiry error. Signed-off-by: cyli <ying.li@docker.com>
…eck to be bypassed (although the NotBefore time is still checked - this just bypasses the NotAfter check). This is needed in order to load TLS certificates that may be expired when forcing a new cluster. Signed-off-by: cyli <ying.li@docker.com>
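The expiry-bypass behavior described in that last commit can be pictured with a minimal sketch. This is illustrative only, not the swarmkit implementation; `checkCertDates` and its signature are made up for the example:

```go
package certutil

import (
	"crypto/x509"
	"fmt"
	"time"
)

// checkCertDates enforces the validity window of a certificate. When
// allowExpired is true (the ForceNewCluster case), only the NotBefore
// bound is enforced and the NotAfter (expiry) check is skipped, matching
// the behavior described in the commit message above.
func checkCertDates(cert *x509.Certificate, allowExpired bool) error {
	now := time.Now()
	if now.Before(cert.NotBefore) {
		return fmt.Errorf("certificate is not valid until %s", cert.NotBefore)
	}
	if !allowExpired && now.After(cert.NotAfter) {
		return fmt.Errorf("certificate expired at %s", cert.NotAfter)
	}
	return nil
}
```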
Force-pushed from e8b97c4 to 31bc84d.
@cyli Do you think we should target this for a 1.13.x release?

@aluzzardi If you mean 1.13.2 - sure, I think there's no reason not to. I'm not sure we can make 1.13.1, though, and this is probably not important enough to block it, unless a lot of folks are encountering this issue.
Force-pushed from 31bc84d to 0effc89.
```diff
 // Attempt to load certificate from disk
-securityConfig, err = ca.LoadSecurityConfig(ctx, rootCA, krw)
+securityConfig, err = ca.LoadSecurityConfig(ctx, rootCA, krw, n.config.ForceNewCluster)
```
There doesn't seem to be a test case for this, but this second call to `LoadSecurityConfig` may be for the case where the root CA file has been deleted (but the TLS keys are still on disk) and a join address is provided, so we can re-download the root CA file.
Question:
- Should we actually just download the root CA file to recover, or should this configuration just be an error?
- What should happen if the root CA is on disk but corrupt? Should it also try to generate/download a new CA and download/load the TLS certs?
This is inside `if securityConfig == nil`. My understanding is that the second call is for the case where we didn't previously have the root CA, and we just either bootstrapped one or downloaded the cert.
As to the questions, I don't have a specific opinion on the behavior in this case.
This is really key. Problems have come up before with single-node swarms whose certificates expire, which makes swarm mode features inoperable. Single-node swarms should absolutely be able to recover from this.
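For reference, the load path under discussion looks roughly like the following paraphrase. The helper names (`loadRootCA`, `downloadRootCA`, `loadTLSCreds`) are stand-ins for illustration, not swarmkit's API:

```go
package nodeinit

import "errors"

type rootCA struct{} // stand-in for the parsed root CA material

func loadRootCA() (*rootCA, error)                     { return nil, errors.New("root CA not on disk") }
func downloadRootCA(joinAddr string) (*rootCA, error)  { return &rootCA{}, nil }
func loadTLSCreds(ca *rootCA, allowExpired bool) error { return nil }

// setupSecurity paraphrases the flow: if the root CA is missing from disk
// but a join address is available, re-download it, then make the second
// LoadSecurityConfig-style call to load the TLS certs from disk
// (tolerating expiry when forcing a new cluster).
func setupSecurity(joinAddr string, forceNewCluster bool) error {
	ca, err := loadRootCA()
	if err != nil {
		if joinAddr == "" {
			return err // no way to recover the root CA
		}
		if ca, err = downloadRootCA(joinAddr); err != nil {
			return err
		}
	}
	return loadTLSCreds(ca, forceNewCluster)
}
```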
The next comment thread is on this log line:

```go
log.Warnf("TLS certificate is expired - must renew ASAP")
```
Use `log.Warn` when there is no string formatting taking place.
I'd suggest rephrasing the log message to make it clear that this renewal attempt is something that will happen internally, not an action the user needs to do.
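For instance, assuming a logrus-style logger (the exact message wording here is only a suggestion):

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// No format verbs, so Warn (not Warnf) is the right call; the wording
	// makes clear the renewal happens automatically, not by the user.
	log.Warn("TLS certificate is expired; it will be renewed automatically")

	// Warnf is appropriate once values are actually interpolated.
	log.Warnf("TLS certificate expired at %s; renewing automatically", "2017-02-01T00:00:00Z")
}
```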
```diff
 } else {
-	// If we have an expired certificate, we let's stick with the starting default in
-	// the hope that this is a temporary clock skew.
+	// If we have an expired certificate, try to renew immediately: the hope that this is a temporary clock skew, or
```
If this is clock skew, I see two cases, both of which have potential problems:
- If a different manager is issuing certs, it looks like we'd go into a renewal loop, since we reset the exponential backoff after each attempt when the certificate comes back expired.
- If this node is issuing its own certs, it would be issuing certs with validity periods based on its skewed clock, which means the rest of the world probably won't be able to communicate with it (since we're seeing an expired cert, the clock would be skewed towards the future, meaning `NotBefore` would become a timestamp in the future).
I think we should address the renewal loop issue by not retrying if we receive a certificate that appears expired.
The self-issuing case is an inherent consequence of reissuing certs if they expire. If we decide to behave this way (which I think is the best choice, overall), we probably have to live with it. The only alternatives I can think of are:
- Require manual intervention to renew a certificate after it has expired (but provide some kind of flag for this, so it doesn't require starting over, like today)
- Only automatically renew expired certificates in a single-node swarm
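A sketch of the proposed guard might look like the following. `renew` and the `backoff` type are stand-ins for swarmkit's renewer internals, and the growth policy here is a placeholder:

```go
package renewer

import (
	"context"
	"crypto/x509"
	"errors"
	"time"
)

// backoff is a minimal stand-in for an exponential backoff policy.
type backoff struct{ failures int }

func (b *backoff) Reset() { b.failures = 0 }

// Next grows linearly here just to keep the sketch short; the policy
// discussed below is actually exponential with jitter.
func (b *backoff) Next() time.Duration {
	b.failures++
	d := time.Duration(b.failures) * time.Minute
	if d > time.Hour {
		d = time.Hour
	}
	return d
}

// renewLoop sketches the guard proposed above: if a freshly renewed
// certificate still looks expired (i.e. clock skew), treat the attempt
// as a failure so the backoff keeps growing instead of being reset,
// avoiding a tight renewal loop against the CA.
func renewLoop(ctx context.Context, renew func(context.Context) (*x509.Certificate, error)) {
	var b backoff
	for {
		cert, err := renew(ctx)
		if err == nil && time.Now().After(cert.NotAfter) {
			err = errors.New("renewed certificate appears expired; suspecting clock skew")
		}
		if err == nil {
			b.Reset() // success: restart the schedule (simplified)
		}
		select {
		case <-time.After(b.Next()):
		case <-ctx.Done():
			return
		}
	}
}
```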
> I think we should address the renewal loop issue by not retrying if we receive a certificate that appears expired.
Are we assuming clock skew can eventually fix itself (for instance, maybe it can't contact the NTP server for now)? If so, rather than terminate the loop entirely, would it make sense to set the retry interval to something really long?
But if it's really clock skew, is there any need to keep renewing? Presumably the certificate will become valid once the clock skew is fixed, unless we're talking about really short expiry intervals.
But if we terminate the renewal loop, even if the clock skew goes away, we won't be able to renew when the certificate actually nears expiry. Let's say the cert expires in a month, but our clock skew is 4 months off. If we kill the renew loop because the cert looks expired, then even if the clock skew fixes itself, the cert will eventually expire because it won't be renewed unless the node is restarted.
Maybe that's ok, though, and our position is if there's a clock skew so drastic that it makes the certificate look expired, the only way to recover is to fix the clock skew and restart the node? (One has to restart, otherwise even if the clock skew fixes itself the node won't auto-renew certificates as it nears expiry)
That seems reasonable. Or we could renew every few minutes. But doing it every few seconds seems excessive.
I originally set it to renew every few seconds because it could take a little while for the cluster to start up in the ForceNewCluster case (since we have to wait for the cluster data to load).
But it renews based on exponential backoff, so given the current configuration, it should be:
0s -> 35s -> 1m5s -> 2m5s -> 4m5s
and then maxes out at 5 minutes (so it renews every 5 minutes). Although looking at the actual exponential backoff code, the wait seems to be chosen from a uniform distribution between 0 and those times, so it could be every few seconds if unlucky.
We could max it out at 1 hour instead.
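The quoted progression is consistent with a delay of Base + Factor·2^(n−1), capped at Max, with Base 5s, Factor 30s, Max 5m. That formula is inferred from the numbers in this thread, not quoted from swarmkit's backoff code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// delayAfter computes the n-th retry delay, assuming a policy of
// Base + Factor*2^(n-1), capped at Max.
func delayAfter(failures int, base, factor, max time.Duration) time.Duration {
	if failures == 0 {
		return 0
	}
	d := base + factor*time.Duration(1<<uint(failures-1))
	if d > max || d < 0 {
		d = max
	}
	return d
}

func main() {
	// Original config: base 5s, factor 30s, max 5m
	// => 0s, 35s, 1m5s, 2m5s, 4m5s, then 5m forever.
	for n := 0; n <= 5; n++ {
		ceil := delayAfter(n, 5*time.Second, 30*time.Second, 5*time.Minute)
		// The actual wait is jittered: uniform in [0, ceil], hence
		// "could be every few seconds if unlucky".
		wait := time.Duration(rand.Int63n(int64(ceil) + 1))
		fmt.Printf("attempt %d: cap %v, jittered wait %v\n", n, ceil, wait)
	}
}
```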
Oh wait, sorry for the silly enumeration of the backoff, I think I misunderstood your comment:
> it looks like we'd go into a renewal loop, since we reset the exponential backoff after each attempt when the certificate comes back expired.
If the certificate expires, then the renewal fails, since `RenewTLSConfigNow` calls `RequestAndSaveNewCertificates`, which validates the cert (including expiry) before returning. If the expiry validation fails, `RequestAndSaveNewCertificates` and thus `RenewTLSConfigNow` fail, which means the exponential backoff doesn't get reset to 0.
I've updated the backoff config to use a factor of a minute and max out at an hour. So the progression would be:
0s -> 1m5s -> 2m5s -> 4m5s -> 8m5s -> 16m5s -> 32m5s -> 1h0m0s
This may make the integration test take longer though.
Pausing 65 seconds in the integration test doesn't sound good to me. Let's make this backoff configurable by the test if necessary.
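One way to make the backoff test-configurable, as suggested. The names here are illustrative, not necessarily what swarmkit ended up with:

```go
package ca

import "time"

// backoffConfig is an illustrative stand-in for the renewer's backoff knobs.
type backoffConfig struct {
	Base, Factor, Max time.Duration
}

// renewBackoff is a package-level default that a test can override before
// starting the node, so integration tests never sit through a 65s pause:
//
//	renewBackoff = backoffConfig{Factor: 10 * time.Millisecond, Max: 50 * time.Millisecond}
var renewBackoff = backoffConfig{
	Base:   5 * time.Second,
	Factor: time.Minute,
	Max:    time.Hour,
}
```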
I misunderstood the original comment to mean that a single-node swarm would be able to start up normally and recover from an expired certificate, but I see now that this only applies if you use `ForceNewCluster`.
Force-pushed from 0effc89 to 2bcfbf2.
Can we tell it's a single node swarm by a check like the one sketched below? Or would we have to wait until the raft store was loaded to check raft members?
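A hypothetical version of the check being asked about. `joinAddr` and `memberCount` are stand-ins, not swarmkit internals, and the member count is only available once the raft store has loaded, hence the question:

```go
package nodeinit

// looksLikeSingleNodeSwarm is a hypothetical sketch of the proposed check:
// no join address means this node bootstrapped its own cluster, and a raft
// membership of one means nobody else ever joined.
func looksLikeSingleNodeSwarm(joinAddr string, memberCount int) bool {
	return joinAddr == "" && memberCount == 1
}
```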
LGTM
In that case would it make sense to do the following instead:
This way, we can get both single-node swarms and multi-node swarms with (note this logic is just on initializing the node, not in the renewal loop)
Yes, I think that would be great. I'm not sure how much complexity it would add. It could be a followup if that makes sense.
@aaronlehmann sounds good - agree a followup would probably be better |
…ith expired TLS certificates. This also changes the certificate renewal loop to continue trying to renew if the TLS certificates are expired, since the renewal will still fail unless the node happens to be the CA server (since other nodes must connect to the CA in order to renew). Signed-off-by: cyli <ying.li@docker.com>
Force-pushed from 2bcfbf2 to d6df02b.
LGTM
This is based on #1912.
This PR makes the following changes:
- Expired TLS certificates on disk can be loaded when `ForceNewCluster` is set (the `NotBefore` time is still checked; only the `NotAfter` check is bypassed).
- The certificate renewal loop keeps retrying when the certificates are expired, which allows the `ForceNewCluster`'d node to get new TLS certificates soon. This means that a leader can also recover from expired certificates. No other node can recover, however, since renewing TLS certs would mean that they have to request a new cert from the CA, and they would not be able to establish an mTLS connection to the CA due to expired certs. This means that we keep making requests to the CA even when there's no chance of recovery, which may not be desirable.

cc @aaronlehmann @LK4D4