
ca+node: Allow expired certs when forcing a new cluster#1919

Merged
LK4D4 merged 4 commits into
moby:masterfrom
cyli:allow-expired-certs-when-forcing-new-cluster
Feb 2, 2017
Conversation

@cyli
Contributor

@cyli cyli commented Feb 2, 2017

This is based on #1912.

This PR makes the following changes:

  1. Allows a node to start up with expired certs if ForceNewCluster is set
  2. Changes the certificate renewal loop to renew immediately if the certificate is expired. This allows a ForceNewCluster'd node to get new TLS certificates soon, which means that a leader can also recover from expired certificates. No other node can recover, however, since renewing TLS certs means requesting a new cert from the CA, and a node with expired certs cannot establish an mTLS connection to the CA. This means that we keep making requests to the CA even when there's no chance of recovery, which may not be desirable.

cc @aaronlehmann @LK4D4

cyli added 3 commits February 1, 2017 02:47
…ot certificate so

it's more obvious whether a swarm token is mistyped (for instance, copy-and-pasted wrong).

Signed-off-by: cyli <ying.li@docker.com>
… with an expiry error.

Signed-off-by: cyli <ying.li@docker.com>
…eck to be bypassed

(although the NotBefore time is still checked - this just bypasses the NotAfter check).
This is needed in order to load TLS certificates that may be expired when forcing a new
cluster.

Signed-off-by: cyli <ying.li@docker.com>
@cyli cyli force-pushed the allow-expired-certs-when-forcing-new-cluster branch 2 times, most recently from e8b97c4 to 31bc84d Compare February 2, 2017 02:08
@aluzzardi
Member

@cyli Do you think we should target this for a 1.13.x release?

@cyli
Contributor Author

cyli commented Feb 2, 2017

@aluzzardi If you mean 1.13.2 - sure, I think there's no reason not to. I'm not sure we can make 1.13.1, though, and this is probably not important enough to block it, unless a lot of folks are encountering this issue.

@cyli cyli force-pushed the allow-expired-certs-when-forcing-new-cluster branch from 31bc84d to 0effc89 Compare February 2, 2017 02:39
Comment thread node/node.go

  // Attempt to load certificate from disk
- securityConfig, err = ca.LoadSecurityConfig(ctx, rootCA, krw)
+ securityConfig, err = ca.LoadSecurityConfig(ctx, rootCA, krw, n.config.ForceNewCluster)
Contributor Author

There doesn't seem to be a test case for this, but this second call to LoadSecurityConfig may be for the case where the root CA file has been deleted (but the TLS keys are still on disk) and a join address is provided, so we can re-download the root CA file.

Question:

  1. Should we actually just download the root CA file to recover, or should this configuration just be an error?
  2. What should happen if the root CA is on disk but corrupt? Should it also try to generate/download a new CA and download/load the TLS certs?

Collaborator

This is inside if securityConfig == nil. My understanding is that the second call is for the case where we didn't previously have the root CA, and we just either bootstrapped one or downloaded the cert.

As to the questions, I don't have a specific opinion on the behavior in this case.

@cyli cyli changed the title Allow expired certs when forcing a new cluster ca+node: Allow expired certs when forcing a new cluster Feb 2, 2017
@aaronlehmann
Collaborator

This means that a leader can also recover from expired certificates.

This is really key. Problems have come up before with single-node swarms whose certificates expire, which makes swarm mode features inoperable. Single-node swarms should absolutely be able to recover from this.

Comment thread ca/config.go Outdated
  return
  }

+ log.Warnf("TLS certificate is expired - must renew ASAP")
Collaborator

Use log.Warn when there is no string formatting taking place.

I'd suggest rephrasing the log message to make it clear that this renewal attempt is something that will happen internally, not an action the user needs to do.

Comment thread ca/config.go
  } else {
- // If we have an expired certificate, we let's stick with the starting default in
- // the hope that this is a temporary clock skew.
+ // If we have an expired certificate, try to renew immediately: the hope that this is a temporary clock skew, or
Collaborator

If this is clock skew, I see two cases, both of which have potential problems:

  • If a different manager is issuing certs, it looks like we'd go into a renewal loop, since we reset the exponential backoff after each attempt when the certificate comes back expired.
  • If this node is issuing its own certs, it would be issuing certs with validity periods based on its skewed clock, which means the rest of the world probably won't be able to communicate with it (since we're seeing an expired cert, the clock would be skewed towards the future, meaning NotBefore would become a timestamp in the future).

I think we should address the renewal loop issue by not retrying if we receive a certificate that appears expired.

The self-issuing case is an inherent consequence of reissuing certs if they expire. If we decide to behave this way (which I think is the best choice, overall), we probably have to live with it. The only alternatives I can think of are:

  • Require manual intervention to renew a certificate after it has expired (but provide some kind of flag for this, so it doesn't require starting over, like today)
  • Only automatically renew expired certificates in a single-node swarm

Contributor Author

I think we should address the renewal loop issue by not retrying if we receive a certificate that appears expired.

Are we assuming clock skew can eventually fix itself (for instance, maybe it can't contact the NTP server for now)? If so, rather than terminate the loop entirely, would it make sense to set the retry interval to something really long?

Collaborator

But if it's really clock skew, is there any need to keep renewing? Presumably the certificate will become valid once the clock skew is fixed, unless we're talking about really short expiry intervals.

Contributor Author

@cyli cyli Feb 2, 2017

But if we terminate the renewal loop, even if the clock skew goes away, we won't be able to renew when the certificate actually nears expiry. Say the cert expires in a month, but our clock skew is 4 months off. If we kill the renewal loop because the cert looks expired, then even if the clock skew fixes itself, the cert will eventually expire because it won't be renewed unless the node is restarted.

Contributor Author

Maybe that's ok, though, and our position is if there's a clock skew so drastic that it makes the certificate look expired, the only way to recover is to fix the clock skew and restart the node? (One has to restart, otherwise even if the clock skew fixes itself the node won't auto-renew certificates as it nears expiry)

Collaborator

That seems reasonable. Or we could renew every few minutes. But doing it every few seconds seems excessive.

Contributor Author

@cyli cyli Feb 2, 2017

I originally set it to renew every few seconds because it could take a little while for the cluster to start up in the ForceNewCluster case (since we have to wait for the cluster data to load).

But it renews based on exponential backoff, so given the current configuration, it should be:

  1. 0s
  2. 35s
  3. 1m5s
  4. 2m5s
  5. 4m5s

and then maxes out at 5 minutes (so it renews every 5 minutes). Although looking at the exponential backoff code, the actual delay seems to be chosen from a uniform distribution between 0 and those times, so it could be every few seconds if we're unlucky.

We could max it out at 1 hour instead.

Contributor Author

@cyli cyli Feb 2, 2017

Oh wait, sorry for the silly enumeration of the backoff, I think I misunderstood your comment:

it looks like we'd go into a renewal loop, since we reset the exponential backoff after each attempt when the certificate comes back expired.

If the certificate is expired, then the renewal fails, since RenewTLSConfigNow calls RequestAndSaveNewCertificates, which validates the cert (including expiry) before returning. If it fails the expiry validation, RequestAndSaveNewCertificates and thus RenewTLSConfigNow fail, which means the exponential backoff doesn't get reset to 0.

Contributor Author

I've updated the backoff config to use a factor of a minute and max out at an hour. So the progression would be:

0s -> 1m5s -> 2m5s -> 4m5s -> 8m5s -> 16m5s -> 32m5s -> 1h0m0s

This may make the integration test take longer though.

Collaborator

Pausing 65 seconds in the integration test doesn't sound good to me. Let's make this backoff configurable by the test if necessary.

@aaronlehmann
Collaborator

This means that a leader can also recover from expired certificates.
This is really key. Problems have come up before with single-node swarms whose certificates expire, which makes swarm mode features inoperable. Single-node swarms should absolutely be able to recover from this.

I misunderstood the original comment to mean that a single-node swarm would be able to start up normally and recover from an expired certificate, but I see now that this only applies if you use --force-new-cluster. I like the idea of having a single-node swarm be able to recover automatically, but given the risks and tradeoffs, I think the approach in this PR is appropriate. We should document the steps to recover from an expired certificate, though.

@cyli cyli force-pushed the allow-expired-certs-when-forcing-new-cluster branch from 0effc89 to 2bcfbf2 Compare February 2, 2017 19:28
@cyli
Contributor Author

cyli commented Feb 2, 2017

I like the idea of having a single-node swarm be able to recover automatically

Can we tell it's a single-node swarm by the following:

  1. checking to see if it's a manager
  2. checking its peer list to ensure that there are no peers?

Or would we have to wait until the raft store was loaded to check raft members?

@aaronlehmann
Collaborator

The node does persist a list of managers (see persistentRemotes), so I guess we could do this without consulting raft.

@aaronlehmann
Collaborator

LGTM

@cyli
Contributor Author

cyli commented Feb 2, 2017

node does persist a list of managers (see persistentRemotes), so I guess we could do this without consulting raft

In that case would it make sense to do the following instead:

  1. load the cert always without checking expiry
  2. if the loaded cert indicates that the node is a manager, and there are other nodes in the peer list, and if ForceNewCluster is not set, fail if the loaded cert is expired

This way, we can get both single-node swarms and multi-node swarms with ForceNewCluster enabled.

(note this logic is just on initializing the node, not in the renewal loop)
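The proposed start-up policy can be sketched as a single predicate. The function and parameter names are hypothetical, purely to illustrate the logic described above, not swarmkit's actual API:

```go
package main

import "fmt"

// shouldRejectExpiredCert encodes the proposed initialization check: always
// load the cert without checking expiry, then fail only when the cert is
// expired, the node is a manager with known peers, and --force-new-cluster
// was not given. Single-node swarms and forced new clusters may proceed.
func shouldRejectExpiredCert(isManager bool, peerCount int, forceNewCluster, expired bool) bool {
	if !expired || forceNewCluster {
		return false
	}
	return isManager && peerCount > 0
}

func main() {
	fmt.Println(shouldRejectExpiredCert(true, 0, false, true)) // false: single-node swarm may start
	fmt.Println(shouldRejectExpiredCert(true, 2, false, true)) // true: multi-node manager without --force-new-cluster
	fmt.Println(shouldRejectExpiredCert(true, 2, true, true))  // false: --force-new-cluster bypasses the check
}
```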

@aaronlehmann
Collaborator

Yes, I think that would be great. I'm not sure how much complexity it would add. It could be a followup if that makes sense.

@cyli
Contributor Author

cyli commented Feb 2, 2017

@aaronlehmann sounds good - agree a followup would probably be better

…ith expired TLS certificates.

This also changes the certificate renewal loop to keep trying to renew if the TLS certificates
are expired, even though the renewal will still fail unless the node happens to be the CA server
(since other nodes must connect to the CA over mTLS in order to renew).

Signed-off-by: cyli <ying.li@docker.com>
@cyli cyli force-pushed the allow-expired-certs-when-forcing-new-cluster branch from 2bcfbf2 to d6df02b Compare February 2, 2017 22:30
@LK4D4
Contributor

LK4D4 commented Feb 2, 2017

LGTM

@LK4D4 LK4D4 merged commit 9334c8c into moby:master Feb 2, 2017
@cyli cyli deleted the allow-expired-certs-when-forcing-new-cluster branch February 2, 2017 23:41