[controlapi] Allow starting and aborting root CA rotations during swarm update API#2052
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2052 +/- ##
=========================================
+ Coverage 54.21% 54.51% +0.3%
=========================================
Files 111 112 +1
Lines 19354 19497 +143
=========================================
+ Hits 10492 10629 +137
+ Misses 7597 7590 -7
- Partials 1265 1278 +13Continue to review full report at Codecov.
|
| // ForceRotate is a counter that triggers an root CA rotation even if no relevant | ||
| // parameters have been in the spec. This will force the manager to generate a new | ||
| // certificate and key, if none have been provided. | ||
| uint64 force_rotate = 5;} |
There was a problem hiding this comment.
Missing newline here
|
|
||
| // ForceRotationVersion matches the Cluster Spec's CAConfig's ForceRotation counter. | ||
| // It indicates when the current CA cert and key were generated (or updated). | ||
| uint64 force_rotation_version = 6; |
There was a problem hiding this comment.
Maybe just RotationVersion?
Maybe LastForcedRotation is better, actually, because that is less likely to be confused with the Version field in Meta.
There was a problem hiding this comment.
You're right, we probably only need the top level one.
| bytes cross_signed_ca_cert = 3 [(gogoproto.customname) = "CrossSignedCACert"]; | ||
| // ForceRotationVersion matches the Cluster Spec's CAConfig's ForceRotation counter. | ||
| // It indicates when the current to-be-rotated CA cert and key were generated (or updated). | ||
| uint64 force_rotation_version = 6; |
There was a problem hiding this comment.
Why does this appear in both RootRotation and one level above it? Is that intentional?
There was a problem hiding this comment.
Never mind, read the comments... But I'm still surprised we would need to keep track of both. Will read the code next.
|
ping @aluzzardi for API review I think this approach makes sense. It lets a client request a rotation in a declarative way, either by providing a new key/cert or asking the manager to generate one. We also discussed passing flags to controlapi's |
1e44ff4 to
4d986d2
Compare
diogomonica
left a comment
There was a problem hiding this comment.
In general LGTM, but I would love to have a deeper look at validateAndUpdateCA with a high-level diagram that describes the four big blocks of that function.
| repeated ExternalCA external_cas = 2 [(gogoproto.customname) = "ExternalCAs"]; | ||
|
|
||
| // SigningCACert is the desired CA certificate to be used as the root and | ||
| // signing CA for the swarm. If not provided, |
| // to sign certificates for the swarm | ||
| bytes signing_ca_key = 4 [(gogoproto.customname) = "SigningCAKey"]; | ||
|
|
||
| // ForceRotate is a counter that triggers an root CA rotation even if no relevant |
|
|
||
| // LastForcedRotation matches the Cluster Spec's CAConfig's ForceRotation counter. | ||
| // It indicates when the current CA cert and key were generated (or updated). | ||
| uint64 last_forced_rotation = 6; |
There was a problem hiding this comment.
aren't all rotations forced? why not just last_rotation?
There was a problem hiding this comment.
I don't particularly feel strongly - the name is to make it clearer that it corresponds with the ForceRotation value in the config.
| "github.com/docker/swarmkit/log" | ||
| ) | ||
|
|
||
| var minRootExpiration = helpers.OneYear |
There was a problem hiding this comment.
My first instinct was: this is reasonable. But on further consideration, and given that fact that:
- This is the root CA
- We have the ability of rotating at any time if compromised
This makes me think that maybe we should help our users not shoot themselves in the foot and increase the minimum a bit more (3 years)? Just a thought though, I'm 100% ok with 1 year.
/cc @sul3n3t
There was a problem hiding this comment.
I also don't feel strongly - happy to change it to 3 years too. I just picked some value. :)
| // Creates and updates the api.RootRotation object and the cross-signed intermediate. | ||
| // This function assumes that a root rotation is not already in progress, and that the root | ||
| // cert and key to be rotated in or the external CAs in the cluster have already been validated | ||
| func updateRootRotationObject(ctx context.Context, securityConfig *ca.SecurityConfig, cluster *api.Cluster, newRootCA ca.RootCA, version uint64) error { |
There was a problem hiding this comment.
Should this be a method on cluster?
There was a problem hiding this comment.
I dislike that this is not a method on cluster but has side-effects on cluster.
There was a problem hiding this comment.
The api module is just data, and I don't know if we should be adding this to that module, particular could create import issues since then nothing from ca would be able to import anything from api, since api would then depend in ca.
| // whether it's a locally signing CA or an external CA. The key can be replaced with the desired key (which will | ||
| // either be the correct key or nil) without having to do any kind of root rotation, so we can abort any outstanding | ||
| // root rotations and just finish the validation. | ||
| if bytes.Equal(cluster.RootCA.CACert, newConfig.SigningCACert) { |
There was a problem hiding this comment.
can we just change cluster at will?
| return grpc.Errorf(codes.InvalidArgument, "CA certificate expires too soon") | ||
| } | ||
|
|
||
| // If the desired cert has the same public key and subject as current root, then the private signer is the same |
There was a problem hiding this comment.
Isn't this the exact same conditional as the one on line 147?
There was a problem hiding this comment.
In this case, the cert has changed, it just has the same subject and public key. One such example is if the cert is renewed. In this case, leaf certs issued by the previous CA cert will still correctly validate, but the cert digest has changed, and hence join tokens need to be re-generated.
There was a problem hiding this comment.
So: Root CA Cert is expiring, and we self-sign a longer lived one?
| return grpc.Errorf(codes.InvalidArgument, "the desired CA certificate cannot contain multiple certificates") | ||
| } | ||
|
|
||
| // If the desired cert is the same as the current root CA cert then the private signer is the same |
There was a problem hiding this comment.
What are the situations for each of these? Why doesn't this one update the tokens?
There was a problem hiding this comment.
It's byte-for-byte the exact same cert, so the tokens do not need to be re-generated, since the hash is the same.
| @@ -0,0 +1,623 @@ | |||
| package controlapi | |||
There was a problem hiding this comment.
Thanks for making this table-driven.
| { | ||
| rootCA: initialLocalRootCA, | ||
| oldCAConfig: api.CAConfig{ForceRotate: 1}, | ||
| expectErrorString: "ForceRotate value must be nondecreasing", |
There was a problem hiding this comment.
what about testing the same?
There was a problem hiding this comment.
Ignore, the same is a valid case.
There was a problem hiding this comment.
Added the same value test case to the valid cases.
|
@aaronlehmann @cyli I commented it inline, but why not have a simpler logic of bumping the rotation version to kick a rotation, and if valid Cert/Key are provided we use them, if not, we generate new ones. Not sure I see the advantage of having two mechanisms. |
6e8ae5e to
0bf8867
Compare
|
@cyli: Is this still WIP? |
0bf8867 to
60d5cdd
Compare
|
@aaronlehmann It wasn't before - just pushed the latest :) So it is now, thanks for checking! |
60d5cdd to
01833ab
Compare
|
|
||
| // TODO(cyli): in go 1.8 this will be easier as we can use *url.URL.Port() and *url.URL.Hostname() | ||
| func getConnectionStringFromHost(hostport string) string { | ||
| colon := strings.IndexByte(hostport, ':') |
There was a problem hiding this comment.
Use net.SplitHostPort
There was a problem hiding this comment.
Oh nice! Thanks! I didn't know about that one - go 1.8's code doesn't use it either. :D
| func getConnectionStringFromHost(hostport string) string { | ||
| colon := strings.IndexByte(hostport, ':') | ||
| if colon == -1 { | ||
| return fmt.Sprintf("%s:443", hostport) // no port, default to 443 |
There was a problem hiding this comment.
Use net.JoinHostPort
| } | ||
| conn, err := tls.DialWithDialer(dialer, "tcp", getConnectionStringFromHost(parsed.Host), tlsOpts) | ||
| if conn != nil { | ||
| defer conn.Close() |
There was a problem hiding this comment.
It shouldn't be necessary to defer this.
| case api.CAConfig: | ||
| return len(b.SigningCACert) > 0 && len(b.SigningCAKey) == 0 | ||
| default: | ||
| return false |
| if err == nil { | ||
| return true | ||
| } | ||
| log.G(ctx).WithError(err).Debugf("external CA # %d is unreachable or invalid", i+1) |
There was a problem hiding this comment.
Seems like a potentially useful log message. Any downsides to making it a warning?
There was a problem hiding this comment.
Not particularly - this may be noisy if we have to re-validate every time, but perhaps if I reflect.DeepEqual(<oldCAConfig spec>, <newCAConfig spec> in UpdateCluster it would be fine? That'd assume nothing else goes and changes api.RootCA in the store (which is probably a fair assumption).
There was a problem hiding this comment.
This is in controlapi, so at most you would get a warning per unreachable CA every time the cluster object is updated through the API, right?
There was a problem hiding this comment.
Yes. Do we want to warn every time a cluster update is attempted, or just when the the user attempts to change the CAConfig?
There was a problem hiding this comment.
I don't see any harm in warning every time the cluster is updated, as long as the unreachable CAs continue to be referenced in the spec. We're checking the CAs anyway, so if we discover that one is unreachable we might as well expose that information.
There was a problem hiding this comment.
Sounds good, although this is on attempted spec update as well, not just when the spec update succeeds. I'm ok with that though, just wanted to make sure it's clear.
| // Do not copy secret key | ||
| // Do not copy secret keys | ||
| redactedSpec := cluster.Spec.Copy() | ||
| redactedSpec.CAConfig.SigningCAKey = nil |
There was a problem hiding this comment.
do we have a test for this being redacted?
There was a problem hiding this comment.
yup, it's there, never mind!
| grpclog.SetLogger(log.New(ioutil.Discard, "", log.LstdFlags)) | ||
| logrus.SetOutput(ioutil.Discard) | ||
| } | ||
| // func init() { |
There was a problem hiding this comment.
why do we have commented out code?
There was a problem hiding this comment.
Oops sorry, I needed this while debugging. Forgot to uncomment it, thanks!
| JoinTokens: api.JoinTokens{ | ||
| Worker: ca.GenerateJoinToken(newCARootCA), | ||
| Manager: ca.GenerateJoinToken(newCARootCA), | ||
| }, |
There was a problem hiding this comment.
Are the join tokens being regenerated because the digest of the certificate might be changing even if the subject and public key are the same?
There was a problem hiding this comment.
I guess the digest is guaranteed to change in this case, because otherwise the bytes.Equal above would return true, right?
There was a problem hiding this comment.
After some discussion with @diogomonica, it may not be worth it to check this particular case. We're optimizing for not having to generate a new cross-signed certificate, but it might simplify the code if we don't try to do this optimization and just do a full root rotation object instead. How would you feel about removing this check?
| "github.com/docker/swarmkit/log" | ||
| ) | ||
|
|
||
| var minRootExpiration = 3 * helpers.OneYear |
There was a problem hiding this comment.
I regret my comment. Maybe 1 year makes this more flexible?
| "reflect" | ||
| "testing" | ||
| "time" | ||
|
|
| "github.com/docker/swarmkit/ca" | ||
| "github.com/docker/swarmkit/ca/testutils" | ||
| "github.com/docker/swarmkit/manager/state/store" | ||
| digest "github.com/opencontainers/go-digest" |
There was a problem hiding this comment.
digest shouldn't be necessary.
There was a problem hiding this comment.
Good point, thanks! My auto-importer insists on adding it every time I save. :)
| require.NoError(t, err, valid.description) | ||
|
|
||
| // ensure that the cluster was not mutated | ||
| require.True(t, reflect.DeepEqual(valid.rootCA, cluster.RootCA)) |
There was a problem hiding this comment.
BTW require.Equal will do the same thing, and show a diff if there is an unexpected change.
|
|
||
| // Because join tokens are random, we can't predict exactly what it is, so this needs to be manually checked | ||
| require.Equal(t, valid.expectJoinTokenChange, !reflect.DeepEqual(result.JoinTokens, valid.rootCA.JoinTokens), | ||
| fmtString, valid.description, result.JoinTokens, valid.rootCA.JoinTokens) |
There was a problem hiding this comment.
Slightly more verbose, but I think this would be easier to read and more useful as:
if valid.expectJoinTokenChange {
require.NotEqual(t, result.JoinTokens, valid.rootCA.JoinTokens, fmtString, valid.description, result.JoinTokens, valid.rootCA.JoinTokens)
} else {
require.Equal(t, result.JoinTokens, valid.rootCA.JoinTokens, fmtString, valid.description, result.JoinTokens, valid.rootCA.JoinTokens)
}| if valid.expectedGeneratedCross || valid.expectGeneratedRootRotation { // both generate cross signed certs | ||
| require.NotNil(t, result.RootRotation, valid.description) | ||
| require.True(t, len(result.RootRotation.CrossSignedCACert) > 0, | ||
| "%s: if there is a root rotation object there must be a cross-signed cert", valid.description) |
| // just assert that the value has changed. | ||
| if valid.expectGeneratedRootRotation { | ||
| require.NotNil(t, result.RootRotation, valid.description) | ||
| require.False(t, reflect.DeepEqual(valid.rootCA.RootRotation, result.RootRotation), valid.description) |
|
|
||
| require.True(t, reflect.DeepEqual(result, &valid.expectRootCA), | ||
| "should be equal: "+fmtString, valid.description, result, &valid.expectRootCA) | ||
| } |
| testcase.rootCA = getRootCAWithRotation(testcase.rootCA, rotationCert, rotationKey, crossSigned) | ||
| testcases[2].description = "different cert than root rotation contents, even if internal->external, results in replacing root rotation" | ||
| } | ||
| } |
There was a problem hiding this comment.
I don't understand the purpose of this loop.
There was a problem hiding this comment.
That's fair - I'm just testing that if the desired cert is different than the current root rotation cert, then the root rotation is replaced. That can just be 1 extra case, instead of 3 extra cases.
There was a problem hiding this comment.
Oh sorry, unless you mean because I'm only checking the last one (testCases[2:])? That was something accidentally left over from when I was debugging. Regardless, I probably don't need the extra cases anyway.
There was a problem hiding this comment.
Just confused about why you're looping and then doing something different for each value of the loop counter. I think I'm missing something.
There was a problem hiding this comment.
I was trying to preserve the original testcases' configurations and expected root CA values (e.g. I was too lazy to type it again). It was to test that whatever the initial RootRotation value was, because the cert is different, it will just be replaced completely by a new root rotation with differentInitialCert, and that any external CAs required by the old root rotation will no longer be required.
I think this can probably be replaced by a single case.
| return nil, grpc.Errorf(codes.InvalidArgument, "the desired CA certificate cannot contain multiple certificates") | ||
| } | ||
|
|
||
| // because newRootCA was parsed without error, this should not fail either |
There was a problem hiding this comment.
Is this saying that NewRootCA parses SigningCACert, so we expect parsing it again will succeed?
There was a problem hiding this comment.
Yes - that's much clearer, thanks. I'll update the comment to say that.
There was a problem hiding this comment.
My preference is to check the error anyway, as a future-proofing step.
There was a problem hiding this comment.
Ok, can do that as well.
8bd77a9 to
acae0c6
Compare
| if err != nil { | ||
| return err | ||
| } | ||
| if port == "" { |
There was a problem hiding this comment.
I'm not sure this is reachable - it looks like SplitHostPort returns an error in this case. I expected that if SplitHostPort returns an error, we would assume the address does not have a port.
There was a problem hiding this comment.
Oh hm... I didn't realize a port was required. It also looks like it returns an error if there are too many colons, though, and that does not seem to be something url.Parse will return an error on: https://play.golang.org/p/lAmzyII0-z.
There was a problem hiding this comment.
Poking a bit more I guess go1.8's URL.Hostname() interprets "http://x:1:2" as a single host with no port, at least SplitHostPort will assume the same thing.
And I guess the dialer will fail if it's a weird hostname.
| port = "443" //https | ||
| } | ||
|
|
||
| conn, err := tls.DialWithDialer(dialer, "tcp", net.JoinHostPort(host, port), tlsOpts) |
There was a problem hiding this comment.
Add comment as to why we use TCP vs HTTP?
There was a problem hiding this comment.
Added it to the comment for the function (at the top of this function).
|
This LGTM /cc @aluzzardi |
acae0c6 to
03c8d56
Compare
| if err != nil { | ||
| // It either has no port or is otherwise invalid (e.g. too many colons). If it's otherwise invalid the dialer | ||
| // will error later, so just assume it's no port and set the port to the default HTTPS port. | ||
| port = "443" |
There was a problem hiding this comment.
Let's also have host = parsed.Host here. I don't know if SplitHostPort returns the argument as host if it also returns an error (and even if it does, it seems better not to rely on the other values when an error is returned).
03c8d56 to
621a36d
Compare
|
LGTM |
|
Feel free to reorganize the commits. |
621a36d to
8f3f0a7
Compare
The cluster RootCA object will reflect the current status of the actual cluster RootCA, and the goal is to converge towards the desired CA cert and key. Signed-off-by: cyli <ying.li@docker.com>
…update, which ensures that 1. A minimum of required external CAs are provided and reachable and valid. 2. If a desired signing certificate and/or signing key are provided, that they are valid. If the CAConfig spec differs from the current root CA + current root rotation configuration, this function will also start, update, or abort a root rotation. It returns a new api.RootCA object (which may be completely unchanged from the current one) to replace the existing api.RootCA object in the cluster. Signed-off-by: cyli <ying.li@docker.com>
…Cluster function. Also make sure that when listing and getting clusters, the CAConfig's signing key and the root rotation object's signing key are redacted, if present. Signed-off-by: cyli <ying.li@docker.com>
8f3f0a7 to
1b1c030
Compare
|
Just had a quick glance, overall LGTM |
This adds the ability to specify, using the
ClusterSpec, whether the swarm root CA should be rotated, and if so, what the new cert and key should be.If no desired cert or key are provided, then a pair will be generated so long as the
ForceRotatecounter is bumped up.If a desired cert is provided, the
ForceRotatecounter needs not be bumped, but it would not break anything if it were.This is stacked on top of #2050 and #2051.Update: These have been merged.Remaining TODOs:
ClusterUpdatethat at least one of the remote CAs on the list is up.