-
Notifications
You must be signed in to change notification settings - Fork 78
feat: add support for Talos CA rotation #1991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add support for Talos CA rotation #1991
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements support for rotating Talos CA (Certificate Authority) certificates across cluster nodes. The implementation adds a multi-phase rotation process (PRE_ROTATE, ROTATE, POST_ROTATE) that ensures zero-downtime certificate rotation by managing both old and new CAs during the transition period.
Key changes:
- New controllers for managing secret rotation status and machine-specific rotation states
- Extended ClusterSecrets resource to track rotation phases and store both current and rotated secrets
- CLI commands for initiating and monitoring CA rotation
- Frontend UI components to display rotation progress
Reviewed changes
Copilot reviewed 38 out of 40 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/backend/runtime/omni/controllers/omni/secrets/secret_rotation_status.go | New controller managing cluster-wide rotation orchestration |
| internal/backend/runtime/omni/controllers/omni/secrets.go | Extended to handle rotation phase transitions |
| internal/backend/runtime/omni/controllers/omni/cluster_machine_config.go | Adds rotation logic to machine configurations |
| internal/pkg/siderolink/trustd/*.go | Updates certificate handling to support dual CAs during rotation |
| client/api/omni/specs/omni.proto | Defines new protobuf resources for rotation |
| client/pkg/omnictl/cluster/secret/*.go | New CLI commands for rotation management |
| frontend/src/views/cluster/Overview/components/OverviewContent.vue | UI for displaying rotation status |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/backend/runtime/omni/controllers/omni/cluster_machine_config.go
Show resolved
Hide resolved
internal/backend/runtime/omni/controllers/omni/secrets/rotation_status.go
Show resolved
Hide resolved
4a1c94f to
db57ed6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 62 out of 64 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
frontend/src/views/cluster/Overview/components/OverviewContent.vue
Outdated
Show resolved
Hide resolved
| const getCurrentComponent = (item: Resource<OngoingTaskSpec>) => { | ||
| if (item.spec.secret_rotation) { | ||
| switch (item.spec.secret_rotation.component) { | ||
| case 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using case ClusterSecretsRotationStatusSpecComponent.TALOS_CA: would be more clear
frontend/src/views/cluster/Overview/components/OverviewContent.vue
Outdated
Show resolved
Hide resolved
frontend/src/views/cluster/Overview/components/OverviewContent.vue
Outdated
Show resolved
Hide resolved
| func (c *Candidates) filter(filterFunc func(candidate Candidate) bool) []string { | ||
| var cp, w []string | ||
|
|
||
| for _, candidate := range c.Candidates { | ||
| if filterFunc(candidate) { | ||
| if candidate.ControlPlane { | ||
| cp = append(cp, candidate.Hostname) | ||
| } else { | ||
| w = append(w, candidate.Hostname) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if len(cp) > 0 { | ||
| return cp | ||
| } | ||
|
|
||
| return w | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit unclear to me. This does something beyond filtering, there's business logic in it, but is not reflected in the function. I feel like this type can be reworked (maybe removed even) for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment explaining the why
| if state.IsNotFoundError(err) { // need to wait for the secret rotation status to be created | ||
| return xerrors.NewTagged[qtransform.SkipReconcileTag](err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this probably needs to be moved up
0838b4d to
b161380
Compare
Slessi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frontend LTGM 👍
d5b82f1 to
c96b5e6
Compare
| qtransform.WithExtraMappedInput[*omni.ClusterMachineStatus]( | ||
| mappers.MapByClusterLabel[*omni.ClusterSecrets](), | ||
| ), | ||
| qtransform.WithExtraMappedInput[*omni.MachineStatus]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend using system.ResourceLabels[*omni.MachineStatus instead, as you don't use anything other than labels there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah wait, it's not used anywhere actually. I guess we don't need this input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in the internal package secretrotation where we are building the talos client for validation.
if err != nil {
if state.IsNotFoundError(err) {
return nil, xerrors.NewTagged[qtransform.SkipReconcileTag](err)
}
return nil, fmt.Errorf("failed to get machine status for machine %q: %w", machineID, err)
}
address := machineStatus.TypedSpec().Value.ManagementAddress```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see.
ClusterMachineStatus also has ManagementAddress IIRC.
|
|
||
| for _, candidate := range rotationsToUpdate.Candidates { | ||
| secretRotation, ok := secretRotationsMap[candidate.MachineID] | ||
| if !ok || (secretRotation != nil && secretRotation.TypedSpec().Value.Phase == clusterSecrets.TypedSpec().Value.RotationPhase) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit hard to understand, can we please add a comment here?
| allDestroyed := true | ||
|
|
||
| for cmStatus := range cmStatuses.All() { | ||
| if cmStatus.Metadata().Phase() == resource.PhaseRunning { | ||
| allDestroyed = false | ||
|
|
||
| continue | ||
| } | ||
|
|
||
| var destroyed bool | ||
|
|
||
| destroyed, err = ctrl.handleDestroy(ctx, r, cmStatus.Metadata().ID()) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| if !destroyed { | ||
| allDestroyed = false | ||
| } | ||
| } | ||
|
|
||
| if !allDestroyed { | ||
| return xerrors.NewTagged[qtransform.SkipReconcileTag](fmt.Errorf("waiting for all cluster machine statuses to be deleted before destroying machine secrets rotations resources")) | ||
| } | ||
|
|
||
| return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just do?
allDestroyed, err := helpers.TeardownAndDestroyAll(ctx, r, cmStatuses.Pointers())
if !allDestroyed {
return xerrors.NewTagged[qtransform.SkipReconcileTag](fmt.Errorf("waiting for all cluster machine statuses to be deleted before destroying machine secrets rotations resources"))
}
for res := range cmStatuses.Pointers() {
if err := r.RemoveFinalizer(ctx, omni.NewClusterMachineStatus(resources.DefaultNamespace, res.Metadata().ID()).Metadata(), ctrl.Name()); err != nil && !state.IsNotFoundError(err) {
return err
}
}
return nilI know that it will remove all cluster machine status finalizers in the end, when all resources are deleted. Does it slow down something, or does something get stuck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would block ClusterMachineStatuses from being deleted before ALL of the ClusterMachineSecretRotations are torn down and destroyed. It'd work with that limitation. Not sure if we should go this direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should go this direction.
I think it should be fine, but I'm not insisting. The change I proposed doesn't save too much anyways.
| } | ||
|
|
||
| // ClusterMachineStatus is being deleted, delete the corresponding ClusterMachineSecretsRotation | ||
| if machineStatus.Metadata().Phase() == resource.PhaseTearingDown { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should rename the variable too, to avoid confusion
| rotationToUpdate := rotationsToUpdate.Candidates[0] | ||
|
|
||
| if !rotationToUpdate.Ready { | ||
| logger.Warn("Waiting for machine to become ready", zap.String("machine", rotationToUpdate.MachineID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should start with lowercase letter. I think we usually use lowercase everywhere
6bf7c73 to
b0f2127
Compare
| return downloadedBackupData, nil | ||
| } | ||
|
|
||
| func (s *SecretsController) handleCARotation( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a thought: what if we put this swapping cert logic into the rotation_status controller?
Secrets will be kept clean this way --will just read whatever is set as primary cert and secondary from the rotation status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise the logic gets scattered across two controllers.
Add support for Talos CA rotation Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
b0f2127 to
f2078d5
Compare
| secretsBundle, err := omni.ToSecretsBundle(clusterSecrets.TypedSpec().Value.GetData()) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| acceptedCAs := []*x509.PEMEncodedCertificate{{Crt: secretsBundle.Certs.OS.Crt}} | ||
|
|
||
| if clusterSecrets.TypedSpec().Value.RotationPhase != specs.ClusterSecretsRotationStatusSpec_OK { | ||
| var rotateSecretsBundle *secrets.Bundle | ||
|
|
||
| rotateSecretsBundle, err = omni.ToSecretsBundle(clusterSecrets.TypedSpec().Value.GetRotateData()) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| acceptedCAs = append(acceptedCAs, &x509.PEMEncodedCertificate{Crt: rotateSecretsBundle.Certs.OS.Crt}) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll start with easy ones. I would propose to build the list of accepted CAs in the Secrets controller.
| secretsBundle, err := omni.ToSecretsBundle(clusterSecrets.TypedSpec().Value.GetData()) | |
| if err != nil { | |
| return nil, err | |
| } | |
| acceptedCAs := []*x509.PEMEncodedCertificate{{Crt: secretsBundle.Certs.OS.Crt}} | |
| if clusterSecrets.TypedSpec().Value.RotationPhase != specs.ClusterSecretsRotationStatusSpec_OK { | |
| var rotateSecretsBundle *secrets.Bundle | |
| rotateSecretsBundle, err = omni.ToSecretsBundle(clusterSecrets.TypedSpec().Value.GetRotateData()) | |
| if err != nil { | |
| return nil, err | |
| } | |
| acceptedCAs = append(acceptedCAs, &x509.PEMEncodedCertificate{Crt: rotateSecretsBundle.Certs.OS.Crt}) | |
| } | |
| secretsBundle, err := getSecretsBundle(ctx, h.state, tcpAddr.IP.String()) | |
| if err != nil { | |
| return err | |
| } | |
| acceptedCAs := clusterSecrets.TypedSpec().Value.AcceptedCAs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need acceptedCAs for each component. In this case this is CAs for Talos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can also combine acceptedCAs into a single []byte slice to avoid doing join in all the places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, at least we can reuse it here I guess: https://github.com/siderolabs/omni/pull/1991/changes#diff-6d353c79a6b1e8967f874ccdbdd3d8807c570e56aa4166a658d23378a5649530R88
| secretsBundle, err := omni.ToSecretsBundle(secrets.TypedSpec().Value.GetData()) | ||
| if err != nil { | ||
| return nil, nil, err | ||
| } | ||
|
|
||
| clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secretBundle.Certs.OS, roles, certificateValidity) | ||
| clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secretsBundle.Certs.OS, roles, certificateValidity) | ||
| if err != nil { | ||
| return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", err) | ||
| } | ||
|
|
||
| return clientCert, secretBundle.Certs.OS.Crt, nil | ||
| acceptedCAs := []*talosx509.PEMEncodedCertificate{{Crt: secretsBundle.Certs.OS.Crt}} | ||
|
|
||
| // While rotating secrets, use both the old and new CA certificates in Talosconfig | ||
| // This is to ensure that connectivity with Talos is never lost regardless of the issuing CA used for apid server certificate | ||
| if secrets.TypedSpec().Value.RotationPhase != specs.ClusterSecretsRotationStatusSpec_OK { | ||
| rotateSecretsBundle, rotateErr := omni.ToSecretsBundle(secrets.TypedSpec().Value.GetRotateData()) | ||
| if rotateErr != nil { | ||
| return nil, nil, rotateErr | ||
| } | ||
|
|
||
| acceptedCAs = append(acceptedCAs, &talosx509.PEMEncodedCertificate{Crt: rotateSecretsBundle.Certs.OS.Crt}) | ||
|
|
||
| // At this stage all Talos nodes should have their acceptedCAs field updated. So we can create the client cert using the new CA. | ||
| if secrets.TypedSpec().Value.RotationPhase == specs.ClusterSecretsRotationStatusSpec_ROTATE { | ||
| clientCert, rotateErr = talossecrets.NewAdminCertificateAndKey(time.Now(), rotateSecretsBundle.Certs.OS, roles, certificateValidity) | ||
| if rotateErr != nil { | ||
| return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", rotateErr) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return clientCert, bytes.Join( | ||
| xslices.Map( | ||
| acceptedCAs, | ||
| func(cert *talosx509.PEMEncodedCertificate) []byte { | ||
| return cert.Crt | ||
| }, | ||
| ), | ||
| nil, | ||
| ), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we can probably reuse the generated AcceptedCAs.
And also we can set the active MachineCA in the Secrets controller.
| secretsBundle, err := omni.ToSecretsBundle(secrets.TypedSpec().Value.GetData()) | |
| if err != nil { | |
| return nil, nil, err | |
| } | |
| clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secretBundle.Certs.OS, roles, certificateValidity) | |
| clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secretsBundle.Certs.OS, roles, certificateValidity) | |
| if err != nil { | |
| return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", err) | |
| } | |
| return clientCert, secretBundle.Certs.OS.Crt, nil | |
| acceptedCAs := []*talosx509.PEMEncodedCertificate{{Crt: secretsBundle.Certs.OS.Crt}} | |
| // While rotating secrets, use both the old and new CA certificates in Talosconfig | |
| // This is to ensure that connectivity with Talos is never lost regardless of the issuing CA used for apid server certificate | |
| if secrets.TypedSpec().Value.RotationPhase != specs.ClusterSecretsRotationStatusSpec_OK { | |
| rotateSecretsBundle, rotateErr := omni.ToSecretsBundle(secrets.TypedSpec().Value.GetRotateData()) | |
| if rotateErr != nil { | |
| return nil, nil, rotateErr | |
| } | |
| acceptedCAs = append(acceptedCAs, &talosx509.PEMEncodedCertificate{Crt: rotateSecretsBundle.Certs.OS.Crt}) | |
| // At this stage all Talos nodes should have their acceptedCAs field updated. So we can create the client cert using the new CA. | |
| if secrets.TypedSpec().Value.RotationPhase == specs.ClusterSecretsRotationStatusSpec_ROTATE { | |
| clientCert, rotateErr = talossecrets.NewAdminCertificateAndKey(time.Now(), rotateSecretsBundle.Certs.OS, roles, certificateValidity) | |
| if rotateErr != nil { | |
| return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", rotateErr) | |
| } | |
| } | |
| } | |
| return clientCert, bytes.Join( | |
| xslices.Map( | |
| acceptedCAs, | |
| func(cert *talosx509.PEMEncodedCertificate) []byte { | |
| return cert.Crt | |
| }, | |
| ), | |
| nil, | |
| ), nil | |
| clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secrets.TypedSpec().Value.MachineCA, roles, certificateValidity) | |
| if err != nil { | |
| return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", err) | |
| } | |
| acceptedCAs := secrets.TypedSpec().Value.AcceptedCAs | |
| return clientCert, acceptedCAs, nil |
Add support for Talos CA rotation
Closes: #220