Conversation

@oguzkilcan
Member

Add support for Talos CA rotation

Closes: #220

Copilot AI left a comment

Pull request overview

This PR implements support for rotating Talos CA (Certificate Authority) certificates across cluster nodes. The implementation adds a multi-phase rotation process (PRE_ROTATE, ROTATE, POST_ROTATE) that ensures zero-downtime certificate rotation by managing both old and new CAs during the transition period.

Key changes:

  • New controllers for managing secret rotation status and machine-specific rotation states
  • Extended ClusterSecrets resource to track rotation phases and store both current and rotated secrets
  • CLI commands for initiating and monitoring CA rotation
  • Frontend UI components to display rotation progress
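
As a rough illustration of the phase sequence described above, the sketch below models the rotation as a small state machine. The `Phase` type and the `next` and `acceptedCAs` helpers are hypothetical names chosen for this example, not the PR's actual types. The sketch assumes that both the current and the rotated CA are trusted whenever a rotation is in progress.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the rotation phase tracked on ClusterSecrets.
type Phase int

const (
	PhaseOK Phase = iota // no rotation in progress
	PhasePreRotate
	PhaseRotate
	PhasePostRotate
)

// String returns a readable name for the phase.
func (p Phase) String() string {
	switch p {
	case PhaseOK:
		return "OK"
	case PhasePreRotate:
		return "PRE_ROTATE"
	case PhaseRotate:
		return "ROTATE"
	case PhasePostRotate:
		return "POST_ROTATE"
	default:
		return "UNKNOWN"
	}
}

// next returns the phase that follows p. After POST_ROTATE (or any unknown
// value) the rotation is considered finished and the phase returns to OK.
func next(p Phase) Phase {
	switch p {
	case PhaseOK:
		return PhasePreRotate
	case PhasePreRotate:
		return PhaseRotate
	case PhaseRotate:
		return PhasePostRotate
	default:
		return PhaseOK
	}
}

// acceptedCAs lists the CAs trusted in phase p. Assumption: outside of OK,
// both the current and the rotated CA are accepted.
func acceptedCAs(p Phase, current, rotated string) []string {
	if p == PhaseOK {
		return []string{current}
	}

	return []string{current, rotated}
}

func main() {
	p := PhaseOK

	// Walk through one full rotation cycle.
	for i := 0; i < 4; i++ {
		fmt.Printf("%s: accepted CAs %v\n", p, acceptedCAs(p, "current-ca", "rotated-ca"))
		p = next(p)
	}
}
```

Keeping the transition logic in one function makes it easy to check that the dual-CA window covers every phase in which nodes may still present certificates from either CA.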

Reviewed changes

Copilot reviewed 38 out of 40 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| internal/backend/runtime/omni/controllers/omni/secrets/secret_rotation_status.go | New controller managing cluster-wide rotation orchestration |
| internal/backend/runtime/omni/controllers/omni/secrets.go | Extended to handle rotation phase transitions |
| internal/backend/runtime/omni/controllers/omni/cluster_machine_config.go | Adds rotation logic to machine configurations |
| internal/pkg/siderolink/trustd/*.go | Updates certificate handling to support dual CAs during rotation |
| client/api/omni/specs/omni.proto | Defines new protobuf resources for rotation |
| client/pkg/omnictl/cluster/secret/*.go | New CLI commands for rotation management |
| frontend/src/views/cluster/Overview/components/OverviewContent.vue | UI for displaying rotation status |

@oguzkilcan oguzkilcan added the integration/e2e-rotate-ca Triggers all e2e CA rotation tests for Omni label Dec 15, 2025
@oguzkilcan oguzkilcan force-pushed the feature/support-talos-ca-rotation branch from 4a1c94f to db57ed6 Compare December 15, 2025 14:07
@oguzkilcan oguzkilcan marked this pull request as ready for review December 15, 2025 14:07
@oguzkilcan oguzkilcan requested a review from Slessi as a code owner December 15, 2025 14:07
@github-project-automation github-project-automation bot moved this to To Do in Planning Dec 15, 2025
@talos-bot talos-bot moved this from To Do to In Review in Planning Dec 15, 2025
@oguzkilcan oguzkilcan requested a review from Copilot December 15, 2025 14:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 62 out of 64 changed files in this pull request and generated no new comments.

const getCurrentComponent = (item: Resource<OngoingTaskSpec>) => {
if (item.spec.secret_rotation) {
switch (item.spec.secret_rotation.component) {
case 1:
Member

Using `case ClusterSecretsRotationStatusSpecComponent.TALOS_CA:` would be clearer.

Comment on lines +76 to +91
func (c *Candidates) filter(filterFunc func(candidate Candidate) bool) []string {
	var cp, w []string

	for _, candidate := range c.Candidates {
		if filterFunc(candidate) {
			if candidate.ControlPlane {
				cp = append(cp, candidate.Hostname)
			} else {
				w = append(w, candidate.Hostname)
			}
		}
	}

	if len(cp) > 0 {
		return cp
	}

	return w
}
Member

@utkuozdemir utkuozdemir Dec 15, 2025

This is a bit unclear to me. It does more than filtering: there is business logic in it that the function name doesn't reflect. I feel like this type could be reworked (maybe even removed) for simplicity.

Member Author

I'll add a comment explaining the why
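
One possible way to make the behaviour explicit is to name the function after what it returns and document the fallback. The sketch below is a simplified, hypothetical variant: the `preferControlPlanes` name and the trimmed `Candidate` struct are assumptions made for illustration, not code from the PR. It returns the matching control plane hostnames when there are any and falls back to the matching workers otherwise.

```go
package main

import "fmt"

// Candidate is a trimmed-down, hypothetical version of the type in the PR.
type Candidate struct {
	Hostname     string
	ControlPlane bool
	Ready        bool
}

// preferControlPlanes returns the hostnames of candidates accepted by match.
// If at least one control plane node matches, only control plane hostnames
// are returned; workers are returned only when no control plane matches.
func preferControlPlanes(candidates []Candidate, match func(Candidate) bool) []string {
	var controlPlanes, workers []string

	for _, c := range candidates {
		if !match(c) {
			continue
		}

		if c.ControlPlane {
			controlPlanes = append(controlPlanes, c.Hostname)
		} else {
			workers = append(workers, c.Hostname)
		}
	}

	if len(controlPlanes) > 0 {
		return controlPlanes
	}

	return workers
}

func main() {
	candidates := []Candidate{
		{Hostname: "cp-1", ControlPlane: true, Ready: true},
		{Hostname: "worker-1", Ready: false},
		{Hostname: "worker-2", Ready: false},
	}

	notReady := func(c Candidate) bool { return !c.Ready }

	// No control plane matches the predicate, so the result falls back to workers.
	fmt.Println(preferControlPlanes(candidates, notReady))
}
```

With the precedence spelled out in the name and the doc comment, callers no longer have to read the body to learn that workers are dropped whenever a control plane node matches.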

Comment on lines +244 to +241
if state.IsNotFoundError(err) { // need to wait for the secret rotation status to be created
	return xerrors.NewTagged[qtransform.SkipReconcileTag](err)
}
Member

I think this probably needs to be moved up

@oguzkilcan oguzkilcan force-pushed the feature/support-talos-ca-rotation branch 5 times, most recently from 0838b4d to b161380 Compare December 15, 2025 19:40
Member

@Slessi Slessi left a comment

Frontend LGTM 👍

@github-project-automation github-project-automation bot moved this from In Review to Approved in Planning Dec 15, 2025
@oguzkilcan oguzkilcan force-pushed the feature/support-talos-ca-rotation branch 2 times, most recently from d5b82f1 to c96b5e6 Compare December 16, 2025 13:43
qtransform.WithExtraMappedInput[*omni.ClusterMachineStatus](
mappers.MapByClusterLabel[*omni.ClusterSecrets](),
),
qtransform.WithExtraMappedInput[*omni.MachineStatus](
Member

@Unix4ever Unix4ever Dec 16, 2025

I'd recommend using system.ResourceLabels[*omni.MachineStatus] instead, as you don't use anything other than labels there.

Member

Ah wait, it's not used anywhere actually. I guess we don't need this input.

Member Author

It's used in the internal package secretrotation where we are building the talos client for validation.

```go
	if err != nil {
		if state.IsNotFoundError(err) {
			return nil, xerrors.NewTagged[qtransform.SkipReconcileTag](err)
		}

		return nil, fmt.Errorf("failed to get machine status for machine %q: %w", machineID, err)
	}

	address := machineStatus.TypedSpec().Value.ManagementAddress
```

Member

ah, I see.

ClusterMachineStatus also has ManagementAddress IIRC.


for _, candidate := range rotationsToUpdate.Candidates {
secretRotation, ok := secretRotationsMap[candidate.MachineID]
if !ok || (secretRotation != nil && secretRotation.TypedSpec().Value.Phase == clusterSecrets.TypedSpec().Value.RotationPhase) {
Member

This is a bit hard to understand, can we please add a comment here?

Comment on lines +233 to +253
allDestroyed := true

for cmStatus := range cmStatuses.All() {
	if cmStatus.Metadata().Phase() == resource.PhaseRunning {
		allDestroyed = false

		continue
	}

	var destroyed bool

	destroyed, err = ctrl.handleDestroy(ctx, r, cmStatus.Metadata().ID())
	if err != nil {
		return err
	}

	if !destroyed {
		allDestroyed = false
	}
}

if !allDestroyed {
	return xerrors.NewTagged[qtransform.SkipReconcileTag](fmt.Errorf("waiting for all cluster machine statuses to be deleted before destroying machine secrets rotations resources"))
}

return nil
Member

Can't we just do this?

allDestroyed, err := helpers.TeardownAndDestroyAll(ctx, r, cmStatuses.Pointers())
if err != nil {
  return err
}

if !allDestroyed {
  return xerrors.NewTagged[qtransform.SkipReconcileTag](fmt.Errorf("waiting for all cluster machine statuses to be deleted before destroying machine secrets rotations resources"))
}

for res := range cmStatuses.Pointers() {
  if err := r.RemoveFinalizer(ctx, omni.NewClusterMachineStatus(resources.DefaultNamespace, res.Metadata().ID()).Metadata(), ctrl.Name()); err != nil && !state.IsNotFoundError(err) {
    return err
  }
}

return nil

I know that it will remove all cluster machine status finalizers in the end, when all resources are deleted. Does it slow down something, or does something get stuck?

Member Author

Can you please elaborate?

Member Author

This would block ClusterMachineStatuses from being deleted before ALL of the ClusterMachineSecretRotations are torn down and destroyed. It'd work with that limitation. Not sure if we should go this direction.

Member

> Not sure if we should go this direction.

I think it should be fine, but I'm not insisting. The change I proposed doesn't save too much anyways.

}

// ClusterMachineStatus is being deleted, delete the corresponding ClusterMachineSecretsRotation
if machineStatus.Metadata().Phase() == resource.PhaseTearingDown {
Member

we should rename the variable too, to avoid confusion

rotationToUpdate := rotationsToUpdate.Candidates[0]

if !rotationToUpdate.Ready {
logger.Warn("Waiting for machine to become ready", zap.String("machine", rotationToUpdate.MachineID))
Member

@Unix4ever Unix4ever Dec 16, 2025

The log message should start with a lowercase letter. I think we usually use lowercase everywhere.

@oguzkilcan oguzkilcan force-pushed the feature/support-talos-ca-rotation branch 2 times, most recently from 6bf7c73 to b0f2127 Compare December 17, 2025 15:07
return downloadedBackupData, nil
}

func (s *SecretsController) handleCARotation(
Member

@Unix4ever Unix4ever Dec 17, 2025

I've got a thought: what if we put this cert-swapping logic into the rotation_status controller?
Secrets will be kept clean this way: it will just read whatever is set as the primary and secondary cert from the rotation status.

Member

Otherwise the logic gets scattered across two controllers.

Add support for Talos CA rotation

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
@oguzkilcan oguzkilcan force-pushed the feature/support-talos-ca-rotation branch from b0f2127 to f2078d5 Compare December 18, 2025 08:33
Comment on lines +100 to +116
secretsBundle, err := omni.ToSecretsBundle(clusterSecrets.TypedSpec().Value.GetData())
if err != nil {
	return nil, err
}

acceptedCAs := []*x509.PEMEncodedCertificate{{Crt: secretsBundle.Certs.OS.Crt}}

if clusterSecrets.TypedSpec().Value.RotationPhase != specs.ClusterSecretsRotationStatusSpec_OK {
	var rotateSecretsBundle *secrets.Bundle

	rotateSecretsBundle, err = omni.ToSecretsBundle(clusterSecrets.TypedSpec().Value.GetRotateData())
	if err != nil {
		return nil, err
	}

	acceptedCAs = append(acceptedCAs, &x509.PEMEncodedCertificate{Crt: rotateSecretsBundle.Certs.OS.Crt})
}
Member

@Unix4ever Unix4ever Dec 18, 2025

I'll start with easy ones. I would propose to build the list of accepted CAs in the Secrets controller.

Suggested change (replaces the lines quoted above):
secretsBundle, err := getSecretsBundle(ctx, h.state, tcpAddr.IP.String())
if err != nil {
	return err
}

acceptedCAs := clusterSecrets.TypedSpec().Value.AcceptedCAs

Member Author

We'd need acceptedCAs for each component. In this case these are the CAs for Talos.

Member

I think we can also combine acceptedCAs into a single []byte slice to avoid doing join in all the places.
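
A minimal sketch of that idea, assuming the CAs are PEM-encoded and each block already ends with a newline, so they can simply be concatenated. The `pemCert` type and the `joinCAs` helper are hypothetical stand-ins introduced for this example; the PR itself uses `talosx509.PEMEncodedCertificate`.

```go
package main

import (
	"bytes"
	"fmt"
)

// pemCert is a hypothetical stand-in for a PEM-encoded certificate holder.
type pemCert struct {
	Crt []byte
}

// joinCAs concatenates the PEM blocks of all CAs into one byte slice, so
// consumers can use the bundle directly instead of joining it themselves.
func joinCAs(cas []*pemCert) []byte {
	parts := make([][]byte, 0, len(cas))

	for _, ca := range cas {
		parts = append(parts, ca.Crt)
	}

	// A nil separator is enough when every PEM block ends with a newline.
	return bytes.Join(parts, nil)
}

func main() {
	bundle := joinCAs([]*pemCert{
		{Crt: []byte("-----BEGIN CERTIFICATE-----\nold\n-----END CERTIFICATE-----\n")},
		{Crt: []byte("-----BEGIN CERTIFICATE-----\nnew\n-----END CERTIFICATE-----\n")},
	})

	fmt.Print(string(bundle))
}
```

Storing the joined bundle once would let every consumer read the same bytes rather than repeating the `bytes.Join` call.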

Comment on lines +68 to +107
secretsBundle, err := omni.ToSecretsBundle(secrets.TypedSpec().Value.GetData())
if err != nil {
	return nil, nil, err
}

- clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secretBundle.Certs.OS, roles, certificateValidity)
+ clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secretsBundle.Certs.OS, roles, certificateValidity)
if err != nil {
	return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", err)
}

- return clientCert, secretBundle.Certs.OS.Crt, nil
acceptedCAs := []*talosx509.PEMEncodedCertificate{{Crt: secretsBundle.Certs.OS.Crt}}

// While rotating secrets, use both the old and new CA certificates in Talosconfig
// This is to ensure that connectivity with Talos is never lost regardless of the issuing CA used for apid server certificate
if secrets.TypedSpec().Value.RotationPhase != specs.ClusterSecretsRotationStatusSpec_OK {
	rotateSecretsBundle, rotateErr := omni.ToSecretsBundle(secrets.TypedSpec().Value.GetRotateData())
	if rotateErr != nil {
		return nil, nil, rotateErr
	}

	acceptedCAs = append(acceptedCAs, &talosx509.PEMEncodedCertificate{Crt: rotateSecretsBundle.Certs.OS.Crt})

	// At this stage all Talos nodes should have their acceptedCAs field updated. So we can create the client cert using the new CA.
	if secrets.TypedSpec().Value.RotationPhase == specs.ClusterSecretsRotationStatusSpec_ROTATE {
		clientCert, rotateErr = talossecrets.NewAdminCertificateAndKey(time.Now(), rotateSecretsBundle.Certs.OS, roles, certificateValidity)
		if rotateErr != nil {
			return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", rotateErr)
		}
	}
}

return clientCert, bytes.Join(
	xslices.Map(
		acceptedCAs,
		func(cert *talosx509.PEMEncodedCertificate) []byte {
			return cert.Crt
		},
	),
	nil,
), nil
Member

@Unix4ever Unix4ever Dec 18, 2025

Here we can probably reuse the generated AcceptedCAs.
And also we can set the active MachineCA in the Secrets controller.

Suggested change (replaces the lines quoted above):
clientCert, err := talossecrets.NewAdminCertificateAndKey(time.Now(), secrets.TypedSpec().Value.MachineCA, roles, certificateValidity)
if err != nil {
	return nil, nil, fmt.Errorf("error generating Talos API certificate: %w", err)
}

acceptedCAs := secrets.TypedSpec().Value.AcceptedCAs

return clientCert, acceptedCAs, nil
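
To illustrate why both CAs are accepted during a rotation, here is a self-contained sketch using only the Go standard library. It is not the PR's code: the `newCA`, `issueLeaf` and `trusts` helpers are hypothetical, and the certificates are throwaway ones generated in memory. It shows that a trust pool containing both CAs verifies server certificates issued by either one, while a pool with only the old CA rejects certificates issued by the new one.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed CA certificate and its private key.
func newCA(name string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}

	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}

	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}

	cert, err := x509.ParseCertificate(der)

	return cert, key, err
}

// issueLeaf signs a short-lived server certificate with the given CA.
func issueLeaf(ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}

	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "node"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}

	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}

	return x509.ParseCertificate(der)
}

// trusts reports whether leaf chains to any of the given CAs.
func trusts(leaf *x509.Certificate, cas ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, ca := range cas {
		pool.AddCert(ca)
	}

	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool})

	return err == nil
}

func main() {
	oldCA, oldKey, err := newCA("old-ca")
	if err != nil {
		panic(err)
	}

	rotatedCA, rotatedKey, err := newCA("new-ca")
	if err != nil {
		panic(err)
	}

	leafOld, err := issueLeaf(oldCA, oldKey)
	if err != nil {
		panic(err)
	}

	leafNew, err := issueLeaf(rotatedCA, rotatedKey)
	if err != nil {
		panic(err)
	}

	fmt.Println("old CA only, new leaf trusted:", trusts(leafNew, oldCA))
	fmt.Println("both CAs, old leaf trusted:", trusts(leafOld, oldCA, rotatedCA))
	fmt.Println("both CAs, new leaf trusted:", trusts(leafNew, oldCA, rotatedCA))
}
```

The same reasoning applies to the joined `acceptedCAs` bundle: as long as the client trusts both CAs, it does not matter which of them issued the server certificate a node currently presents.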

@smira smira moved this from Approved to In Review in Planning Dec 29, 2025

Labels

integration/e2e-rotate-ca Triggers all e2e CA rotation tests for Omni status/ok-to-test

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

support Talos CA rotation

5 participants