Skip to content

OCPBUGS-43309: The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" command is failing#1900

Merged
openshift-merge-bot[bot] merged 2 commits intoopenshift:masterfrom
djoshy:mcs-tls-fix
Oct 30, 2024
Merged

Conversation

@djoshy
Copy link
Copy Markdown
Contributor

@djoshy djoshy commented Oct 17, 2024

Closes: OCPBUGS-43309

This ensures that the machine-config-server-tls secret is of the kubernetes.io/tls type before attempting to rotate them.

This is only an issue in 4.18+, as the vendored cert controller package was updated during the 1.31.1 rebase. (#1877, diff)

In the current master of library-go, the cert controller only permits secrets of the type kubernetes.io/tls to be rotated, and therefore this "preflight check" is necessary before running the controller sync.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 17, 2024
@openshift-ci-robot
Copy link
Copy Markdown

@djoshy: This pull request references Jira Issue OCPBUGS-43309, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Closes: OCPBUGS-43309

This ensures that the machine-config-server-tls secret is of the kubernetes.io/tls type before attempting to rotate them.

This is only an issue in 4.18+, as the vendored cert controller package was updated during the 1.31.1 rebase. (#1877, diff)

In the current master of library-go, the cert controller only permits secrets of the type kubernetes.io/tls to be rotated, and therefore this "preflight check" is necessary before running the controller sync.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy
Copy link
Copy Markdown
Contributor Author

djoshy commented Oct 23, 2024

/retest-required

Copy link
Copy Markdown
Member

@ardaguclu ardaguclu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped a few comments. I haven't checked the CI failures related or not
/retest

Comment thread pkg/cli/admin/ocpcertificates/regeneratemco/rotatecerts.go Outdated
func (o *RegenerateMCOOptions) ensureMCSSecretType(c *kubernetes.Clientset, ctx context.Context) error {
// Retrieve the machine-config-server-tls secret
mcsTLSSecret, err := c.CoreV1().Secrets(mcoNamespace).Get(ctx, mcsTlsSecretName, metav1.GetOptions{})
if err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that notfound error for this secret is not expected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not expected(nodes cannot join the cluster if this secret doesn't exist in the MCO namespace), but good callout - I think it might be good if the command accounted for this case

Comment thread pkg/cli/admin/ocpcertificates/regeneratemco/rotatecerts.go Outdated
Data: mcsTLSSecret.Data,
Type: corev1.SecretTypeTLS,
}
if _, err := c.CoreV1().Secrets(mcoNamespace).Create(ctx, newSecret, metav1.CreateOptions{}); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the consequence of deletion succeeds but creation fails. Because in that case we won't have this secret anymore?. Besides, we are not handling not found errors. Thus, this command will always start failing?.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing catastrophic. This secret will be created or updated by the cert controller's sync which is manually called later in this command:

if err := cont.Sync(ctx, syncCtx); err != nil {

What seems to be problematic is, if the secret fed into the controller is not of the kubernetes.io/tls type, it causes the cert controller sync to fail. This wasn't the case before the 1.31 rebase, but it seems like they've made it more strict now.

I have added handling for the not found error, but if we are missing this secret, the cluster may have bigger problems. Also, just for some additional context, this command is used to manually rotate a cert that has a 10 year lifecycle. We are hoping to automate this rotation in the future, but our use rate for this command is quite low at the moment.

@djoshy
Copy link
Copy Markdown
Contributor Author

djoshy commented Oct 29, 2024

/retest-required

@ardaguclu
Copy link
Copy Markdown
Member

Thank you
/lgtm
/retest

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Oct 30, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ardaguclu, djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2024
@djoshy
Copy link
Copy Markdown
Contributor Author

djoshy commented Oct 30, 2024

Holding for QE review

/hold

/retest-required

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2024
@sergiordlr
Copy link
Copy Markdown

Verified using IPI on AWS

$ oc -n openshift-machine-config-operator get secret  machine-config-server-tls
NAME                        TYPE     DATA   AGE
machine-config-server-tls   Opaque   2      44m

$ oc version
Client Version: 4.18.0-0.test-2024-10-30-093806-ci-ln-rmyszn2-latest
Kustomize Version: v5.4.2
Server Version: 4.18.0-0.test-2024-10-30-093806-ci-ln-rmyszn2-latest
Kubernetes Version: v1.31.2

$ oc  adm ocp-certificates regenerate-machine-config-server-serving-cert
Migration to kubernetes.io/tls for machine-config-server-tls required
Migration to kubernetes.io/tls for machine-config-server-tls successful
I1030 10:46:04.858487 4043545 recorder_logging.go:44] &Event{ObjectMeta:{dummy.18033530a38366a2  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:TargetUpdateRequired,Message:"machine-config-server-tls" in "openshift-machine-config-operator" requires a new target cert/key pair: missing notAfter,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-30 10:46:04.858402466 +0000 UTC m=+0.504114621,LastTimestamp:2024-10-30 10:46:04.858402466 +0000 UTC m=+0.504114621,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I1030 10:46:04.971233 4043545 recorder_logging.go:44] &Event{ObjectMeta:{dummy.18033530aa3bf370  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:SecretUpdated,Message:Updated Secret/machine-config-server-tls -n openshift-machine-config-operator because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-30 10:46:04.971160432 +0000 UTC m=+0.616872588,LastTimestamp:2024-10-30 10:46:04.971160432 +0000 UTC m=+0.616872588,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
Successfully rotated MCS CA + certs. Redeploying MCS and updating references.
Successfully modified user-data secret master-user-data
Successfully modified user-data secret worker-user-data


$ oc get secret  machine-config-server-tls
NAME                        TYPE                DATA   AGE
machine-config-server-tls   kubernetes.io/tls   2      22s

$ oc scale machineset sregidor-vrot-2jfgc-worker-us-east-2c --replicas 2
machineset.machine.openshift.io/sregidor-vrot-2jfgc-worker-us-east-2c scaled

$ oc get nodes
NAME                                        STATUS   ROLES                  AGE    VERSION
ip-10-0-23-49.us-east-2.compute.internal    Ready    control-plane,master   57m    v1.31.2
ip-10-0-26-215.us-east-2.compute.internal   Ready    worker                 52m    v1.31.2
ip-10-0-52-91.us-east-2.compute.internal    Ready    worker                 50m    v1.31.2
ip-10-0-53-225.us-east-2.compute.internal   Ready    control-plane,master   57m    v1.31.2
ip-10-0-70-5.us-east-2.compute.internal     Ready    control-plane,master   57m    v1.31.2
ip-10-0-80-176.us-east-2.compute.internal   Ready    worker                 50m    v1.31.2
ip-10-0-90-115.us-east-2.compute.internal   Ready    worker                 106s   v1.31.2   <--- new node

/label cherry-pick-approved

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 30, 2024

@sergiordlr: Can not set label cherry-pick-approved: Must be member in one of these teams: []

Details

In response to this:

Verified using IPI on AWS

$ oc -n openshift-machine-config-operator get secret  machine-config-server-tls
NAME                        TYPE     DATA   AGE
machine-config-server-tls   Opaque   2      44m

$ oc version
Client Version: 4.18.0-0.test-2024-10-30-093806-ci-ln-rmyszn2-latest
Kustomize Version: v5.4.2
Server Version: 4.18.0-0.test-2024-10-30-093806-ci-ln-rmyszn2-latest
Kubernetes Version: v1.31.2

$ oc  adm ocp-certificates regenerate-machine-config-server-serving-cert
Migration to kubernetes.io/tls for machine-config-server-tls required
Migration to kubernetes.io/tls for machine-config-server-tls successful
I1030 10:46:04.858487 4043545 recorder_logging.go:44] &Event{ObjectMeta:{dummy.18033530a38366a2  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:TargetUpdateRequired,Message:"machine-config-server-tls" in "openshift-machine-config-operator" requires a new target cert/key pair: missing notAfter,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-30 10:46:04.858402466 +0000 UTC m=+0.504114621,LastTimestamp:2024-10-30 10:46:04.858402466 +0000 UTC m=+0.504114621,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I1030 10:46:04.971233 4043545 recorder_logging.go:44] &Event{ObjectMeta:{dummy.18033530aa3bf370  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:SecretUpdated,Message:Updated Secret/machine-config-server-tls -n openshift-machine-config-operator because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-30 10:46:04.971160432 +0000 UTC m=+0.616872588,LastTimestamp:2024-10-30 10:46:04.971160432 +0000 UTC m=+0.616872588,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
Successfully rotated MCS CA + certs. Redeploying MCS and updating references.
Successfully modified user-data secret master-user-data
Successfully modified user-data secret worker-user-data


$ oc get secret  machine-config-server-tls
NAME                        TYPE                DATA   AGE
machine-config-server-tls   kubernetes.io/tls   2      22s

$ oc scale machineset sregidor-vrot-2jfgc-worker-us-east-2c --replicas 2
machineset.machine.openshift.io/sregidor-vrot-2jfgc-worker-us-east-2c scaled

$ oc get nodes
NAME                                        STATUS   ROLES                  AGE    VERSION
ip-10-0-23-49.us-east-2.compute.internal    Ready    control-plane,master   57m    v1.31.2
ip-10-0-26-215.us-east-2.compute.internal   Ready    worker                 52m    v1.31.2
ip-10-0-52-91.us-east-2.compute.internal    Ready    worker                 50m    v1.31.2
ip-10-0-53-225.us-east-2.compute.internal   Ready    control-plane,master   57m    v1.31.2
ip-10-0-70-5.us-east-2.compute.internal     Ready    control-plane,master   57m    v1.31.2
ip-10-0-80-176.us-east-2.compute.internal   Ready    worker                 50m    v1.31.2
ip-10-0-90-115.us-east-2.compute.internal   Ready    worker                 106s   v1.31.2   <--- new node

/label cherry-pick-approved

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@djoshy
Copy link
Copy Markdown
Contributor Author

djoshy commented Oct 30, 2024

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2024
@openshift-ci-robot
Copy link
Copy Markdown

/retest-required

Remaining retests: 0 against base HEAD 62471c2 and 2 for PR HEAD 833cc92 in total

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 30, 2024

@djoshy: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 8e9a2be into openshift:master Oct 30, 2024
@openshift-ci-robot
Copy link
Copy Markdown

@djoshy: Jira Issue OCPBUGS-43309: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-43309 has been moved to the MODIFIED state.

Details

In response to this:

Closes: OCPBUGS-43309

This ensures that the machine-config-server-tls secret is of the kubernetes.io/tls type before attempting to rotate them.

This is only an issue in 4.18+, as the vendored cert controller package was updated during the 1.31.1 rebase. (#1877, diff)

In the current master of library-go, the cert controller only permits secrets of the type kubernetes.io/tls to be rotated, and therefore this "preflight check" is necessary before running the controller sync.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Copy Markdown
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-cli
This PR has been included in build openshift-enterprise-cli-container-v4.18.0-202410302041.p0.g8e9a2be.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Copy Markdown
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-tools
This PR has been included in build ose-tools-container-v4.18.0-202410302041.p0.g8e9a2be.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Copy Markdown
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-deployer
This PR has been included in build openshift-enterprise-deployer-container-v4.18.0-202410302041.p0.g8e9a2be.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Copy Markdown
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cli-artifacts
This PR has been included in build ose-cli-artifacts-container-v4.18.0-202410302041.p0.g8e9a2be.assembly.stream.el9.
All builds following this will include this PR.

@djoshy djoshy deleted the mcs-tls-fix branch November 8, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants