
MCO-1482: pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition#4760

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from wking:drop-PoolUpdating-from-Upgradeable-calculation on Jan 8, 2025

Conversation

wking (Member) commented Dec 17, 2024

956e787 (#4012) had added the "this should no longer trigger when adding a node to a pool" comment, but unfortunately, it's still triggering. For example, in this serial 4.19 run:

```console
$ curl -s https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/build-log.txt | grep 'PoolUpdating' | sort | uniq
time="2024-12-16T01:43:52Z" level=info msg="operator status: processing event" event="Dec 16 00:55:35.662 W clusteroperator/machine-config condition/Upgradeable reason/PoolUpdating status/False One or more machine config pools are updating, please see `oc get mcp` for further details" operator=machine-config
```

Checking PromeCIeus, the `Upgradeable=False` window seems to have been 00:56 through 00:59, which correlates with the scale-up/scale-down of the serial suite:

```console
$ curl -s https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/build-log.txt | grep 'Managed cluster should grow and decrease when scaling different machineSets simultaneously'
started: 0/20/74 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m42s) 2024-12-16T00:57:49 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

Confirmed via the MCC logs:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/artifacts/e2e-gcp-ovn-serial-crun/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-6f4f46457c-v8b2l_machine-config-controller.log | grep rendered-
I1216 00:55:35.430231       1 node_controller.go:584] Pool worker[zone=us-central1-f]: node ci-op-k8c03v6z-9149a-r27w7-worker-f-t7rmb: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:35.430252       1 node_controller.go:584] Pool worker[zone=us-central1-f]: node ci-op-k8c03v6z-9149a-r27w7-worker-f-t7rmb: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:36.174629       1 node_controller.go:584] Pool worker[zone=us-central1-a]: node ci-op-k8c03v6z-9149a-r27w7-worker-a-f7hkj: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:36.174738       1 node_controller.go:584] Pool worker[zone=us-central1-a]: node ci-op-k8c03v6z-9149a-r27w7-worker-a-f7hkj: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:41.296273       1 node_controller.go:584] Pool worker[zone=us-central1-b]: node ci-op-k8c03v6z-9149a-r27w7-worker-b-554bt: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:41.296306       1 node_controller.go:584] Pool worker[zone=us-central1-b]: node ci-op-k8c03v6z-9149a-r27w7-worker-b-554bt: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:47.106173       1 node_controller.go:584] Pool worker[zone=us-central1-c]: node ci-op-k8c03v6z-9149a-r27w7-worker-c-hshj2: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:47.106201       1 node_controller.go:584] Pool worker[zone=us-central1-c]: node ci-op-k8c03v6z-9149a-r27w7-worker-c-hshj2: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
```

In this commit, I'm dropping the code that had been moving the ClusterOperator to Upgradeable=False on PoolUpdating entirely, instead of hoping that it doesn't trip. I haven't dug into why the code had still been tripping. But we want to stay Upgradeable=True while new nodes scale in, because clusters where nodes are joining should still be able to update to 4.(y+1). There are node-vs.-control-plane skew issues that should block updates to 4.(y+1), but they're enforced by the Kube API server operator (openshift/cluster-kube-apiserver-operator/pull/1199), and don't need the MCO chipping in.
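For context, the dropped logic amounted to deriving the ClusterOperator's Upgradeable condition from MachineConfigPool state. A minimal sketch of that general shape (hypothetical types and names for illustration, not the actual MCO source):

```go
package main

import "fmt"

// Pool is a hypothetical stand-in for a MachineConfigPool's status.
type Pool struct {
	Name     string
	Updating bool
}

// Condition is a hypothetical stand-in for a ClusterOperator status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// upgradeableCondition shows the behavior this PR removes: any updating
// pool (including one merely absorbing a freshly joined node) pushed the
// operator to Upgradeable=False with reason PoolUpdating. After this
// change, pool-update state no longer feeds the Upgradeable calculation.
func upgradeableCondition(pools []Pool) Condition {
	for _, p := range pools {
		if p.Updating {
			return Condition{
				Type:    "Upgradeable",
				Status:  "False",
				Reason:  "PoolUpdating",
				Message: "One or more machine config pools are updating, please see `oc get mcp` for further details",
			}
		}
	}
	return Condition{Type: "Upgradeable", Status: "True"}
}

func main() {
	// A worker pool briefly "updating" while a new node scales in.
	pools := []Pool{{Name: "master"}, {Name: "worker", Updating: true}}
	c := upgradeableCondition(pools)
	fmt.Println(c.Status, c.Reason) // prints: False PoolUpdating
}
```

The point of the change is that this mapping was too coarse: node scale-in is routine and should not gate cluster updates.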

- Description for the changelog

The machine-config ClusterOperator no longer goes Upgradeable=False on PoolUpdating when new Nodes join the cluster.

956e787 (Implement Upgrade-Monitor, FeatureGate, and
MachineConfigNode types, 2023-11-28, openshift#4012) had added the "this should
no longer trigger when adding a node to a pool" comment, but
unfortunately, it's still triggering.  For example, in [1]:

  $ curl -s https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/build-log.txt | grep 'PoolUpdating' | sort | uniq
  time="2024-12-16T01:43:52Z" level=info msg="operator status: processing event" event="Dec 16 00:55:35.662 W clusteroperator/machine-config condition/Upgradeable reason/PoolUpdating status/False One or more machine config pools are updating, please see `oc get mcp` for further details" operator=machine-config

Checking PromeCIeus, the Upgradeable=False window seems to have been
00:56 through 00:59, which correlates with the scale-up/scale-down of
the serial suite:

  $ curl -s https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/build-log.txt | grep 'Managed cluster should grow and decrease when scaling different machineSets simultaneously'
  started: 0/20/74 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
  passed: (5m42s) 2024-12-16T00:57:49 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"

confirmed via MCC logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/artifacts/e2e-gcp-ovn-serial-crun/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-6f4f46457c-v8b2l_machine-config-controller.log | grep rendered-
  I1216 00:55:35.430231       1 node_controller.go:584] Pool worker[zone=us-central1-f]: node ci-op-k8c03v6z-9149a-r27w7-worker-f-t7rmb: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:35.430252       1 node_controller.go:584] Pool worker[zone=us-central1-f]: node ci-op-k8c03v6z-9149a-r27w7-worker-f-t7rmb: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:36.174629       1 node_controller.go:584] Pool worker[zone=us-central1-a]: node ci-op-k8c03v6z-9149a-r27w7-worker-a-f7hkj: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:36.174738       1 node_controller.go:584] Pool worker[zone=us-central1-a]: node ci-op-k8c03v6z-9149a-r27w7-worker-a-f7hkj: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:41.296273       1 node_controller.go:584] Pool worker[zone=us-central1-b]: node ci-op-k8c03v6z-9149a-r27w7-worker-b-554bt: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:41.296306       1 node_controller.go:584] Pool worker[zone=us-central1-b]: node ci-op-k8c03v6z-9149a-r27w7-worker-b-554bt: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:47.106173       1 node_controller.go:584] Pool worker[zone=us-central1-c]: node ci-op-k8c03v6z-9149a-r27w7-worker-c-hshj2: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
  I1216 00:55:47.106201       1 node_controller.go:584] Pool worker[zone=us-central1-c]: node ci-op-k8c03v6z-9149a-r27w7-worker-c-hshj2: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2

In this commit, I'm dropping the code that had been moving the
ClusterOperator to Upgradeable=False on PoolUpdating entirely, instead
of hoping that it doesn't trip.  I haven't dug into why the code had
still been tripping.  But we want to stay Upgradeable=True while new
nodes scale in, because clusters where nodes are joining should still
be able to update to 4.(y+1).  There are node-vs.-control-plane skew
issues that should block updates to 4.(y+1), but they're enforced by
the Kube API server operator [2], and don't need the MCO chipping in.

[1]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712
[2]: openshift/cluster-kube-apiserver-operator@9ce4f74
wking (Member Author) commented Dec 17, 2024

Unit test failure seems unrelated to my change:

```console
$ curl -s https://storage.googleapis.com/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4760/pull-ci-openshift-machine-config-operator-master-unit/1868848364292935680/build-log.txt | grep 'build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab\|MachineConfig_changes_creates_a_new_MachineOSBuild'
=== RUN   TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
=== PAUSE TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
=== CONT  TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
I1217 02:53:50.070870   27103 wrappedqueue.go:249] Error executing "<kind: \"MachineOSConfig\", name: \"worker-os-config\", func: \"(*OSBuildController).addMachineOSConfig\">" in queue TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild: Adding MachineOSConfig "worker-os-config" failed: could not sync MachineOSConfigs: sync MachineOSConfigs failed: could not sync MachineOSConfig "worker-os-config": Syncing MachineOSConfig "worker-os-config" failed: could not create new or reuse existing MachineOSBuild for MachineOSConfig "worker-os-config": could not create new MachineOSBuild "worker-os-config-4b619479eb172ec79b53c7f66901964a": machineosbuilds.machineconfiguration.openshift.io "worker-os-config-4b619479eb172ec79b53c7f66901964a" already exists
I1217 02:53:50.213825   27103 jobimagebuilder.go:103] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" created for MachineOSBuild "worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.213847   27103 reconciler.go:380] Started new build build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild
I1217 02:53:50.214342   27103 reconciler.go:380] Started new build build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild
I1217 02:53:50.214353   27103 reconciler.go:792] Adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.216973   27103 reconciler.go:179] Adding build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.219137   27103 reconciler.go:792] Updating Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.219347   27103 jobimagebuilder.go:191] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" status {Conditions:[] StartTime:<nil> CompletionTime:<nil> Active:0 Succeeded:0 Failed:0 Terminating:<nil> CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:nil Ready:<nil>} mapped to MachineOSBuild progress "Prepared"
I1217 02:53:50.219425   27103 jobimagebuilder.go:191] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" status {Conditions:[] StartTime:<nil> CompletionTime:<nil> Active:0 Succeeded:1 Failed:0 Terminating:<nil> CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:nil Ready:<nil>} mapped to MachineOSBuild progress "Succeeded"
I1217 02:53:50.219873   27103 reconciler.go:795] Finished updating Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 1.240185ms
I1217 02:53:50.220931   27103 jobimagebuilder.go:266] Deleted build job build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild worker-os-config-57cd21cca292604d4624ef5c0f87d1ab
I1217 02:53:50.221571   27103 reconciler.go:792] Deleting Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.221700   27103 jobimagebuilder.go:191] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" status {Conditions:[] StartTime:<nil> CompletionTime:<nil> Active:0 Succeeded:1 Failed:0 Terminating:<nil> CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:nil Ready:<nil>} mapped to MachineOSBuild progress "Succeeded"
I1217 02:53:50.221812   27103 reconciler.go:200] Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" deleted
I1217 02:53:50.222112   27103 jobimagebuilder.go:103] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" created for MachineOSBuild "worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.222124   27103 reconciler.go:380] Started new build build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild
I1217 02:53:50.222271   27103 reconciler.go:795] Finished adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 8.142732ms
I1217 02:53:50.222772   27103 reconciler.go:792] Adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.222785   27103 reconciler.go:179] Adding build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.223267   27103 reconciler.go:795] Finished deleting Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 1.950558ms
I1217 02:53:50.224482   27103 reconciler.go:795] Finished adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 1.933926ms
=== NAME  TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
                Test:           TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
                Messages:       Build job build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab did not reach specified state%!(EXTRA string=Expected the build job %s to be deleted, string=build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab)
    --- FAIL: TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild (5.00s)
```

I also see the same test case failing in the unit tests of other pulls, such as this run.

wking (Member Author) commented Dec 17, 2024

The e2e-gcp-op failure is a build02 cluster issue, also unrelated to this pull:

```
error occurred handling build src-amd64: build didn't start running within 1h0m0s (phase: Pending): ...
```

openshift-ci bot (Contributor) commented Dec 17, 2024

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 377a78b | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/okd-scos-e2e-aws-ovn | 377a78b | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/unit | 377a78b | link | true | /test unit |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wking (Member Author) commented Dec 17, 2024

build02 had bumped into openshift/cincinnati-graph-data#6463, but has since recovered. Trying again:

/retest-required

wking (Member Author) commented Dec 17, 2024

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun

openshift-ci bot (Contributor) commented Dec 17, 2024

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0e7ac40-bcb0-11ef-8916-660e893ad4c7-0

wking (Member Author) commented Dec 17, 2024

The previous payload job had trouble with build02 scheduling. Trying again:

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun

openshift-ci bot (Contributor) commented Dec 17, 2024

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/939799e0-bcc5-11ef-9e7e-45e528fe631f-0

@wking wking changed the title pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition MCO-1482: pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition Dec 18, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 18, 2024
openshift-ci-robot (Contributor) commented Dec 18, 2024

@wking: This pull request references MCO-1482 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.


In response to this:

[the PR description, quoted above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

wking (Member Author) commented Dec 18, 2024

The new payload job failed, mostly on disruption that seems unrelated to my change. But PromeCIeus confirms the machine-config ClusterOperator was, as desired, Upgradeable=True the whole time, despite nodes scaling into the cluster:

```
max by (__name__, condition, reason) (cluster_operator_conditions{name="machine-config",condition="Upgradeable"})
or
max by (__name__, label_beta_kubernetes_io_instance_type) (cluster:node_instance_type_count:sum)
```

(screenshot: machine-config Upgradeable condition graphed alongside node instance-type counts)

yuqi-zhang (Contributor) left a review:

/lgtm

Looks like we haven't (non-cosmetically) updated that code in a while, so I'm fine with removing the check. Degrades are probably what we should care about most of the time.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jan 7, 2025
openshift-ci bot (Contributor) commented Jan 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 7, 2025
wking (Member Author) commented Jan 7, 2025

/retest-required

openshift-ci-robot (Contributor):

/retest-required

Remaining retests: 0 against base HEAD 599f6cd and 2 for PR HEAD 377a78b in total

@openshift-merge-bot openshift-merge-bot Bot merged commit df0b3ba into openshift:master Jan 8, 2025
@wking wking deleted the drop-PoolUpdating-from-Upgradeable-calculation branch January 8, 2025 01:43
openshift-bot (Contributor):

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.19.0-202501080413.p0.gdf0b3ba.assembly.stream.el9.
All builds following this will include this PR.

wking (Member Author) commented May 20, 2025

/cherrypick release-4.18

openshift-cherrypick-robot:

@wking: new pull request created: #5065


In response to this:

/cherrypick release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wking added a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request May 20, 2025
In 4.19:

* 377a78b (pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition, 2024-12-16, openshift#4760).
* 0c21907 (pkg/operator/status: Drop kubelet skew guard, 2025-04-03, openshift#4970).

But in 4.18, we're using the other order:

* 13cceb0 (pkg/operator/status: Drop kubelet skew guard, add RHEL guard, 2025-03-26, openshift#4956).
* 20fe075 (pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition, 2024-12-16, openshift#5065).

So I'm adding this follow-up commit within openshift#5065 to remove the
'updating' variable that both the kubelet-skew-guard and the
PoolUpdating guard had used, but which we no longer need now that both
are gone in 4.18.
wking (Member Author) commented Jun 5, 2025

/cherrypick release-4.17

openshift-cherrypick-robot:

@wking: new pull request created: #5111


In response to this:

/cherrypick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
