MCO-1482: pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition #4760
Conversation
956e787 (Implement Upgrade-Monitor, FeatureGate, and MachineConfigNode types, 2023-11-28, openshift#4012) had added the "this should no longer trigger when adding a node to a pool" comment, but unfortunately, it's still triggering. For example, in [1]:

```console
$ curl -s https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/build-log.txt | grep 'PoolUpdating' | sort | uniq
time="2024-12-16T01:43:52Z" level=info msg="operator status: processing event" event="Dec 16 00:55:35.662 W clusteroperator/machine-config condition/Upgradeable reason/PoolUpdating status/False One or more machine config pools are updating, please see `oc get mcp` for further details" operator=machine-config
```

Checking PromeCIeus, the `Upgradeable=False` window seems to have been 00:56 through 00:59, which correlates with the scale-up/scale-down of the serial suite:

```console
$ curl -s https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/build-log.txt | grep 'Managed cluster should grow and decrease when scaling different machineSets simultaneously'
started: 0/20/74 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m42s) 2024-12-16T00:57:49 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

Confirmed via MCC logs:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712/artifacts/e2e-gcp-ovn-serial-crun/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-6f4f46457c-v8b2l_machine-config-controller.log | grep rendered-
I1216 00:55:35.430231 1 node_controller.go:584] Pool worker[zone=us-central1-f]: node ci-op-k8c03v6z-9149a-r27w7-worker-f-t7rmb: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:35.430252 1 node_controller.go:584] Pool worker[zone=us-central1-f]: node ci-op-k8c03v6z-9149a-r27w7-worker-f-t7rmb: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:36.174629 1 node_controller.go:584] Pool worker[zone=us-central1-a]: node ci-op-k8c03v6z-9149a-r27w7-worker-a-f7hkj: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:36.174738 1 node_controller.go:584] Pool worker[zone=us-central1-a]: node ci-op-k8c03v6z-9149a-r27w7-worker-a-f7hkj: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:41.296273 1 node_controller.go:584] Pool worker[zone=us-central1-b]: node ci-op-k8c03v6z-9149a-r27w7-worker-b-554bt: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:41.296306 1 node_controller.go:584] Pool worker[zone=us-central1-b]: node ci-op-k8c03v6z-9149a-r27w7-worker-b-554bt: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:47.106173 1 node_controller.go:584] Pool worker[zone=us-central1-c]: node ci-op-k8c03v6z-9149a-r27w7-worker-c-hshj2: changed annotation machineconfiguration.openshift.io/currentConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
I1216 00:55:47.106201 1 node_controller.go:584] Pool worker[zone=us-central1-c]: node ci-op-k8c03v6z-9149a-r27w7-worker-c-hshj2: changed annotation machineconfiguration.openshift.io/desiredConfig = rendered-worker-6d0e61dc44f24db3272625b901024ed2
```

In this commit, I'm dropping the code that had been moving the ClusterOperator to `Upgradeable=False` on `PoolUpdating` entirely, instead of hoping that it doesn't trip. I haven't dug into why the code had still been tripping. But we want to stay `Upgradeable=True` while new nodes scale in, because clusters where nodes are joining should still be able to update to 4.(y+1). There are node-vs.-control-plane skew issues that should block updates to 4.(y+1), but they're enforced by the Kube API server operator [2], and don't need the MCO chipping in.

[1]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun/1868424902256627712
[2]: openshift/cluster-kube-apiserver-operator@9ce4f74
---
Unit test failure seems unrelated to my change:

```console
$ curl -s https://storage.googleapis.com/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4760/pull-ci-openshift-machine-config-operator-master-unit/1868848364292935680/build-log.txt | grep 'build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab\|MachineConfig_changes_creates_a_new_MachineOSBuild'
=== RUN   TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
=== PAUSE TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
=== CONT  TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
I1217 02:53:50.070870 27103 wrappedqueue.go:249] Error executing "<kind: \"MachineOSConfig\", name: \"worker-os-config\", func: \"(*OSBuildController).addMachineOSConfig\">" in queue TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild: Adding MachineOSConfig "worker-os-config" failed: could not sync MachineOSConfigs: sync MachineOSConfigs failed: could not sync MachineOSConfig "worker-os-config": Syncing MachineOSConfig "worker-os-config" failed: could not create new or reuse existing MachineOSBuild for MachineOSConfig "worker-os-config": could not create new MachineOSBuild "worker-os-config-4b619479eb172ec79b53c7f66901964a": machineosbuilds.machineconfiguration.openshift.io "worker-os-config-4b619479eb172ec79b53c7f66901964a" already exists
I1217 02:53:50.213825 27103 jobimagebuilder.go:103] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" created for MachineOSBuild "worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.213847 27103 reconciler.go:380] Started new build build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild
I1217 02:53:50.214342 27103 reconciler.go:380] Started new build build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild
I1217 02:53:50.214353 27103 reconciler.go:792] Adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.216973 27103 reconciler.go:179] Adding build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.219137 27103 reconciler.go:792] Updating Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.219347 27103 jobimagebuilder.go:191] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" status {Conditions:[] StartTime:<nil> CompletionTime:<nil> Active:0 Succeeded:0 Failed:0 Terminating:<nil> CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:nil Ready:<nil>} mapped to MachineOSBuild progress "Prepared"
I1217 02:53:50.219425 27103 jobimagebuilder.go:191] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" status {Conditions:[] StartTime:<nil> CompletionTime:<nil> Active:0 Succeeded:1 Failed:0 Terminating:<nil> CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:nil Ready:<nil>} mapped to MachineOSBuild progress "Succeeded"
I1217 02:53:50.219873 27103 reconciler.go:795] Finished updating Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 1.240185ms
I1217 02:53:50.220931 27103 jobimagebuilder.go:266] Deleted build job build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild worker-os-config-57cd21cca292604d4624ef5c0f87d1ab
I1217 02:53:50.221571 27103 reconciler.go:792] Deleting Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.221700 27103 jobimagebuilder.go:191] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" status {Conditions:[] StartTime:<nil> CompletionTime:<nil> Active:0 Succeeded:1 Failed:0 Terminating:<nil> CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:nil Ready:<nil>} mapped to MachineOSBuild progress "Succeeded"
I1217 02:53:50.221812 27103 reconciler.go:200] Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" deleted
I1217 02:53:50.222112 27103 jobimagebuilder.go:103] Build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" created for MachineOSBuild "worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.222124 27103 reconciler.go:380] Started new build build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab for MachineOSBuild
I1217 02:53:50.222271 27103 reconciler.go:795] Finished adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 8.142732ms
I1217 02:53:50.222772 27103 reconciler.go:792] Adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.222785 27103 reconciler.go:179] Adding build job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab"
I1217 02:53:50.223267 27103 reconciler.go:795] Finished deleting Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 1.950558ms
I1217 02:53:50.224482 27103 reconciler.go:795] Finished adding Job "build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab" after 1.933926ms
=== NAME  TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
    Test: TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild
    Messages: Build job build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab did not reach specified state%!(EXTRA string=Expected the build job %s to be deleted, string=build-worker-os-config-57cd21cca292604d4624ef5c0f87d1ab)
--- FAIL: TestOSBuildController/MachineConfig_changes_creates_a_new_MachineOSBuild (5.00s)
```

And I also see that same test-case failing in the unit tests of other pulls, such as this run.
---

The e2e-gcp-op failure is a build02 cluster issue, also unrelated to my pull.

---
@wking: The following tests failed:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

---
build02 had bumped into openshift/cincinnati-graph-data#6463, but has since been recovered. Trying again:

/retest-required

---
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun

---
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0e7ac40-bcb0-11ef-8916-660e893ad4c7-0

---
Previous payload job had trouble with build02 scheduling. Trying again:

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial-crun

---
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/939799e0-bcc5-11ef-9e7e-45e528fe631f-0

---
@wking: This pull request references MCO-1482, which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

---
New payload job failed, mostly on disruption that seems unrelated to my change. But PromeCIeus confirms the
yuqi-zhang left a comment:
/lgtm
Looks like we haven't (non-cosmetically) updated that code in a while, so I'm fine with removing the check. Degrades are probably what we should care about most of the time.
---
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.

---
/retest-required

---
[ART PR BUILD NOTIFIER] Distgit: ose-machine-config-operator

---
/cherrypick release-4.18

---
@wking: new pull request created: #5065

---
In 4.19:

* 377a78b (pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition, 2024-12-16, openshift#4760).
* 0c21907 (pkg/operator/status: Drop kubelet skew guard, 2025-04-03, openshift#4970).

But in 4.18, we're using the other order:

* 13cceb0 (pkg/operator/status: Drop kubelet skew guard, add RHEL guard, 2025-03-26, openshift#4956).
* 20fe075 (pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition, 2024-12-16, openshift#5065).

So I'm adding this follow-up commit within openshift#5065 to remove the `updating` variable that both the kubelet skew guard and the PoolUpdating guard had used, but which we no longer need now that both are gone in 4.18.

---
/cherrypick release-4.17

---
@wking: new pull request created: #5111

---

956e787 (#4012) had added the "this should no longer trigger when adding a node to a pool" comment, but unfortunately, it's still triggering. For example, in this serial 4.19 run:
Confirmed via MCC logs:
In this commit, I'm dropping the code that had been moving the ClusterOperator to `Upgradeable=False` on `PoolUpdating` entirely, instead of hoping that it doesn't trip. I haven't dug into why the code had still been tripping. But we want to stay `Upgradeable=True` while new nodes scale in, because clusters where nodes are joining should still be able to update to 4.(y+1). There are node-vs.-control-plane skew issues that should block updates to 4.(y+1), but they're enforced by the Kube API server operator (openshift/cluster-kube-apiserver-operator/pull/1199), and don't need the MCO chipping in.

Description for the changelog:
The machine-config ClusterOperator no longer goes `Upgradeable=False` on `PoolUpdating` when new Nodes join the cluster.