Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well #1731
|
@wking: This pull request references Bugzilla bug 1834895, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
|
AddFunc docs in [1,2]. We've been setting an UpdateFunc on this informer since we pivoted to node informers in d67e633 (daemon: use informers to check for updates, 2018-10-26, openshift#130). But we need to listen for add events to catch situations where the node's desiredConfig is altered before the informer's initial List call. This avoids situations like [3,4]:

  I0514 03:27:54.773485 4242 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:42:37.392240 4242 request.go:1068] Response Body: {"kind":"Node","apiVersion":"v1","metadata":{"name":"ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/desiredConfig":"rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44"...
  ...
  I0514 03:43:18.111048 1897 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:43:21.844234 1897 request.go:1068] Response Body: {"kind":"NodeList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/nodes","resourceVersion":"28107"},"items":[{"metadata":{"name":"ci-op-9gpgc-m-0.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/currentConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9","machineconfiguration.openshift.io/desiredConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9"...
  ...
  I0514 03:48:15.568762 1897 daemon.go:1119] Updating Node ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:48:16.569041 1897 daemon.go:368] Started syncing node "ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal" (2020-05-14 03:48:16.569022688 +0>
  I0514 03:48:16.590279 1897 daemon.go:767] Current config: rendered-worker-a048da8ad1c08dce6dc6aa3c1101fb44
  I0514 03:48:16.590302 1897 daemon.go:768] Desired config: rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44

That was:

1. Somebody sets desiredConfig to rendered-node-log-level-... while the old MCD (PID 4242) is still running.
2. New MCD (PID 1897) comes up, and fires up its Node informer.
3. Informer does a List while starting up, and sees the node with desiredConfig set to rendered-node-log-level-... Goes to call AddFunc, but until this commit we weren't setting it, so MCD does nothing.
4. Informer sets up its Watch.
5. Time passes...
6. Something else changes the managed Node (kubelet heartbeat? Who cares?), Watch trips, UpdateFunc called.
7. MCD notices the desiredConfig and begins applying rendered-node-log-level-...

With this commit, we'll react at step 3 instead of 7, hopefully fixing [5].

[1]: https://pkg.go.dev/k8s.io/client-go/tools/cache?tab=doc#ResourceEventHandlerFuncs
[2]: https://pkg.go.dev/k8s.io/client-go/tools/cache?tab=doc#ResourceEventHandlerFuncs.OnAdd
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1730/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2250
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1730/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2250/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-pq5wv_machine-config-daemon.log
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1834895
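For readers who want to see the ordering problem in isolation, here is a rough stdlib-only sketch of the sequence described in the commit message. It is not the MCD code and it does not use client-go: `HandlerFuncs` only borrows the field names of `cache.ResourceEventHandlerFuncs`, and `toyInformer`, `firstReaction`, the node name `worker-0`, and the config names `rendered-worker-a`/`rendered-worker-b` are all hypothetical. The toy informer assumes, as the commit message says of the real one, that objects found by the initial List are delivered as add events and later Watch changes as update events.

```go
package main

import "fmt"

// Node is a pared-down, hypothetical stand-in for a Kubernetes Node
// carrying only the two MCD annotations discussed above.
type Node struct {
	Name          string
	CurrentConfig string
	DesiredConfig string
}

// HandlerFuncs borrows field names from client-go's
// cache.ResourceEventHandlerFuncs; it is a toy type, not the real one.
type HandlerFuncs struct {
	AddFunc    func(obj *Node)
	UpdateFunc func(oldObj, newObj *Node)
}

// toyInformer (hypothetical) replays the initial List as add events and
// then the Watch changes as update events, skipping unset handlers.
type toyInformer struct {
	handlers HandlerFuncs
}

func (i *toyInformer) Run(initial []*Node, updates [][2]*Node) {
	for _, n := range initial {
		if i.handlers.AddFunc != nil {
			i.handlers.AddFunc(n)
		}
	}
	for _, u := range updates {
		if i.handlers.UpdateFunc != nil {
			i.handlers.UpdateFunc(u[0], u[1])
		}
	}
}

// firstReaction reports which event type first noticed that the node's
// desired config differs from its current config.
func firstReaction(withAddFunc bool) string {
	reaction := ""
	sync := func(event string, n *Node) {
		if reaction == "" && n.DesiredConfig != n.CurrentConfig {
			reaction = event
		}
	}
	h := HandlerFuncs{
		UpdateFunc: func(_, newObj *Node) { sync("update", newObj) },
	}
	if withAddFunc {
		h.AddFunc = func(obj *Node) { sync("add", obj) }
	}
	// desiredConfig was already changed before the informer's initial List.
	node := &Node{Name: "worker-0", CurrentConfig: "rendered-worker-a", DesiredConfig: "rendered-worker-b"}
	// A later, unrelated change (heartbeat-style) that trips the Watch.
	heartbeat := *node
	inf := &toyInformer{handlers: h}
	inf.Run([]*Node{node}, [][2]*Node{{node, &heartbeat}})
	return reaction
}

func main() {
	fmt.Println("without AddFunc:", firstReaction(false))
	fmt.Println("with AddFunc:", firstReaction(true))
}
```

In this model the handler without `AddFunc` only reacts once the unrelated update arrives (step 7), while the handler with `AddFunc` reacts on the initial List (step 3).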
f6b34d9 to 45e3558
|
wohoo! it succeeded 🎉 |
|
@wking: The following test failed, say
|
|
Thanks so much for debugging this! /approve |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari, wking |
|
@wking: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged: