Skip to content

Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well#1731

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
wking:node-informer-AddFunc
May 14, 2020
Merged

Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well#1731
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
wking:node-informer-AddFunc

Conversation

@wking
Copy link
Copy Markdown
Member

@wking wking commented May 14, 2020

AddFunc docs here and here. We've been setting an UpdateFunc on this informer since we pivoted to node informers in d67e633 (#130). But we need to listen for add events to catch situations where the node's desiredConfig is altered before the informer's initial List call. This avoids situtations like:

I0514 03:27:54.773485    4242 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
...
I0514 03:42:37.392240    4242 request.go:1068] Response Body: {"kind":"Node","apiVersion":"v1","metadata":{"name":"ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/desiredConfig":"rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44"...
...
I0514 03:43:18.111048    1897 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
...
I0514 03:43:21.844234    1897 request.go:1068] Response Body: {"kind":"NodeList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/nodes","resourceVersion":"28107"},"items":[{"metadata":{"name":"ci-op-9gpgc-m-0.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/currentConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9","machineconfiguration.openshift.io/desiredConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9"...
...
I0514 03:48:15.568762    1897 daemon.go:1119] Updating Node ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
...
I0514 03:48:16.569041    1897 daemon.go:368] Started syncing node "ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal" (2020-05-14 03:48:16.569022688 +0>
I0514 03:48:16.590279    1897 daemon.go:767] Current config: rendered-worker-a048da8ad1c08dce6dc6aa3c1101fb44
I0514 03:48:16.590302    1897 daemon.go:768] Desired config: rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44

That was:

  1. Somebody sets desiredConfig to rendered-node-log-level-... while the old MCD (PID 4242) is still running.
  2. New MCD (PID 1897) comes up, and fires up its Node informer.
  3. Informer does a List while starting up, and sees the node with desiredConfig set to rendered-node-log-level-.... Goes to call AddFunc, but until this PR we weren't setting it, so MCD does nothing.
  4. Informer sets up its Watch.
  5. Time passes...
  6. Something else changes the managed Node (kubelet heartbeat? Who cares?), Watch trips, UpdateFunc called.
  7. MCD notices the desiredConfig and begins applying rendered-node-log-level-....

With this PR, we'll react at step 3 instead of 7, hopefully fixing rhbz#1834895.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 14, 2020
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@wking: This pull request references Bugzilla bug 1834895, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Copy Markdown
Member Author

wking commented May 14, 2020

This commit is also on the tail of the hacky log enhancement in #1730, so if it fails, the logs from the #1730 jobs might help show why.

AddFunc docs in [1,2].  We've been setting an UpdateFunc on this
informer since we pivoted to node informers in d67e633 (daemon: use
informers to check for updates, 2018-10-26, openshift#130).  But we need to
listen for add events to catch situations where the node's
desiredConfig is altered before the informer's initial List call.
This avoids situtations like [3,4]:

  I0514 03:27:54.773485    4242 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:42:37.392240    4242 request.go:1068] Response Body: {"kind":"Node","apiVersion":"v1","metadata":{"name":"ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/desiredConfig":"rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44"...
  ...
  I0514 03:43:18.111048    1897 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:43:21.844234    1897 request.go:1068] Response Body: {"kind":"NodeList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/nodes","resourceVersion":"28107"},"items":[{"metadata":{"name":"ci-op-9gpgc-m-0.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/currentConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9","machineconfiguration.openshift.io/desiredConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9"...
  ...
  I0514 03:48:15.568762    1897 daemon.go:1119] Updating Node ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:48:16.569041    1897 daemon.go:368] Started syncing node "ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal" (2020-05-14 03:48:16.569022688 +0>
  I0514 03:48:16.590279    1897 daemon.go:767] Current config: rendered-worker-a048da8ad1c08dce6dc6aa3c1101fb44
  I0514 03:48:16.590302    1897 daemon.go:768] Desired config: rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44

That was:

1. Somebody sets desiredConfig to rendered-node-log-level-... while
   the old MCD (PID 4242) is still running.
2. New MCD (PID 1897) comes up, and fires up its Node informer.
3. Informer does a List while starting up, and sees the node with
   desiredConfig set to rendered-node-log-level-...  Goes to call
   AddFunc, but until this commit we weren't setting it, so MCD does
   nothing.
4. Informer sets up its Watch.
5. Time passes...
6. Something else changes the managed Node (kubelet heartbeat?  Who cares?), Watch trips, UpdateFunc called.
7. MCD notices the desiredConfig and begins applying rendered-node-log-level-...

With this commit, we'll react at step 3 instead of 7, hopefully fixing
[5].

[1]: https://pkg.go.dev/k8s.io/client-go/tools/cache?tab=doc#ResourceEventHandlerFuncs
[2]: https://pkg.go.dev/k8s.io/client-go/tools/cache?tab=doc#ResourceEventHandlerFuncs.OnAdd
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1730/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2250
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1730/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2250/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-pq5wv_machine-config-daemon.log
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1834895
@wking wking force-pushed the node-informer-AddFunc branch from f6b34d9 to 45e3558 Compare May 14, 2020 05:38
@sinnykumari
Copy link
Copy Markdown
Contributor

wohoo! it succeeded 🎉
/retest

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@wking: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 45e3558 link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Copy Markdown
Member

Thanks so much for debugging this!

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2020
@sinnykumari
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2020
@openshift-merge-robot openshift-merge-robot merged commit a2681ad into openshift:master May 14, 2020
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@wking: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:

Details

In response to this:

Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the node-informer-AddFunc branch May 14, 2020 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants