bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot by runcom · Pull Request #952 · openshift/machine-config-operator

runcom · 2019-07-11T13:28:49Z

Signed-off-by: Antonio Murdaca <runcom@linux.com>

- What I did

WIP mainly because I want to test it on a live cluster by forcing a very low grace period, so that the kubelet kills the MCD.

This patch removes the permanent failure we hit when:

* we were draining
* we exceeded the 600s terminationGracePeriod
* the kubelet killed us (signal 9)

The above can be deduced from:

* a pending config being on disk
* bootID equality

In that case, we can go ahead and re-kick the drain+reboot routine until it eventually succeeds (modulo: as Colin said in https://bugzilla.redhat.com/show_bug.cgi?id=1728873, we're in a permanent situation where we can't really drain forever).
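The detection described above can be sketched in Go. This is a minimal illustration, not the code in this PR: the `nodeState` type, its field names, and `interruptedBeforeReboot` are hypothetical, and it assumes the boot ID was recorded alongside the pending config when the update started.

```go
package main

import "fmt"

// nodeState is a hypothetical snapshot of what the daemon can observe at startup.
type nodeState struct {
	PendingConfig string // name of the pending config found on disk; "" if none
	PendingBootID string // boot ID recorded when the pending config was written (assumption)
	CurrentBootID string // boot ID of the currently running system
}

// interruptedBeforeReboot reports whether an update was started but the node
// never rebooted: a pending config exists and the boot ID has not changed.
func interruptedBeforeReboot(s nodeState) bool {
	return s.PendingConfig != "" && s.PendingBootID == s.CurrentBootID
}

func main() {
	cases := []nodeState{
		// Killed mid-drain: pending config on disk, same boot.
		{PendingConfig: "rendered-worker-new", PendingBootID: "boot-a", CurrentBootID: "boot-a"},
		// Normal path: the node rebooted, so the boot ID differs.
		{PendingConfig: "rendered-worker-new", PendingBootID: "boot-a", CurrentBootID: "boot-b"},
		// No update in progress.
		{},
	}
	for _, c := range cases {
		fmt.Printf("pending=%q -> retry drain+reboot: %t\n",
			c.PendingConfig, interruptedBeforeReboot(c))
	}
}
```

Only the first case would lead to re-kicking the drain+reboot routine; the other two follow the usual startup path.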

- How to verify it

testing it manually for now

- Description for the changelog

cgwalters · 2019-07-11T13:37:08Z

These seem like distinct cases. If we failed to drain, we should easily be able to detect that.

Failure to reboot is where we get killed by kube.

Uhm, according to the logs, we were draining; it just took too long, so we were in the middle of a drain and couldn't even get past the Drain function to reboot. That's also where we get SIGKILLed by the kubelet.

A normal failure to drain doesn't trigger the BZ: if we get a Drain error, we just roll back everything and retry. This is about being stuck at drain and getting killed for exceeding the 600s termination grace period.

runcom · 2019-07-18T16:53:52Z

dropped wip, I think this can go through a round of review

cgwalters · 2019-07-18T17:44:54Z

Sorry, can you walk me through this one a bit more? You say:

> we were draining
> we exceeded 600s terminationGracePeriod

But our drain shouldn't kill the MCD as it's a daemonset, which we ignore, right? Or does that "ignore" mean "try and drain but don't wait for it"?

runcom · 2019-07-18T17:51:17Z

> Sorry, can you walk me through this one a bit more? You say:
>
> > we were draining
> > we exceeded 600s terminationGracePeriod
>
> But our drain shouldn't kill the MCD as it's a daemonset, which we ignore, right? Or does that "ignore" mean "try and drain but don't wait for it"?

The issue here is specific to "doing something in the MCD after it has been asked to shut down", and I'm not sure who asked for that (I can't really tell from the logs, but will keep investigating). The logs show that the MCD was asked to shut down while it was still draining (again, not sure who asked for that); after 600s we were killed by the kubelet, and we now have the same bootID because we didn't reboot.

cgwalters · 2019-07-18T17:54:18Z

> The logs show that the MCD has been requested to shut down but it was still draining (again not sure who asked for that) and after 600s we've been killed by kubelet and we now have the same bootid cause we didn't reboot.

Hmmm. OK, it's making more sense. Maybe it's something like someone manually doing systemctl restart kubelet? It's not the admin doing reboot directly because then we'd reboot right?

runcom · 2019-07-18T17:54:32Z

These are the relevant logs (you can see the pid changed as well for the MCD):

Jul 02 00:05:46 ip-192-168-179-98 root[106368]: machine-config-daemon[98533]: controller syncing started
Jul 02 00:05:46 ip-192-168-179-98 root[106369]: machine-config-daemon[98533]: Starting update from rendered-worker-7f899c2c109ce7eea3cde169e3f51092 to rendered-worker-71095e466f5b906cdbcc9ff8cb04414c
Jul 02 00:05:46 ip-192-168-179-98 root[106371]: machine-config-daemon[98533]: Update prepared; beginning drain
Jul 02 00:22:14 ip-192-168-179-98 root[127189]: machine-config-daemon[127147]: Starting to manage node: ip-192-168-179-98.us-west-1.compute.internal

cgwalters · 2019-07-18T17:55:26Z

Side note, will conflict some with #935 but oh well.

runcom · 2019-07-18T17:56:09Z

> Hmmm. OK, it's making more sense. Maybe it's something like someone manually doing systemctl restart kubelet? It's not the admin doing reboot directly because then we'd reboot right?

Could indeed be, or did the kubelet shut down on its own? Is there any way to check the journal for a systemctl restart kubelet?

runcom · 2019-07-18T18:04:24Z

Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.431717   95974 kubelet.go:1931] SyncLoop (DELETE, "api"): "machine-config-daemon-cs5dk_openshift-machine-config-operator(2327da1d-8245-11e9-88d5-022e44d1f502)"
Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.431921   95974 kubelet_pods.go:1329] Generating status for "machine-config-daemon-cs5dk_openshift-machine-config-operator(2327da1d-8245-11e9-88d5-022e44d1f502)"
Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.432168   95974 kuberuntime_container.go:559] Killing container "cri-o://e10d9f13a3927c3a239264f4edd81e0a73fbf6de8c10778f0acf2e4f011c384d" with 600 second grace period
Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.432341   95974 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"machine-config-daemon-cs5dk", UID:"2327da1d-8245-11e9-88d5-022e44d1f502", APIVersion:"v1", ResourceVersion:"19110157", FieldPath:"spec.containers{machine-config-daemon}"}): type: 'Normal' reason: 'Killing' Stopping container machine-config-daemon

runcom · 2019-07-18T18:19:31Z

So there's no sign of an admin rebooting. Rather, something interesting popped up: we were asked to shut down ~300s after starting the drain.

runcom · 2019-07-18T18:25:29Z

> so there's no sign of admin rebooting - rather, something interesting popped up, we've been asked to shut down after ~300s from starting drain

Meanwhile, I'm speaking to Ravi and others to understand how to grab useful logs showing why the kubelet wants us to shut down.

Besides, I think this patch does make sense as a stopgap/best-effort reconciliation. @cgwalters wdyt?

cgwalters · 2019-07-18T18:31:47Z

/lgtm

openshift-ci-robot · 2019-07-18T18:32:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters,runcom]

runcom · 2019-07-18T21:11:24Z

/retest

openshift-bot · 2019-07-18T23:45:49Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-07-19T02:09:06Z

/retest

openshift-bot · 2019-07-19T07:20:37Z

/retest

runcom · 2019-07-19T08:47:44Z

/retest

runcom · 2019-07-19T10:24:25Z

/retest

openshift-bot · 2019-07-19T13:28:28Z

/retest

runcom · 2019-07-19T17:23:23Z

/retest

runcom · 2019-07-19T18:41:55Z

/retest

runcom · 2019-07-19T20:05:47Z

/retest

runcom · 2019-07-20T04:39:59Z

/retest

openshift-ci-robot · 2019-07-22T18:02:16Z

@runcom: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.

openshift-ci-robot requested review from cgwalters and ericavonb July 11, 2019 13:29

cgwalters reviewed Jul 11, 2019

runcom force-pushed the killed-kube-reconciile branch 2 times, most recently from c9bba5a to cd104b4 Compare July 11, 2019 15:39

pkg/daemon: reconcile being killed by kube on drain+reboot

1112b30

Signed-off-by: Antonio Murdaca <runcom@linux.com>

runcom force-pushed the killed-kube-reconciile branch from cd104b4 to 1112b30 Compare July 12, 2019 13:01

runcom changed the title ~~WIP: pkg/daemon: reconcile being killed by kube on drain+reboot~~ pkg/daemon: reconcile being killed by kube on drain+reboot Jul 18, 2019

openshift-ci-robot removed the do-not-merge/work-in-progress label Jul 18, 2019

openshift-ci-robot assigned cgwalters Jul 18, 2019

openshift-ci-robot added the lgtm label Jul 18, 2019

runcom commented Jul 18, 2019

Comment thread pkg/daemon/daemon.go

openshift-merge-robot merged commit bf51921 into openshift:master Jul 20, 2019

runcom deleted the killed-kube-reconciile branch July 20, 2019 09:30

LorbusChris mentioned this pull request Jul 22, 2019

Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot #995

Merged

eparis changed the title ~~pkg/daemon: reconcile being killed by kube on drain+reboot~~ bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot Jul 22, 2019
