Skip to content

bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot#952

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
runcom:killed-kube-reconciile
Jul 20, 2019
Merged

bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot#952
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
runcom:killed-kube-reconciile

Conversation

@runcom
Copy link
Copy Markdown
Member

@runcom runcom commented Jul 11, 2019

Signed-off-by: Antonio Murdaca runcom@linux.com

- What I did

WIP mainly because I want to test it out on a live cluster by forcing a grace period of something really low to kick the kubelet killing the MCD

This patch removes the perma-failure we have in case:

  • we were draining
  • we exceeded 600s terminationGracePeriod
  • the kubelet killed (9) us

The above can be deducted by:

  • pending config is on disk
  • bootID equality

In such case, we can go ahead and re-kick and drain+reboot routine till it eventually succedes (modulo: as Colin said in https://bugzilla.redhat.com/show_bug.cgi?id=1728873 we're in a permanent situation where we can't really drain forever)

- How to verify it

testing it manually for now

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 11, 2019
Comment thread pkg/daemon/daemon.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like distinct cases. If we failed to drain, we should easily be able to detect that.

Failure to reboot is where we get killed by kube.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm according to logs, we were draining, it just took too long so we're in the middle of a drain and can't even get past the Drain function to reboot. That's where we also get sigkill'ed by kubelet

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a normal failure to drain doesn't incur in the BZ, if we get a Drain error, we just rollback everything and retry, this is about being stuck at drain and getting killed by exceeding the 600s grace term period

@runcom runcom force-pushed the killed-kube-reconciile branch 2 times, most recently from c9bba5a to cd104b4 Compare July 11, 2019 15:39
Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom runcom force-pushed the killed-kube-reconciile branch from cd104b4 to 1112b30 Compare July 12, 2019 13:01
@runcom runcom changed the title WIP: pkg/daemon: reconcile being killed by kube on drain+reboot pkg/daemon: reconcile being killed by kube on drain+reboot Jul 18, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 18, 2019
@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

dropped wip, I think this can go through a round of review

@cgwalters
Copy link
Copy Markdown
Member

Sorry can you walk me through this one a bit more? You say

we were draining
we exceeded 600s terminationGracePeriod

But our drain shouldn't kill the MCD as it's a daemonset, which we ignore right? Or does that "ignore" mean "try and drain but don't wait for it"?

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

Sorry can you walk me through this one a bit more? You say

we were draining
we exceeded 600s terminationGracePeriod

But our drain shouldn't kill the MCD as it's a daemonset, which we ignore right? Or does that "ignore" mean "try and drain but don't wait for it"?

The issue here is specific to "doing something in MCD after being requested to shutdown (MCD)" and I'm not sure who asked that (can't really say from logs but will keep investigating). The logs show that the MCD has been requested to shut down but it was still draining (again not sure who asked for that) and after 600s we've been killed by kubelet and we now have the same bootid cause we didn't reboot.

@cgwalters
Copy link
Copy Markdown
Member

The logs show that the MCD has been requested to shut down but it was still draining (again not sure who asked for that) and after 600s we've been killed by kubelet and we now have the same bootid cause we didn't reboot.

Hmmm. OK, it's making more sense. Maybe it's something like someone manually doing systemctl restart kubelet? It's not the admin doing reboot directly because then we'd reboot right?

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

These are the relevant logs (you can see the pid changed as well for the MCD):

Jul 02 00:05:46 ip-192-168-179-98 root[106368]: machine-config-daemon[98533]: controller syncing started
Jul 02 00:05:46 ip-192-168-179-98 root[106369]: machine-config-daemon[98533]: Starting update from rendered-worker-7f899c2c109ce7eea3cde169e3f51092 to rendered-worker-71095e466f5b906cdbcc9ff8cb04414c
Jul 02 00:05:46 ip-192-168-179-98 root[106371]: machine-config-daemon[98533]: Update prepared; beginning drain
Jul 02 00:22:14 ip-192-168-179-98 root[127189]: machine-config-daemon[127147]: Starting to manage node: ip-192-168-179-98.us-west-1.compute.internal

@cgwalters
Copy link
Copy Markdown
Member

Side note, will conflict some with #935 but oh well.

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

Hmmm. OK, it's making more sense. Maybe it's something like someone manually doing systemctl restart kubelet? It's not the admin doing reboot directly because then we'd reboot right?

could indeed be, or the kubelet shut down alone? is there anyway to check journal for systemctl restart kubelet?

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.431717   95974 kubelet.go:1931] SyncLoop (DELETE, "api"): "machine-config-daemon-cs5dk_openshift-machine-config-operator(2327da1d-8245-11e9-88d5-022e44d1f502)"
Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.431921   95974 kubelet_pods.go:1329] Generating status for "machine-config-daemon-cs5dk_openshift-machine-config-operator(2327da1d-8245-11e9-88d5-022e44d1f502)"
Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.432168   95974 kuberuntime_container.go:559] Killing container "cri-o://e10d9f13a3927c3a239264f4edd81e0a73fbf6de8c10778f0acf2e4f011c384d" with 600 second grace period
Jul 02 00:11:52 ip-192-168-179-98 hyperkube[95974]: I0702 00:11:52.432341   95974 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"machine-config-daemon-cs5dk", UID:"2327da1d-8245-11e9-88d5-022e44d1f502", APIVersion:"v1", ResourceVersion:"19110157", FieldPath:"spec.containers{machine-config-daemon}"}): type: 'Normal' reason: 'Killing' Stopping container machine-config-daemon

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

so there's no sign of admin rebooting - rather, something interesting popped up, we've been asked to shut down after ~300s from starting drain

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

so there's no sign of admin rebooting - rather, something interesting popped up, we've been asked to shut down after ~300s from starting drain

so meanwhile speaking to Ravi and others to understand how to grab useful logs as to why the kubelet wants us to shut down.

Besides....I think this patch does make sense as a stopgap/best effort reconciliation - @cgwalters wdyt?

@cgwalters
Copy link
Copy Markdown
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2019
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread pkg/daemon/daemon.go
@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 18, 2019

/retest

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 19, 2019

/retest

1 similar comment
@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 19, 2019

/retest

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 19, 2019

/retest

3 similar comments
@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 19, 2019

/retest

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 19, 2019

/retest

@runcom
Copy link
Copy Markdown
Member Author

runcom commented Jul 20, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit bf51921 into openshift:master Jul 20, 2019
@runcom runcom deleted the killed-kube-reconciile branch July 20, 2019 09:30
@eparis eparis changed the title pkg/daemon: reconcile being killed by kube on drain+reboot bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot Jul 22, 2019
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@runcom: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.

Details

In response to this:

bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants