bug 1732120: pkg/daemon: reconcile being killed by kube on drain+reboot (#952)
Conversation
These seem like distinct cases. If we failed to drain, we should easily be able to detect that.
Failure to reboot is where we get killed by kube.
Hmm, according to the logs, we were draining; it just took too long, so we're stuck in the middle of a drain and can't even get past the Drain function to reboot. That's where we also get SIGKILLed by the kubelet.
A normal failure to drain doesn't trigger the BZ: if Drain returns an error, we just roll everything back and retry. This is about being stuck mid-drain and getting killed for exceeding the 600s grace period.
Dropped WIP; I think this can go through a round of review.
Sorry, can you walk me through this one a bit more? You say
But our drain shouldn't kill the MCD as it's a daemonset, which we ignore, right? Or does that "ignore" mean "try to drain but don't wait for it"?
The issue here is specific to "doing something in the MCD after the MCD was asked to shut down", and I'm not sure who asked for that (can't really tell from the logs, but I'll keep investigating). The logs show that the MCD was requested to shut down while it was still draining (again, not sure who asked for that), and after 600s we were killed by the kubelet; we now have the same boot ID because we never rebooted.
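The boot-ID check above is the key signal: the kernel regenerates its boot ID on every boot, so an unchanged ID proves the reboot never happened. A minimal sketch of that comparison (the function names here are made up for illustration, not the real pkg/daemon code):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// currentBootID reads the kernel's boot ID, a UUID regenerated on every boot
// (Linux-only path).
func currentBootID() (string, error) {
	b, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// rebooted reports whether the node actually rebooted since `recorded` was
// saved: an unchanged boot ID means we were killed before the reboot ran.
func rebooted(recorded, current string) bool {
	return recorded != current
}

func main() {
	fmt.Println(rebooted("same-id", "same-id")) // false: killed before reboot
	fmt.Println(rebooted("old-id", "new-id"))   // true: reboot completed
}
```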
Hmmm. OK, it's making more sense. Maybe it's something like someone manually doing
These are the relevant logs (you can see the PID changed as well for the MCD):
Side note: this will conflict a bit with #935, but oh well.
Could indeed be, or did the kubelet shut down on its own? Is there any way to check the journal for
So there's no sign of an admin rebooting. Rather, something interesting popped up: we were asked to shut down ~300s after starting the drain.
Meanwhile, I'm talking to Ravi and others to understand how to grab useful logs on why the kubelet wants us to shut down. Besides, I think this patch does make sense as a stopgap/best-effort reconciliation. @cgwalters wdyt?
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest |
@runcom: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.
Signed-off-by: Antonio Murdaca runcom@linux.com
- What I did
WIP mainly because I want to test it out on a live cluster by forcing a really low grace period to make the kubelet kill the MCD.
This patch removes the perma-failure we have in case:
The above can be deduced from:
In that case, we can go ahead and re-kick the drain+reboot routine until it eventually succeeds (modulo: as Colin said in https://bugzilla.redhat.com/show_bug.cgi?id=1728873, we're in a permanent situation where we can't really drain forever).
- How to verify it
Testing it manually for now.
- Description for the changelog