(fix) registry pods do not come up again after node failure #3366
Conversation
So far it looks pretty good =D Should we add any unit tests for the new behaviour, and maybe the test we missed in the original bug fix?
I was scared you'd ask this. :p I did think about it and looked into it, and it looks like it's going to take a bit of effort to write a test for this, mainly because it's not really unit testable, so an e2e test is the only viable option. However, I'm not sure we can mimic a node going down from within our tests. I figured the juice might not be worth the squeeze, but if someone has a better idea (or really insists we include a test for this even if it requires some effort), I'm happy to include one in this PR.
(force-pushed 215fe08 to db34956)
What's the signal we get in the code that a node has gone down? Are some pods unreachable?
We don't really get any signal that a node's gone down; we just discover pod(s) that have been deleted but not removed. And that "discovery" is done here. We could add a test for that, but it's a pretty simple function. The really useful test would have been if we could mimic pods "hanging around" in an e2e setting. In fact, the entire
I may have found a way to e2e test this... working on it now...
Would it be possible to mock those client responses? E.g. list says they are there, get says they aren't?
Okay I have some tests now.
This did not work out. I was thinking about using
I went with the unit tests route, which mocks these interactions. So the entire feature is covered by three sets of tests:
This is the absolute best we can do. e2e isn't possible without much more digging in, and frankly at this point it isn't necessary at all.
(force-pushed 26a24c7 to be108df)
Pivoted the PR to include the logic inside
    continue
}
healthy = false
forceDeletionErrs = append(forceDeletionErrs, pkgerrors.Errorf("found %s in a deleted but not removed state", pod.Name))
Do we need to return an error from this function when we find a wedged pod and successfully delete it? My initial thought is that we would only error from this function if:
- we failed to determine healthy
- we failed to delete wedged pods.
I started out thinking the same, but then realized that would be an "artificial" error. If we include both of the scenarios you mentioned in the definition of "error", we're sort of diluting the definition, which is "something went wrong". In the case of "we failed to determine healthy", we are already sending that signal through the boolean variable anyway. So I decided not to include both, and only include the second scenario of "we failed to delete the wedged pods".
(force-pushed be108df to beeedd5)
    logger.WithFields(logrus.Fields{"pod.namespace": sourceNamespace, "pod.name": pod.GetName()}).Debug("pod is alive")
    continue
}
foundDeadPod = true
This is trickier than I originally thought. If we have:
- at least one alive pod
- the alive pods are otherwise deemed healthy
- we successfully delete the dead pods
What should we return from CheckRegistryServer? Seems like we could say healthy in that case?
Yea, this is a bit confusing. I'm thinking if we detect even one dead pod, we just force delete all the pods and let EnsureRegistryServer recreate everything; otherwise we're in a very non-deterministic state...
That seems like it could have significant performance implications. There are a lot of disk/CPU/memory costs paid when catalog pods are started, so we should minimize that as much as possible.
I think it is okay to force delete the dead pods, and if there are any alive pods, we just move forward as if those were the only pods that were there to begin with.
That's fair too. I was a bit confused about there being multiple pods in the first place, since my impression is that we only create one registry pod per catalog, but it turns out we end up handling multiple pods mainly because we use the lister to list via a label selector. In any case this should return one pod, unless I'm unaware of some feature that allows us to spin up multiple registry pods per catalog.
Changed it back to just deleting dead pods.
There are definitely multiple pods when we're polling. In that case, we spin up a new pod to see what the digest is, and we compare the digests to see if we need to use the new pod (if it has a new digest) or keep using the old pod (when the digests match)
There may also be multiple pods when the pod spec changes (e.g. when the catalog source image is changed).
Right, both are transitory states though, and we're in that state because EnsureRegistryServer has already been called. But yes I think this is in a good state now.
(force-pushed a9b8cd5 to 20a1716)
@joelanford are you good with this?
    return false, nil
}

if deadPodsDetected, e := detectAndDeleteDeadPods(logger, c.OpClient, currentPods, source.GetNamespace()); deadPodsDetected {
I still think we should return true, nil from this function in the case where:
- There is at least one healthy pod
- There are multiple "dead" pods
- We deleted all of the "dead" pods successfully.
That sort of behavior would mean that we would short circuit and avoid calling EnsureRegistryServer, which would:
- end up making some no-op calls to the apiserver
- (maybe, not sure) have some strange issues related to caching, where subsequent calls for `currentPods` would still return the dead pods because the deletion has not propagated back to our informer cache yet (I'm assuming `c.Lister` is backed by a cache; if it is not, then we'd be making more apiserver calls unnecessarily)
Essentially, I think both of the following should result in identical behavior:
- There are 1 or more alive pods, and no dead pods
- There are 1 or more alive pods, and we deleted all of the dead pods (hence: there are no dead pods, so this is actually a variant of (1))
@joelanford thanks for the discussion yesterday. Thought about it a little more and we can do this. I've had to add an additional change though, highlighting it in the next comment...
(force-pushed 20a1716 to c8162e7)
(force-pushed 80cd5ae to 507e0de)
[PR 3201](operator-framework#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` returned is `false`, which in turn triggers `EnsureRegistryServer`.
(force-pushed 507e0de to 406fede)
Description of the change:
PR 3201 attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user.
This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` returned is `false`, which in turn triggers `EnsureRegistryServer`.
Motivation for the change:
Architectural changes:
Testing remarks: