fix crash when there are attachments with null network #2819
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2819      +/-   ##
==========================================
+ Coverage   61.94%   61.95%   +0.01%
==========================================
  Files         139      139
  Lines       22120    22122       +2
==========================================
+ Hits        13702    13706       +4
- Misses       6937     6939       +2
+ Partials     1481     1477       -4
thaJeztah
left a comment
lol, went to look through the logic to see if we could safely ignore this case (leading to some PRs I opened 😂)
LGTM - we should probably backport this one to relevant release branches
ping @dperny @anshulpundir PTAL
whoops, needs a rebase @andrey-ko 🤗
// no network matches, then the attachment should be removed.
if attach.Network == nil {
	continue
}
Should arguably be placed before the comment on lines 1052-1054, since that comment is about what follows this.
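To illustrate the suggested ordering, here is a minimal runnable sketch with the nil check placed before the logic the "for every attachment" comment describes. The `Attachment`/`Network` types and the `shouldRemove` helper are simplified assumptions for illustration, not the actual swarmkit code:

```go
package main

import "fmt"

// Simplified stand-ins for the types in the quoted hunk; the names are
// assumptions based on the diff, not the actual swarmkit definitions.
type Network struct{ ID string }

type Attachment struct{ Network *Network }

// shouldRemove mirrors the shape of the fixed loop body: the nil check
// comes first. A nil network means the attachment is skipped entirely
// (the fix's `continue`), so it is not flagged for removal.
func shouldRemove(attach *Attachment, networks map[string]bool) bool {
	if attach.Network == nil {
		return false
	}
	// if the attachment belongs to one of the networks, keep it;
	// if no network matches, the attachment should be removed.
	return !networks[attach.Network.ID]
}

func main() {
	networks := map[string]bool{"net1": true}
	fmt.Println(shouldRemove(&Attachment{Network: &Network{ID: "net1"}}, networks)) // false: network matches
	fmt.Println(shouldRemove(&Attachment{Network: nil}, networks))                  // false: skipped, no crash
	fmt.Println(shouldRemove(&Attachment{Network: &Network{ID: "net2"}}, networks)) // true: no match
}
```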
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches, then the attachment should be removed.
if attach.Network == nil {
How can this happen? When attaching a network that's not working properly (e.g. one that conflicts with another subnet)? Shouldn't we remove it from the node's attachments then?
In any case, seems worth at least a comment :)
yes, that's a great point, but maybe we should just return an error and fail the operation?
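A minimal sketch of that error-returning alternative, using simplified stand-in types (`Attachment`, `Network`, `validateAttachments`) rather than the real swarmkit definitions:

```go
package main

import "fmt"

// Simplified stand-ins; the names are assumptions, not the swarmkit types.
type Network struct{ ID string }

type Attachment struct{ Network *Network }

// validateAttachments is the error-returning alternative: instead of
// silently skipping an attachment with a nil network, surface the
// inconsistent state to the caller so the operation fails loudly.
func validateAttachments(attachments []*Attachment) error {
	for i, attach := range attachments {
		if attach.Network == nil {
			return fmt.Errorf("attachment %d has a nil network", i)
		}
	}
	return nil
}

func main() {
	ok := []*Attachment{{Network: &Network{ID: "net1"}}}
	fmt.Println(validateAttachments(ok)) // <nil>

	broken := []*Attachment{{Network: nil}}
	fmt.Println(validateAttachments(broken)) // attachment 0 has a nil network
}
```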
anshulpundir
left a comment
Looks good after comments addressed and rebased.
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches, then the attachment should be removed.
if attach.Network == nil {
i wonder if this is related to #2764...?
Signed-off-by: Andrey Kolomentsev <andrey.kolomentsev@docker.com>
Doesn't look like it would cause the issue, because in "restoring" mode, it would return before reaching that part: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1041-L1043
In the "non-restore" case, the attachment would be created (if a given network was not yet found in the list of attachments): https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1020-L1028
I do wonder, though, if (in the "restoring" case) we should remove such attachments in (or before) the first loop: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L995-L1007 (i.e., remove all empty attachments first)
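A minimal sketch of what "remove all empty attachments first" could look like, using simplified stand-in types rather than the real swarmkit code:

```go
package main

import "fmt"

// Simplified stand-ins; the names are assumptions, not the swarmkit types.
type Network struct{ ID string }

type Attachment struct{ Network *Network }

// dropEmptyAttachments filters out attachments whose Network is nil,
// reusing the backing array, so later loops can dereference
// attach.Network safely.
func dropEmptyAttachments(attachments []*Attachment) []*Attachment {
	kept := attachments[:0]
	for _, attach := range attachments {
		if attach.Network != nil {
			kept = append(kept, attach)
		}
	}
	return kept
}

func main() {
	attachments := []*Attachment{
		{Network: &Network{ID: "net1"}},
		{Network: nil}, // the empty attachment that triggered the crash
		{Network: &Network{ID: "net2"}},
	}
	for _, attach := range dropEmptyAttachments(attachments) {
		fmt.Println(attach.Network.ID) // safe: no nil networks remain
	}
}
```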
I don't want to merge this without a root cause. The network allocator is absurdly fragile, and making changes without understanding them almost always causes further breakage. I'm looking at the code now.
@andrey-ko Have you reproduced this crash anywhere?
Is it possible that a network was deleted around the same time that the leadership change occurred?
Are we even sure it's the attachment network that's nil, and not the attachment itself? The line includes
I hate the allocator. |
So it definitely happened after a leadership change occurred, but I don't remember if I deleted any network around the same time this event happened.
I'm going to try to reproduce it again, but it may take some time; I don't remember the exact sequence of events that led to this.
one of the reasons to add this check was that there is a similar check in this function a little bit earlier
ping @dperny PTAL
Hi guys, I think I'm having the same issue on my production swarm. Almost every time I deploy/remove a stack I get the same error. First my docker leader would exit, followed by the other managers. However, after waiting a certain period I would be able to successfully deploy all the stacks to the swarm.
docker node ls:
Logs:
fix for a crash that I caught while playing with a swarm cluster.
here is the stack trace of this crash:
the stack trace is very similar to the issue described here:
moby/moby#38287
so in my case I had a swarm cluster with 3 managers, then I demoted 2 of them.
And after this docker started crashing 100% on the manager node.