Conversation

@fcrisciani

panic: sync: WaitGroup is reused before previous Wait has returned

The panic is due to an EventNotify that arrives in conjunction with
a peerAdd or peerDelete: a Wait is followed by an Add on the same WaitGroup.

The current WaitGroup logic does not add any further protection;
the pMap mutex should be enough.

The update operations have to happen inside the pMap lock to be safe,
else there is a race between the update operations and new peerAdd and peerDelete calls.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
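
A minimal sketch of the pattern the commit message describes (hypothetical names, not libnetwork's actual types): peer-table updates are serialized by the map's own mutex, so no WaitGroup is involved and the Wait/Add race cannot occur.

package main

import (
    "fmt"
    "sync"
)

// peerEntry and peerMap are illustrative stand-ins for the driver's peer table.
type peerEntry struct{ vtep string }

type peerMap struct {
    sync.Mutex
    mp map[string]peerEntry
}

// peerAdd and peerDelete mutate the table only while holding the lock,
// so a concurrent event can never interleave with a half-applied update.
func (p *peerMap) peerAdd(key string, e peerEntry) {
    p.Lock()
    defer p.Unlock()
    p.mp[key] = e
}

func (p *peerMap) peerDelete(key string) {
    p.Lock()
    defer p.Unlock()
    delete(p.mp, key)
}

func main() {
    pm := &peerMap{mp: make(map[string]peerEntry)}
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            key := fmt.Sprintf("peer-%d", i%10)
            pm.peerAdd(key, peerEntry{vtep: "10.0.0.1"})
            pm.peerDelete(key)
        }(i)
    }
    wg.Wait() // used once and never reused, so this WaitGroup is safe
    fmt.Println("remaining entries:", len(pm.mp))
}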

@vieux
Contributor

vieux commented Jul 27, 2017

ping @sanimej @abhinandanpb

@abhi
Contributor

abhi commented Jul 27, 2017

LGTM

@vieux
Contributor

vieux commented Jul 27, 2017

LGTM

for _, op := range peerOps {
    op()
}
pMap.Unlock()

As discussed, holding a lock around the calls that add the vtep neighbor entries needs to be done carefully.
I am not against this change, but it needs some more eyes. @sanimej wdyt?


That's the part I am looking at as well. Usage of the WaitGroup needs to be fixed. But it's better to avoid taking a lock around peerAdd, which is a pretty expensive call with multiple netlink calls to the kernel.
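
A minimal sketch of the alternative being raised here, reusing the hypothetical peerMap type from the sketch above (the updateSandbox name is illustrative only): copy the entries while holding the lock, then do the expensive per-peer programming (netlink calls in the real driver) after releasing it, so concurrent peerAdd/peerDelete calls are not blocked.

// updateSandbox snapshots the table under the lock and programs the
// entries outside it; the snapshot may go stale, so real code would
// still have to reconcile deletes that arrive during programming.
func (p *peerMap) updateSandbox(program func(key string, e peerEntry)) {
    p.Lock()
    snapshot := make(map[string]peerEntry, len(p.mp))
    for k, v := range p.mp {
        snapshot[k] = v
    }
    p.Unlock()

    for k, v := range snapshot {
        program(k, v)
    }
}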


@fcrisciani @mavenugo peerDbWg.Add is only in the sandbox init path. In the pre-swarm mode overlay using serf, there is no network-scoped joining of the cluster. So what used to happen was that all the peer events for a network which doesn't have local tasks are stored in the peerDB. When the first container comes up on the network, peerDbUpdateSandbox would program all those entries into the overlay network. If we didn't do this, there won't be connectivity to the peers until the next serf push/pull. I think the only reason peerDbWg is needed is to make sure this one-shot peerDbUpdateSandbox is done before we start processing the serf events. For example, the serf event could be a delete of an existing endpoint that we received earlier and that is already in the peerDB.

With the networkdb in swarm mode, since the network gets recreated when a task runs on the node and gets all the events from the peers, I don't see a strong reason for using the peerDbWg. One option here is to make the peerDbWg calls (Add/Wait/Done) conditional, only for non-swarm mode, so that we won't change the flow for the non-swarm mode and still avoid the panic for the swarm mode.

For the non-swarm mode the panic is still possible. But there is very little usage of it now, and we can take some time to fix it with a different approach (or maybe even deprecate that mode) rather than completely removing that logic.
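
A minimal sketch of that conditional option, continuing in the same sketch package (the swarmMode flag and helper names are hypothetical; the real driver would carry this state differently): each peerDbWg call is wrapped so only the serf-based, non-swarm path touches the WaitGroup.

// driver is an illustrative stand-in; only the two fields used here are shown.
type driver struct {
    swarmMode bool
    peerDbWg  sync.WaitGroup
}

func (d *driver) peerDbWgAdd(n int) {
    if !d.swarmMode {
        d.peerDbWg.Add(n)
    }
}

func (d *driver) peerDbWgDone() {
    if !d.swarmMode {
        d.peerDbWg.Done()
    }
}

func (d *driver) peerDbWgWait() {
    if !d.swarmMode {
        d.peerDbWg.Wait()
    }
}

In swarm mode the helpers are no-ops, so the Wait/Add race disappears, while serf mode keeps its existing one-shot peerDbUpdateSandbox ordering.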


@mavenugo On the locking, the scope of the lock is changed only in peerDbUpdateSandbox, which was already serialized using the WaitGroup, so that looks OK. As I mentioned in the earlier comment, it's good to keep the old logic for serf mode till it's deprecated.

@fcrisciani
Author

Closing this one; we will go with #1861.

@fcrisciani fcrisciani closed this Jul 31, 2017
@fcrisciani fcrisciani deleted the peeradd-crash branch July 31, 2017 18:48