Conversation

@fcrisciani

panic: sync: WaitGroup is reused before previous Wait has returned

The panic is due to an EventNotify that arrives in conjunction with
a peerAdd or peerDelete: a Wait is followed by an Add on the same WaitGroup.

The current WaitGroup logic does not add any further protection;
the pMap mutex should be enough.

The update operations have to happen inside the pMap lock to be safe,
else there is a race between the update operations and new peerAdd and peerDelete calls.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
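
A minimal sketch of the pattern the commit message describes (hypothetical names, not libnetwork's actual types): peer-table updates are serialized by the map's own mutex, so no WaitGroup is involved and the Wait/Add race cannot occur.

package main

import (
    "fmt"
    "sync"
)

// peerEntry and peerMap are illustrative stand-ins for the driver's peer table.
type peerEntry struct{ vtep string }

type peerMap struct {
    sync.Mutex
    mp map[string]peerEntry
}

// peerAdd and peerDelete mutate the table only while holding the lock,
// so a concurrent event can never interleave with a half-applied update.
func (p *peerMap) peerAdd(key string, e peerEntry) {
    p.Lock()
    defer p.Unlock()
    p.mp[key] = e
}

func (p *peerMap) peerDelete(key string) {
    p.Lock()
    defer p.Unlock()
    delete(p.mp, key)
}

func main() {
    pm := &peerMap{mp: make(map[string]peerEntry)}
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            key := fmt.Sprintf("peer-%d", i%10)
            pm.peerAdd(key, peerEntry{vtep: "10.0.0.1"})
            pm.peerDelete(key)
        }(i)
    }
    wg.Wait() // used once and never reused, so this WaitGroup is safe
    fmt.Println("remaining entries:", len(pm.mp))
}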

@vieux
Contributor

vieux commented Jul 27, 2017

ping @sanimej @abhinandanpb

@abhi
Contributor

abhi commented Jul 27, 2017

LGTM

@vieux
Contributor

vieux commented Jul 27, 2017

LGTM

for _, op := range peerOps {
    op()
}
pMap.Unlock()

As discussed, holding a lock around the calls that add the vtep neighbor entries needs to be done carefully.
I am not against this change, but it needs some more eyes. @sanimej wdyt?


That's the part I am looking at as well. Usage of the WaitGroup needs to be fixed. But it's better to avoid taking a lock around peerAdd, which is a pretty expensive call with multiple netlink calls to the kernel.
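
A minimal sketch of the alternative being raised here, reusing the hypothetical peerMap type from the sketch above (the updateSandbox name is illustrative only): copy the entries while holding the lock, then do the expensive per-peer programming (netlink calls in the real driver) after releasing it, so concurrent peerAdd/peerDelete calls are not blocked.

// updateSandbox snapshots the table under the lock and programs the
// entries outside it; the snapshot may go stale, so real code would
// still have to reconcile deletes that arrive during programming.
func (p *peerMap) updateSandbox(program func(key string, e peerEntry)) {
    p.Lock()
    snapshot := make(map[string]peerEntry, len(p.mp))
    for k, v := range p.mp {
        snapshot[k] = v
    }
    p.Unlock()

    for k, v := range snapshot {
        program(k, v)
    }
}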


@fcrisciani @mavenugo peerDbWg.Add is only in the sandbox init path. In the pre-swarm mode overlay using serf, there is no network-scoped joining of the cluster. So what used to happen was that all the peer events for a network which doesn't have local tasks are stored in the peerDB. When the first container comes up on the network, peerDbUpdateSandbox would program all those entries into the overlay network. If we didn't do this, there won't be connectivity to the peers until the next serf push/pull. I think the only reason peerDbWg is needed is to make sure this one-shot peerDbUpdateSandbox is done before we start processing the serf events. For example, the serf event could be a delete of an existing endpoint that we received earlier and that is already in the peerDB.

With the networkdb in swarm mode, since the network gets recreated when a task runs on the node and gets all the events from the peers, I don't see a strong reason for using the peerDbWg. One option here is to make the peerDbWg calls (Add/Wait/Done) conditional, only for non-swarm mode, so that we won't change the flow for the non-swarm mode and still avoid the panic for the swarm mode.

For the non-swarm mode the panic is still possible. But there is very little usage of it now, and we can take some time to fix it with a different approach (or maybe even deprecate that mode) rather than completely removing that logic.
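
A minimal sketch of that conditional option, continuing in the same sketch package (the swarmMode flag and helper names are hypothetical; the real driver would carry this state differently): each peerDbWg call is wrapped so only the serf-based, non-swarm path touches the WaitGroup.

// driver is an illustrative stand-in; only the two fields used here are shown.
type driver struct {
    swarmMode bool
    peerDbWg  sync.WaitGroup
}

func (d *driver) peerDbWgAdd(n int) {
    if !d.swarmMode {
        d.peerDbWg.Add(n)
    }
}

func (d *driver) peerDbWgDone() {
    if !d.swarmMode {
        d.peerDbWg.Done()
    }
}

func (d *driver) peerDbWgWait() {
    if !d.swarmMode {
        d.peerDbWg.Wait()
    }
}

In swarm mode the helpers are no-ops, so the Wait/Add race disappears, while serf mode keeps its existing one-shot peerDbUpdateSandbox ordering.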


@mavenugo On the locking, the scope of the lock is changed only in peerDbUpdateSandbox, which was already serialized using the WaitGroup, so that looks OK. As I mentioned in the earlier comment, it's good to keep the old logic for serf mode till it's deprecated.

@fcrisciani
Author

Closing this one; we will go with #1861.

@fcrisciani fcrisciani closed this Jul 31, 2017
@fcrisciani fcrisciani deleted the peeradd-crash branch July 31, 2017 18:48