Fix for crash on WaitGroup #1857
Conversation
panic: sync: WaitGroup is reused before previous Wait has returned

The panic is due to an EventNotify that arrives in conjunction with a peerAdd or peerDelete: a Wait is followed by an Add. The current WaitGroup logic does not add any further protection; the pMap mutex should be enough. The update operations have to happen inside the pMap lock to be safe, otherwise there is a race between the update operations and new peerAdd and peerDelete calls.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
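For context, a minimal sketch (not libnetwork code) of the misuse pattern described above: one goroutine is still inside Wait while another re-arms the same WaitGroup with Add, which under load can trigger the reuse panic.

```go
package main

import "sync"

// Illustration only: a goroutine is still blocked in Wait while the
// counter is re-armed with Add. Run long enough, this can panic with
// "sync: WaitGroup is reused before previous Wait has returned"
// (or a related WaitGroup misuse panic).
func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		done := make(chan struct{})
		go func() {
			wg.Wait() // concurrent Wait, as in the EventNotify path
			close(done)
		}()
		wg.Done() // counter hits zero, the waiter starts waking up
		wg.Add(1) // re-Add racing the Wait above: possible panic
		wg.Done()
		<-done
	}
}
```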
ping @sanimej @abhinandanpb

LGTM

LGTM
    for _, op := range peerOps {
        op()
    }
    pMap.Unlock()
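The hunk above is the core of the change. A self-contained sketch of the same pattern, with simplified stand-in types (peerMap here is hypothetical, not the actual libnetwork struct): the queued peer operations run while pMap is held, so they cannot interleave with a concurrent peerAdd or peerDelete taking the same lock.

```go
package overlay // hypothetical package name, for illustration only

import "sync"

// peerMap is a simplified stand-in for the driver's pMap: a mutex
// guarding the per-network peer table.
type peerMap struct {
	sync.Mutex
	mp map[string]struct{}
}

// updateSandbox mirrors the pattern in the hunk above: the collected
// operations execute inside the pMap lock, closing the race with new
// peerAdd/peerDelete calls described in the PR description.
func updateSandbox(pMap *peerMap, peerOps []func()) {
	pMap.Lock()
	for _, op := range peerOps {
		op()
	}
	pMap.Unlock()
}
```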
As discussed, holding a lock around the calls that add the vtep neighbor entries needs to be done carefully.
I am not against this change, but it needs some more eyes. @sanimej wdyt?
That's the part I am looking at as well. Usage of the WaitGroup needs to be fixed, but it's better to avoid taking a lock around peerAdd, which is a pretty expensive call with multiple netlink calls to the kernel.
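One way to avoid holding the lock across the expensive calls, sketched below with hypothetical names (program stands in for the netlink-heavy peerAdd work): copy the pending entries while the lock is held, then do the programming after releasing it. Note this is not the PR's approach, and it reopens exactly the race with new peerAdd/peerDelete that the PR is trying to close, which is the tension being discussed here.

```go
package overlay // hypothetical, for illustration only

import "sync"

type peerEntry struct{ vtep string } // simplified stand-in

// programPeers snapshots the peer table under the lock, then runs the
// expensive per-peer programming outside it, so the lock is never held
// across kernel calls.
func programPeers(mu *sync.Mutex, peers map[string]peerEntry, program func(peerEntry) error) {
	mu.Lock()
	snapshot := make([]peerEntry, 0, len(peers))
	for _, p := range peers {
		snapshot = append(snapshot, p)
	}
	mu.Unlock()

	for _, p := range snapshot {
		_ = program(p) // expensive peerAdd-like call, done lock-free
	}
}
```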
@fcrisciani @mavenugo peerDbWg.Add is only in the sandbox init path. In the pre-swarm-mode overlay using serf there is no network-scoped joining of the cluster. So what used to happen was that all the peer events for a network which doesn't have local tasks were stored in the peerDB. When the first container came up on the network, peerDbUpdateSandbox would program all those entries into the overlay network. If we didn't do this, there would be no connectivity to the peers until the next serf push/pull. I think the only reason peerDbWg is needed is to make sure this one-shot peerDbUpdateSandbox is done before we start processing the serf events. For example, a serf event could be a delete for an existing endpoint that we received earlier and stored in the peerDB.
With networkdb in swarm mode, since the network gets recreated when a task runs on the node and receives all the events from the peers, I don't see a strong reason for using the peerDbWg. One option here is to make the peerDbWg calls (Add/Wait/Done) conditional and use them only in non-swarm mode, so that we don't change the flow for non-swarm mode and avoid the panic in swarm mode; a sketch follows below.
For non-swarm mode the panic is still possible, but there is very little usage of it now, and we can take some time to fix it with a different approach (or maybe even deprecate that mode) rather than completely removing that logic.
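A minimal sketch of that option, assuming hypothetical field and method names (swarmMode, waitPeerDb, and friends are illustrative, not the actual libnetwork identifiers): the peerDbWg handshake is kept for the serf path only and becomes a no-op in swarm mode.

```go
package overlay // hypothetical, for illustration only

import "sync"

// driver is a simplified stand-in; only the fields relevant to the
// sketch are shown.
type driver struct {
	swarmMode bool
	peerDbWg  sync.WaitGroup
}

func (d *driver) peerDbAdd() {
	if !d.swarmMode {
		d.peerDbWg.Add(1) // serf path only: sandbox init pending
	}
}

func (d *driver) peerDbDone() {
	if !d.swarmMode {
		d.peerDbWg.Done() // serf path only: one-shot update finished
	}
}

// waitPeerDb would be called before processing serf events; in swarm
// mode it is a no-op, which avoids the Wait/Add reuse panic there.
func (d *driver) waitPeerDb() {
	if !d.swarmMode {
		d.peerDbWg.Wait()
	}
}
```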
@mavenugo On the locking, the scope of the lock is changed only in peerDbUpdateSandbox, which was already serialized using the WaitGroup, so that looks ok. As I mentioned in the earlier comment, it's good to keep the old logic for serf mode until it's deprecated.
Closing this one, we will go with: #1861